[ 
https://issues.apache.org/jira/browse/TRAFODION-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Zhenxing updated TRAFODION-3265:
-----------------------------------
    Description: 
When inserting values to varchar column of UTF8 character set, and the values 
are a mixture of alphabet and Chinese (or other none-ISO8859-1 characters), the 
performance will degrade exponentially according to the number of alphabet 
characters.
{noformat}
>> CREATE TABLE t1 (a VARCHAR(65535) CHARACTER SET UTF8);
>> INSERT INTO t1 VALUES ('aaa...中文');
{noformat}
Here is a simple test result with different number of 'a's in the insert value:
{noformat}
100  'a's                 10 ms
1k   'a's                 100 ms
10k  'a's                 10 seconds
60k  'a's                 160 seconds{noformat}
The reason is because we internally convert UTF8 to UCS2, which means all 'a's 
will become "a\0"s, and then ConstValue::getText() function will try to replace 
all '\0' to "\\0", and thus cause the performance degrade exponentially with 
the number of 'a's in the string.
{code:java}
const NAString ConstValue::getText() const
 {
 if(getType()->getTypeQualifier() == NA_CHARACTER_TYPE)
 {
 ...

// Change imbedded NULL and \377 chars to \0 and \377
 // This comes up in key values quite often.
 size_t index;
 while((index = result.first('\0')) != NA_NPOS
 && index != result.length())
 result(index,1) = "
0";
 while((index = result.first('\377')) != NA_NPOS
 && index != result.length())
 result(index,1) = "
377";

...
 }

{code}

  was:
When inserting values to varchar column of UTF8 character set, and the values 
are a mixture of alphabet and Chinese (or other none-ISO8859-1 characters), the 
performance will degrade exponentially according to the number of alphabet 
characters.
{noformat}
>> CREATE TABLE t1 (a VARCHAR(65535) CHARACTER SET UTF8);
>> INSERT INTO t1 VALUES ('aaa...中文');
{noformat}
Here is a simple test result with different number of 'a's in the insert value:
{noformat}
100  'a's                 10 ms
1k   'a's                 100 ms
10k  'a's                 10 seconds
60k  'a's                 160 seconds{noformat}
The reason is because we internally convert UTF8 to UCS2, which means all 'a's 
will become "a\0"s, and then ConstValue::getText() function will try to replace 
all '\0' to "\\\\
0", and thus cause the performance degrade exponentially with the number of 
'a's in the string.
{code:java}
const NAString ConstValue::getText() const
 {
 if(getType()->getTypeQualifier() == NA_CHARACTER_TYPE)
 {
 ...

// Change imbedded NULL and \377 chars to \0 and \377
 // This comes up in key values quite often.
 size_t index;
 while((index = result.first('\0')) != NA_NPOS
 && index != result.length())
 result(index,1) = "
0";
 while((index = result.first('\377')) != NA_NPOS
 && index != result.length())
 result(index,1) = "
377";

...
 }

{code}


> INSERT values to VARCHAR column of UTF8 charset can be extremely slow
> ---------------------------------------------------------------------
>
>                 Key: TRAFODION-3265
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-3265
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: sql-exe
>    Affects Versions: any
>            Reporter: He Zhenxing
>            Priority: Major
>             Fix For: 2.4
>
>
> When inserting values to varchar column of UTF8 character set, and the values 
> are a mixture of alphabet and Chinese (or other none-ISO8859-1 characters), 
> the performance will degrade exponentially according to the number of 
> alphabet characters.
> {noformat}
> >> CREATE TABLE t1 (a VARCHAR(65535) CHARACTER SET UTF8);
> >> INSERT INTO t1 VALUES ('aaa...中文');
> {noformat}
> Here is a simple test result with different number of 'a's in the insert 
> value:
> {noformat}
> 100  'a's                 10 ms
> 1k   'a's                 100 ms
> 10k  'a's                 10 seconds
> 60k  'a's                 160 seconds{noformat}
> The reason is because we internally convert UTF8 to UCS2, which means all 
> 'a's will become "a\0"s, and then ConstValue::getText() function will try to 
> replace all '\0' to "\\0", and thus cause the performance degrade 
> exponentially with the number of 'a's in the string.
> {code:java}
> const NAString ConstValue::getText() const
>  {
>  if(getType()->getTypeQualifier() == NA_CHARACTER_TYPE)
>  {
>  ...
> // Change imbedded NULL and \377 chars to \0 and \377
>  // This comes up in key values quite often.
>  size_t index;
>  while((index = result.first('\0')) != NA_NPOS
>  && index != result.length())
>  result(index,1) = "
> 0";
>  while((index = result.first('\377')) != NA_NPOS
>  && index != result.length())
>  result(index,1) = "
> 377";
> ...
>  }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to