[ 
https://issues.apache.org/jira/browse/TRAFODION-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

He Zhenxing updated TRAFODION-3265:
-----------------------------------
    Description: 
When inserting values into a VARCHAR column with the UTF8 character set, where the values mix ASCII letters with Chinese (or other non-ISO8859-1) characters, performance degrades quadratically with the number of ASCII characters.

{noformat}
>> CREATE TABLE t1 (a VARCHAR(65535) CHARACTER SET UTF8);
>> INSERT INTO t1 VALUES ('aaa...中文');
{noformat}

Here is a simple test result with different numbers of 'a's in the inserted value:

{noformat}
100 'a's     10 ms
1k   'a's    100 ms
10k  'a's    10 seconds
60k  'a's    160 seconds
{noformat}

The reason is that we internally convert UTF8 to UCS2, so every 'a' becomes "a\0". The ConstValue::getText() function then replaces each embedded '\0' with the literal two-character string "\0", and since every replacement restarts the search from the beginning of the string, the cost grows quadratically with the number of 'a's in the string.


{code:c++}
const NAString ConstValue::getText() const
{
  if (getType()->getTypeQualifier() == NA_CHARACTER_TYPE)
  {
    ...

    // Change imbedded NULL and \377 chars to \0 and \377
    // This comes up in key values quite often.
    size_t index;
    while ((index = result.first('\0')) != NA_NPOS
           && index != result.length())
      result(index, 1) = "\\0";
    while ((index = result.first('\377')) != NA_NPOS
           && index != result.length())
      result(index, 1) = "\\377";

    ...
  }
}
{code}
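The quadratic behavior can be reproduced outside of Trafodion. Below is a minimal sketch in standard C++ (using std::string in place of NAString; the function names are hypothetical): escapeQuadratic mirrors the loop above, rescanning from offset 0 after every replacement, while escapeLinear builds the output in a single pass, which would be the natural shape of a fix.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Mirrors the loop in ConstValue::getText(): every replacement restarts
// the search at offset 0, so a mostly-NUL string of length n (the UCS2
// form of ASCII text) costs O(n) scans of O(n) chars each -- O(n^2).
std::string escapeQuadratic(std::string result) {
  std::size_t index;
  while ((index = result.find('\0')) != std::string::npos)
    result.replace(index, 1, "\\0");   // "\\0" contains no NUL, so this terminates
  while ((index = result.find('\377')) != std::string::npos)
    result.replace(index, 1, "\\377");
  return result;
}

// Single-pass alternative: one scan of the input, appending either the
// character itself or its printable escape -- O(n) overall.
std::string escapeLinear(const std::string& s) {
  std::string out;
  out.reserve(s.size());
  for (char c : s) {
    if (c == '\0')        out += "\\0";
    else if (c == '\377') out += "\\377";
    else                  out += c;
  }
  return out;
}
```

Fed the UCS2 form of a mostly-ASCII literal (every other byte '\0'), escapeLinear stays linear while escapeQuadratic shows the same quadratic growth as the timings above.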


> INSERT values to VARCHAR column of UTF8 charset can be extremely slow
> ---------------------------------------------------------------------
>
>                 Key: TRAFODION-3265
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-3265
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: sql-exe
>    Affects Versions: any
>            Reporter: He Zhenxing
>            Priority: Major
>             Fix For: 2.4
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
