[
https://issues.apache.org/jira/browse/TRAFODION-3265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
He Zhenxing updated TRAFODION-3265:
-----------------------------------
Description:
When inserting values into a VARCHAR column with the UTF8 character set, where
the values mix alphabetic and Chinese (or other non-ISO8859-1) characters,
performance degrades roughly quadratically with the number of alphabetic
characters.
{noformat}
>> CREATE TABLE t1 (a VARCHAR(65535) CHARACTER SET UTF8);
>> INSERT INTO t1 VALUES ('aaa...中文');
{noformat}
Here is a simple test result with different numbers of 'a's in the inserted value:
{noformat}
100 'a's     10 ms
 1k 'a's    100 ms
10k 'a's     10 seconds
60k 'a's    160 seconds
{noformat}
The reason is that we internally convert UTF8 to UCS2, so every 'a' becomes
"a\0". ConstValue::getText() then replaces each embedded '\0' with "\\0", and
because every replacement rescans the string from the beginning, the total cost
grows quadratically with the number of 'a's in the string.
{code:cpp}
const NAString ConstValue::getText() const
{
  if (getType()->getTypeQualifier() == NA_CHARACTER_TYPE)
  {
    ...
    // Change imbedded NULL and \377 chars to \0 and \377
    // This comes up in key values quite often.
    size_t index;
    while ((index = result.first('\0')) != NA_NPOS
           && index != result.length())
      result(index,1) = "\\0";
    while ((index = result.first('\377')) != NA_NPOS
           && index != result.length())
      result(index,1) = "\\377";
    ...
  }
}
{code}
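To make the cost concrete, here is a minimal standalone sketch. It is not Trafodion code: it uses std::string instead of NAString, and the helper names escapeNulsQuadratic/escapeNulsLinear are made up for illustration. It contrasts the replace-and-rescan-from-the-start pattern above with a single-pass rewrite:
{code:cpp}
#include <chrono>
#include <cstdio>
#include <string>

// Quadratic pattern, mirroring the getText() loops above: every find()
// restarts at position 0, and every replace() shifts the tail of the
// string, so n embedded NULs cost O(n^2) work.
static std::string escapeNulsQuadratic(std::string s)
{
  size_t index;
  while ((index = s.find('\0')) != std::string::npos)
    s.replace(index, 1, "\\0");
  return s;
}

// Linear alternative: a single pass that appends to an output buffer.
static std::string escapeNulsLinear(const std::string &s)
{
  std::string out;
  out.reserve(s.size() * 2);
  for (unsigned char c : s)
  {
    if (c == '\0')
      out += "\\0";
    else if (c == 0xFF)
      out += "\\377";
    else
      out += (char)c;
  }
  return out;
}

int main()
{
  // 'a' in UCS2 (little-endian) is the byte pair 0x61 0x00, so a long run
  // of ASCII letters becomes a long run of embedded NULs after conversion.
  std::string ucs2;
  for (int i = 0; i < 60000; i++)   // same scale as the 60k-'a' case above
  {
    ucs2 += 'a';
    ucs2 += '\0';
  }

  auto t0 = std::chrono::steady_clock::now();
  std::string slow = escapeNulsQuadratic(ucs2);
  auto t1 = std::chrono::steady_clock::now();
  std::string fast = escapeNulsLinear(ucs2);
  auto t2 = std::chrono::steady_clock::now();

  auto ms = [](auto d) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
  };
  printf("quadratic: %lld ms, linear: %lld ms, outputs equal: %d\n",
         (long long)ms(t1 - t0), (long long)ms(t2 - t1), (int)(slow == fast));
  return 0;
}
{code}
The single-pass version does a fixed amount of work per character, so its cost no longer depends on how many ASCII characters precede the multi-byte ones.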
> INSERT values to VARCHAR column of UTF8 charset can be extremely slow
> ---------------------------------------------------------------------
>
> Key: TRAFODION-3265
> URL: https://issues.apache.org/jira/browse/TRAFODION-3265
> Project: Apache Trafodion
> Issue Type: Bug
> Components: sql-exe
> Affects Versions: any
> Reporter: He Zhenxing
> Priority: Major
> Fix For: 2.4
>
>
> When inserting values into a VARCHAR column with the UTF8 character set, where
> the values mix alphabetic and Chinese (or other non-ISO8859-1) characters,
> performance degrades roughly quadratically with the number of alphabetic
> characters.
> {noformat}
> >> CREATE TABLE t1 (a VARCHAR(65535) CHARACTER SET UTF8);
> >> INSERT INTO t1 VALUES ('aaa...中文');
> {noformat}
> Here is a simple test result with different numbers of 'a's in the inserted
> value:
> {noformat}
> 100 'a's     10 ms
>  1k 'a's    100 ms
> 10k 'a's     10 seconds
> 60k 'a's    160 seconds
> {noformat}
> The reason is that we internally convert UTF8 to UCS2, so every 'a' becomes
> "a\0". ConstValue::getText() then replaces each embedded '\0' with "\\0", and
> because every replacement rescans the string from the beginning, the total
> cost grows quadratically with the number of 'a's in the string.
> {code:cpp}
> const NAString ConstValue::getText() const
> {
>   if (getType()->getTypeQualifier() == NA_CHARACTER_TYPE)
>   {
>     ...
>     // Change imbedded NULL and \377 chars to \0 and \377
>     // This comes up in key values quite often.
>     size_t index;
>     while ((index = result.first('\0')) != NA_NPOS
>            && index != result.length())
>       result(index,1) = "\\0";
>     while ((index = result.first('\377')) != NA_NPOS
>            && index != result.length())
>       result(index,1) = "\\377";
>     ...
>   }
> }
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)