[ 
https://issues.apache.org/jira/browse/HIVE-14533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419657#comment-15419657
 ] 

Thomas Friedrich commented on HIVE-14533:
-----------------------------------------

The patch adds a check to enforceMaxLength so that maxLength is only enforced 
when the value actually needs to change (for varchar, when the string is longer 
than maxLength). For varchar, this check can usually be done without decoding 
the string, so it avoids decoding every value unnecessarily.

HiveVarcharWritable:
    if (value.getLength() > maxLength && getCharacterLength() > maxLength)
- value.getLength() is the number of bytes in the string
- maxLength is the maximum number of characters
For single-byte characters, the number of bytes equals the number of 
characters. For multi-byte characters, the number of characters is less than 
the number of bytes. If the number of bytes is not larger than maxLength, then 
the string has at most maxLength characters and we don't have to truncate it. 
If the number of bytes is larger than maxLength, we need to compare the 
character length with maxLength. We could just compare 
getCharacterLength()>maxLength in every case, but getCharacterLength calls 
getTextUtfLength, which takes more time than just comparing the byte length 
with maxLength.
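
The two-step varchar check can be sketched in plain Java without any Hive or 
Hadoop classes. This is only an illustration: the class VarcharLengthCheck and 
its methods are hypothetical names, and characterLength counts UTF-8 code 
points by skipping continuation bytes, which is an assumption about what the 
character count means rather than a copy of getTextUtfLength.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the varchar check; not the actual Hive class.
final class VarcharLengthCheck {

    // Counts UTF-8 code points by skipping continuation bytes (10xxxxxx).
    // No String is built, so nothing is decoded.
    static int characterLength(byte[] utf8, int length) {
        int count = 0;
        for (int i = 0; i < length; i++) {
            if ((utf8[i] & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    // Cheap byte comparison first: a string of at most maxLength bytes
    // cannot contain more than maxLength characters. The character count
    // is only computed when the byte length exceeds maxLength.
    static boolean needsTruncation(byte[] utf8, int length, int maxLength) {
        return length > maxLength && characterLength(utf8, length) > maxLength;
    }

    public static void main(String[] args) {
        byte[] ascii = "hello".getBytes(StandardCharsets.UTF_8);
        // Three two-byte characters: 6 bytes, 3 characters.
        byte[] umlauts = "\u00e4\u00f6\u00fc".getBytes(StandardCharsets.UTF_8);

        System.out.println(needsTruncation(ascii, ascii.length, 10));
        System.out.println(needsTruncation(umlauts, umlauts.length, 4));
        System.out.println(needsTruncation(umlauts, umlauts.length, 2));
    }
}
```

In the second call the byte length (6) exceeds maxLength (4), so the character 
count is consulted, and it shows that no truncation is needed.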

HiveCharWritable: if (getCharacterLength()!=maxLength)
For char values, we can only compare the number of characters with 
maxLength; if they differ, we need to call set to enforce the right length. 
This ensures the value is padded if the string is too short and truncated if 
it is too long. If we were to compare the bytes (value.getLength()) with 
maxLength, the check might not enforce maxLength when multi-byte characters 
are involved.
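
A small Java sketch shows why a byte comparison is not enough for char 
values. The class CharLengthCheck and its methods are hypothetical names used 
for illustration only, and the character count again assumes UTF-8 code points 
counted by skipping continuation bytes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the char check; not the actual Hive class.
final class CharLengthCheck {

    // Counts UTF-8 code points by skipping continuation bytes (10xxxxxx).
    static int characterLength(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    // Character-based check: any difference means the value must be
    // padded (too short) or truncated (too long).
    static boolean needsEnforcement(byte[] utf8, int maxLength) {
        return characterLength(utf8) != maxLength;
    }

    // Byte-based check, shown only to demonstrate how it can go wrong.
    static boolean byteBasedCheck(byte[] utf8, int maxLength) {
        return utf8.length != maxLength;
    }

    public static void main(String[] args) {
        // Three two-byte characters: 6 bytes, 3 characters.
        byte[] umlauts = "\u00e4\u00f6\u00fc".getBytes(StandardCharsets.UTF_8);
        int maxLength = 6;

        // The byte length happens to equal maxLength, so the byte-based
        // check would skip enforcement even though padding is required.
        System.out.println(byteBasedCheck(umlauts, maxLength));
        System.out.println(needsEnforcement(umlauts, maxLength));
    }
}
```

Here the value has 3 characters but 6 bytes, so with maxLength 6 the 
byte-based check reports nothing to do while the character-based check 
correctly reports that the value still has to be padded.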



> improve performance of enforceMaxLength in 
> HiveCharWritable/HiveVarcharWritable
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-14533
>                 URL: https://issues.apache.org/jira/browse/HIVE-14533
>             Project: Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>    Affects Versions: 1.2.1, 2.1.0
>            Reporter: Thomas Friedrich
>            Assignee: Thomas Friedrich
>            Priority: Minor
>              Labels: performance
>         Attachments: HIVE-14533.patch
>
>
> The enforceMaxLength method in HiveVarcharWritable calls 
> set(getHiveVarchar(), maxLength); and in HiveCharWritable set(getHiveChar(), 
> maxLength); no matter how long the string is. The calls to getHiveVarchar() 
> and getHiveChar() decode the string every time the method is called 
> (Text.toString() calls Text.decode). This can be very expensive and is 
> unnecessary if the string is no longer than maxLength for 
> HiveVarcharWritable or exactly maxLength characters long for 
> HiveCharWritable.



