[
https://issues.apache.org/jira/browse/HIVE-14533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15419657#comment-15419657
]
Thomas Friedrich commented on HIVE-14533:
-----------------------------------------
The patch adds a check to enforceMaxLength to only enforce the maxLength if the
string is longer than maxLength. This check can be done without decoding the
string, so it saves the unnecessary decoding of every value.
HiveVarcharWritable: if (value.getLength()>maxLength &&
getCharacterLength()>maxLength)
- value.getLength is the number of bytes of the string
- maxLength is the max number of characters
For single-byte characters, the number of bytes equals the number of
characters. For multi-byte characters, the number of characters is less than
the number of bytes. So if the number of bytes is not larger than maxLength,
the string has at most maxLength characters and we don't have to truncate it.
If the number of bytes is larger than maxLength, we need to compare the
characterLength with the maxLength. We could just compare
getCharacterLength()>maxLength in any case, but getCharacterLength calls
getTextUtfLength, which takes more time than just comparing the byte length
with maxLength.
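To illustrate the two-step check outside of Hive, here is a minimal sketch. It
is not the patch itself: the class VarcharLengthCheck and both method names are
hypothetical, and utf8CharLength is only a stand-in for what getCharacterLength
does, assuming well-formed UTF-8 input.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch; not the actual HiveVarcharWritable code.
final class VarcharLengthCheck {

    // Stand-in for getCharacterLength(): counts code points in well-formed
    // UTF-8 by skipping continuation bytes (bit pattern 10xxxxxx).
    static int utf8CharLength(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    // Cheap byte-length comparison first; the character count is only
    // computed when the byte length already exceeds maxLength.
    static boolean needsTruncation(byte[] utf8, int maxLength) {
        return utf8.length > maxLength && utf8CharLength(utf8) > maxLength;
    }

    public static void main(String[] args) {
        byte[] ascii = "hello".getBytes(StandardCharsets.UTF_8);         // 5 bytes, 5 chars
        byte[] accented = "h\u00e9llo".getBytes(StandardCharsets.UTF_8); // 6 bytes, 5 chars
        byte[] tooLong = "hello world".getBytes(StandardCharsets.UTF_8); // 11 bytes, 11 chars

        System.out.println(needsTruncation(ascii, 10));   // false: byte check alone decides
        System.out.println(needsTruncation(accented, 5)); // false: 6 bytes but only 5 chars
        System.out.println(needsTruncation(tooLong, 5));  // true
    }
}
```

The second call is the case where the byte length alone is not enough: the
value has more bytes than maxLength, but the character count shows that no
truncation is needed.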
HiveCharWritable: if (getCharacterLength()!=maxLength)
For char values, we can only compare the number of characters with the
maxLength, and if it's different we need to call set to enforce the right
length. This ensures the value is padded if the string is too short and
truncated if it is too long. If we were to compare the bytes
(value.getLength()) with maxLength, then it might not enforce the maxLength if
multi-byte characters are involved.
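A small sketch of why the byte length cannot be used for char values. As
before, this is not the Hive code: the class CharLengthCheck and its methods
are hypothetical, and utf8CharLength is a stand-in for getCharacterLength that
assumes well-formed UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-alone sketch; not the actual HiveCharWritable code.
final class CharLengthCheck {

    // Stand-in for getCharacterLength(): counts code points in well-formed
    // UTF-8 by skipping continuation bytes (bit pattern 10xxxxxx).
    static int utf8CharLength(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            if ((b & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    // A char value must be exactly maxLength characters long, so any
    // difference (shorter or longer) means the length has to be enforced.
    static boolean needsEnforcement(byte[] utf8, int maxLength) {
        return utf8CharLength(utf8) != maxLength;
    }

    public static void main(String[] args) {
        // "caf\u00e9" is 4 characters but 5 bytes in UTF-8.
        byte[] cafe = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        int maxLength = 5;

        // A byte comparison would wrongly conclude the value is already
        // the right length and skip the padding.
        System.out.println(cafe.length != maxLength);          // false
        // The character comparison detects that padding is needed.
        System.out.println(needsEnforcement(cafe, maxLength)); // true
    }
}
```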
> improve performance of enforceMaxLength in
> HiveCharWritable/HiveVarcharWritable
> -------------------------------------------------------------------------------
>
> Key: HIVE-14533
> URL: https://issues.apache.org/jira/browse/HIVE-14533
> Project: Hive
> Issue Type: Improvement
> Components: Serializers/Deserializers
> Affects Versions: 1.2.1, 2.1.0
> Reporter: Thomas Friedrich
> Assignee: Thomas Friedrich
> Priority: Minor
> Labels: performance
> Attachments: HIVE-14533.patch
>
>
> The enforceMaxLength method in HiveVarcharWritable calls
> set(getHiveVarchar(), maxLength); and in HiveCharWritable set(getHiveChar(),
> maxLength); no matter how long the string is. The calls to getHiveVarchar()
> and getHiveChar() decode the string every time the method is called
> (Text.toString() calls Text.decode). This can be very expensive and is
> unnecessary if the string is no longer than maxLength for HiveVarcharWritable
> or already exactly maxLength characters long for HiveCharWritable.