[
https://issues.apache.org/jira/browse/HIVE-11112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601716#comment-14601716
]
Yongzhi Chen commented on HIVE-11112:
-------------------------------------
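The byte array backing org.apache.hadoop.io.Text can be longer than its
contents: getBytes() returns the whole buffer, and only the first getLength()
bytes are valid. SerDeUtils.transformTextToUTF8 passes the whole array to the
String constructor, so stale bytes left over from an earlier, longer row are
decoded too. That is the unclean buffer issue.
Fix it by using the String constructor that takes an offset and a length.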
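A minimal sketch of the failure mode and the fix, for illustration only. It
does not use Hadoop: ReusableBuffer is a hypothetical stand-in that mimics how
a reused Text keeps its backing array when a shorter value is set, and the
class and method names (ReusedBufferDemo, decodeWholeArray, decodeValidRange)
are assumptions, not Hive code.

```java
import java.nio.charset.StandardCharsets;

class ReusedBufferDemo {
    // Hypothetical stand-in for a reusable buffer such as org.apache.hadoop.io.Text:
    // the backing array is kept across set() calls and may be longer than the data.
    static final class ReusableBuffer {
        private byte[] bytes = new byte[0];
        private int length = 0;

        void set(byte[] src) {
            if (src.length > bytes.length) {
                bytes = new byte[src.length]; // grow only when needed
            }
            System.arraycopy(src, 0, bytes, 0, src.length);
            length = src.length;              // bytes past 'length' are stale
        }

        byte[] getBytes() { return bytes; }
        int getLength() { return length; }
    }

    // Buggy pattern: decodes the whole backing array, stale tail included.
    static String decodeWholeArray(ReusableBuffer b) {
        return new String(b.getBytes(), StandardCharsets.ISO_8859_1);
    }

    // Fixed pattern: decodes only the first getLength() bytes.
    static String decodeValidRange(ReusableBuffer b) {
        return new String(b.getBytes(), 0, b.getLength(), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        ReusableBuffer buf = new ReusableBuffer();
        // Unicode escapes keep the source independent of the file encoding:
        // "Jørgensen,Jørgen" followed by the shorter "Peña,Andrés".
        buf.set("J\u00f8rgensen,J\u00f8rgen".getBytes(StandardCharsets.ISO_8859_1));
        buf.set("Pe\u00f1a,Andr\u00e9s".getBytes(StandardCharsets.ISO_8859_1));

        System.out.println(decodeWholeArray(buf));
        System.out.println(decodeValidRange(buf));
    }
}
```

With the two rows from the report, decodeWholeArray yields "Peña,Andrésørgen"
(the third output row below) and decodeValidRange yields "Peña,Andrés".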
> ISO-8859-1 text output has fragments of previous longer rows appended
> ---------------------------------------------------------------------
>
> Key: HIVE-11112
> URL: https://issues.apache.org/jira/browse/HIVE-11112
> Project: Hive
> Issue Type: Bug
> Components: Serializers/Deserializers
> Affects Versions: 1.2.0
> Reporter: Yongzhi Chen
> Assignee: Yongzhi Chen
>
> If a LazySimpleSerDe table is created using ISO 8859-1 encoding, query
> results for a string column are incorrect for any row that was preceded by a
> row containing a longer string.
> Example steps to reproduce:
> 1. Create a table using ISO 8859-1 encoding:
> CREATE TABLE person_lat1 (name STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH
> SERDEPROPERTIES ('serialization.encoding'='ISO8859_1');
> 2. Copy an ISO-8859-1 encoded text file into the appropriate warehouse folder
> in HDFS. I'll attach an example file containing the following text:
> Müller,Thomas
> Jørgensen,Jørgen
> Peña,Andrés
> Nåm,Fæk
> 3. Execute SELECT * FROM person_lat1
> Result - The following output appears:
> +-------------------+--+
> | person_lat1.name |
> +-------------------+--+
> | Müller,Thomas |
> | Jørgensen,Jørgen |
> | Peña,Andrésørgen |
> | Nåm,Fækdrésørgen |
> +-------------------+--+