[ 
https://issues.apache.org/jira/browse/HIVE-28728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simhadri Govindappa updated HIVE-28728:
---------------------------------------
    Description: 
Chinese characters are garbled when an INSERT OVERWRITE query uses the 
STR_TO_MAP() function.

Repro steps:
1. Text data file
{code:java}
100 hive
200 spark
300 oozie
400 airflow
500 优惠活动
{code}
{{2. Create table on top of it}}


{code:java}
CREATE external TABLE t1(
id string,
name string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 's3://prmsingh-hive/garbled/rawdata/';

insert into table test1 values ('2010-01-01', '优惠活动');
{code}


{{3.}}
{code:java}
select STR_TO_MAP(concat(id,":",name),',',':') from t7;
OK
{"100":"hive"}
{"200":"spark"}
{"300":"oozie"}
{"400":"airflow"}
{"500":"优惠活动"}
{code}

{{4.}}
{code:java}
create external table result3
(cd MAP<STRING, STRING>)
location 's3://prmsingh-hive/garbled/result3/';

insert overwrite table result3 select 
STR_TO_MAP(concat(id,":",name),',',':') from t7;

hive> select * from result3;
OK
{"100":"hive"}
{"200":"spark"}
{"300":"oozie"}
{"400":"airflow"}
{"500":"????"}
{code}
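
One possible way to tell whether the question marks were actually written to the table, rather than introduced when the client displays the row, is to look at the stored bytes. The query below is only a diagnostic sketch and was not part of the original report; it assumes the map key '500' from the sample data and uses Hive's built-in hex() function.
{code:sql}
-- Diagnostic sketch (an assumption, not from the original report):
-- print the bytes stored for the affected map value as hex.
select hex(cd['500']) from result3;
{code}
If this prints 3F3F3F3F, four literal '?' bytes were stored. A correctly stored UTF-8 value for these four characters would be 12 bytes long (E4BC98E683A0E6B4BBE58AA8).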

 

However, when I create the table and insert the data with vectorization disabled, 
the result is correct.
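
The comparison described above could be run roughly as follows. This is a minimal sketch that assumes vectorization is switched off per session through the standard hive.vectorized.execution.enabled property; the report does not say how it was disabled.
{code:sql}
-- Sketch: turn vectorized execution off for this session only
-- (assumed to be how vectorization was disabled).
set hive.vectorized.execution.enabled=false;

-- Re-run the same statement from step 4.
insert overwrite table result3
select STR_TO_MAP(concat(id,":",name),',',':') from t7;

-- Re-check the stored rows.
select * from result3;
{code}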

  was:
Chinese characters turn to garbled characters on using INSERT OVERWRITE query 
and using STR_TO_MAP() function

Repro steps:
1. Text data file

{{{{100 hive
200 spark
300 oozie
400 airflow
500 优惠活动}} }}

{{2. Create table on top of it}}

CREATE external TABLE t1(
id string,
name string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION 's3://prmsingh-hive/garbled/rawdata/';



{{ insert into table test1 values ('2010-01-01', '优惠活动');}}

{{3.}}

 

{{{{select STR_TO_MAP(concat(id,":",name),',',':') from t7;}}}}
{{{{OK
\{"100":"hive"}
\{"200":"spark"}
\{"300":"oozie"}
\{"400":"airflow"}
\{"500":"优惠活动"}}}}}

{{4.}}
{{}}

 

{{{{create external table result3
(cd MAP<STRING, STRING>)
location 's3://prmsingh-hive/garbled/result3/';}}}}

{{{{insert overwrite table result3 select 
STR_TO_MAP(concat(id,":",name),',',':') from t7;}}}}

{{{{hive> select * from result3;}}}}

{{{{OK
\{"100":"hive"}
\{"200":"spark"}
\{"300":"oozie"}
\{"400":"airflow"}
\{"500":"????"}}}}}

 

But when I create the table and insert the data when vectorization is disabled. 
Then the result is fine


> In INSERT OVERWRITE queries, STR_TO_MAP() UDF is not using UTF-8 encoding 
> properly resulting in garbled characters
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-28728
>                 URL: https://issues.apache.org/jira/browse/HIVE-28728
>             Project: Hive
>          Issue Type: Bug
>          Components: Vectorization
>    Affects Versions: 4.0.0, 4.0.1
>            Reporter: Paramvir Singh
>            Assignee: Paramvir Singh
>            Priority: Major
>
> Chinese characters are garbled when an INSERT OVERWRITE query uses the 
> STR_TO_MAP() function.
> Repro steps:
> 1. Text data file
> {code:java}
> 100 hive
> 200 spark
> 300 oozie
> 400 airflow
> 500 优惠活动
> {code}
> {{2. Create table on top of it}}
> {code:java}
> CREATE external TABLE t1(
> id string,
> name string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ' '
> STORED AS TEXTFILE
> LOCATION 's3://prmsingh-hive/garbled/rawdata/';
> insert into table test1 values ('2010-01-01', '优惠活动');
> {code}
> {{3.}}
> {code:java}
> select STR_TO_MAP(concat(id,":",name),',',':') from t7;
> OK
> {"100":"hive"}
> {"200":"spark"}
> {"300":"oozie"}
> {"400":"airflow"}
> {"500":"优惠活动"}
> {code}
> {{4.}}
> {code:java}
> create external table result3
> (cd MAP<STRING, STRING>)
> location 's3://prmsingh-hive/garbled/result3/';
>
> insert overwrite table result3 select 
> STR_TO_MAP(concat(id,":",name),',',':') from t7;
>
> hive> select * from result3;
> OK
> {"100":"hive"}
> {"200":"spark"}
> {"300":"oozie"}
> {"400":"airflow"}
> {"500":"????"}
> {code}
>  
> However, when I create the table and insert the data with vectorization 
> disabled, the result is correct.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
