[ 
https://issues.apache.org/jira/browse/IMPALA-5675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on IMPALA-5675 started by Quanlong Huang.
----------------------------------------------
> Support CHAR/VARCHAR length counted in number of UTF-8 characters, not bytes
> ----------------------------------------------------------------------------
>
>                 Key: IMPALA-5675
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5675
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.7.0
>         Environment: Cloudera distro 5.10.1
>            Reporter: Branislav Lukáč
>            Assignee: Quanlong Huang
>            Priority: Major
>         Attachments: Hive_query.png, Impala_query.png
>
>
> We created an external table with the following query:
> CREATE EXTERNAL TABLE IF NOT EXISTS SAPNSQ.ZAP_GL_EX_IM_CSV ( GLREQUEST 
> DECIMAL(30), KNUMC STRING, FACCP STRING, FCHAR VARCHAR(20), FCLNT VARCHAR(3), 
> FCUKY STRING, FCURR DOUBLE, FDATS STRING, FDEC DECIMAL(8, 2), FFLTP FLOAT, 
> FINT1 TINYINT, FINT2 SMALLINT, FINT4 BIGINT, FLANG STRING, FPREC DOUBLE, 
> FQUAN DOUBLE, FTIMS STRING, FUNIT STRING, FSSTRING STRING, FCHAR40 
> VARCHAR(40) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" STORED AS 
> TEXTFILE LOCATION 
> "hdfs:///user/nsqhdp/H_CDC_IMPQ/ZAP_GL_EX_IM_CSV/63E55F5943E95122E1000000C0A83051"
>  
> CSV files are already present at the specified location:
> hdfs:///user/nsqhdp/H_CDC_IMPQ/ZAP_GL_EX_IM_CSV/63E55F5943E95122E1000000C0A83051
>  
> When we execute SELECT fchar40 FROM sapnsq.zap_gl_ex_im_csv ORDER BY fchar40 
> in both Hive and Impala, we get different results:
> - Hive (see Hive_query.png)
> - Impala (see Impala_query.png)
> It seems that the Impala engine truncates strings that contain non-ASCII 
> characters.
> If a character is encoded with 2 bytes, Impala counts it as 2 characters 
> (instead of 1), so a column declared as FCHAR40 VARCHAR(40) can return fewer 
> than 40 characters.
>  
> Example:
> The first row contains 3 special characters: É, Ï and ü.
> With Impala, the SELECT truncates that result by 3 characters.
> According to the Impala documentation 
> (https://www.cloudera.com/documentation/enterprise/5-7-x/topics/impala_varchar.html),
> Unicode should be supported:
> "All data in CHAR and VARCHAR columns must be in a character encoding that is 
> compatible with UTF-8"
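
The byte-versus-character effect described in the report can be reproduced
outside of either engine. The following is a minimal Python sketch, not
Impala's or Hive's actual implementation: the helper names and the sample
value are hypothetical, and the byte-based helper assumes that a partially
cut trailing character is simply dropped.

```python
def truncate_by_bytes(value: str, limit: int) -> str:
    """Keep at most `limit` UTF-8 bytes (hypothetical byte-counting behaviour).

    A multi-byte character that is cut in half at the boundary is dropped.
    """
    return value.encode("utf-8")[:limit].decode("utf-8", errors="ignore")


def truncate_by_chars(value: str, limit: int) -> str:
    """Keep at most `limit` Unicode characters (character-counting behaviour)."""
    return value[:limit]


# Hypothetical 40-character value containing the three 2-byte characters
# mentioned in the report (É, Ï, ü), padded with ASCII letters.
sample = "ÉÏü" + "a" * 37

chars_total = len(sample)                  # length in characters
bytes_total = len(sample.encode("utf-8"))  # length in UTF-8 bytes

by_bytes = truncate_by_bytes(sample, 40)
by_chars = truncate_by_chars(sample, 40)

print(chars_total, bytes_total, len(by_bytes), len(by_chars))
```

With this sample, the value is 40 characters but 43 bytes long, so the
byte-based cut keeps 37 characters while the character-based cut keeps all
40, which matches the loss of 3 characters described in the report.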



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
