[
https://issues.apache.org/jira/browse/IMPALA-5675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Work on IMPALA-5675 started by Quanlong Huang.
----------------------------------------------
> Support CHAR/VARCHAR length counted in number of UTF-8 characters, not bytes
> ----------------------------------------------------------------------------
>
> Key: IMPALA-5675
> URL: https://issues.apache.org/jira/browse/IMPALA-5675
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Affects Versions: Impala 2.7.0
> Environment: Cloudera distro 5.10.1
> Reporter: Branislav Lukáč
> Assignee: Quanlong Huang
> Priority: Major
> Attachments: Hive_query.png, Impala_query.png
>
>
> We created an external table with the following statement:
> CREATE EXTERNAL TABLE IF NOT EXISTS SAPNSQ.ZAP_GL_EX_IM_CSV (
>   GLREQUEST DECIMAL(30), KNUMC STRING, FACCP STRING, FCHAR VARCHAR(20),
>   FCLNT VARCHAR(3), FCUKY STRING, FCURR DOUBLE, FDATS STRING,
>   FDEC DECIMAL(8, 2), FFLTP FLOAT, FINT1 TINYINT, FINT2 SMALLINT,
>   FINT4 BIGINT, FLANG STRING, FPREC DOUBLE, FQUAN DOUBLE, FTIMS STRING,
>   FUNIT STRING, FSSTRING STRING, FCHAR40 VARCHAR(40)
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
> STORED AS TEXTFILE
> LOCATION "hdfs:///user/nsqhdp/H_CDC_IMPQ/ZAP_GL_EX_IM_CSV/63E55F5943E95122E1000000C0A83051"
>
> The CSV files are already present at the specified location:
> hdfs:///user/nsqhdp/H_CDC_IMPQ/ZAP_GL_EX_IM_CSV/63E55F5943E95122E1000000C0A83051
>
> When we execute SELECT fchar40 FROM sapnsq.zap_gl_ex_im_csv ORDER BY fchar40
> in both Hive and Impala, we get different results:
> - Hive (see Hive_query.png)
> - Impala (see Impala_query.png)
> It seems that the Impala engine truncates strings that contain non-ASCII
> characters.
> If a character is encoded in 2 bytes, Impala counts it as 2 characters
> (instead of 1).
> As a result, the FCHAR40 VARCHAR(40) column can return fewer than 40
> characters.
>
> Example:
> The 1st row contains 3 special characters: É, Ï and ü.
> The same SELECT in Impala truncates the result by 3 characters.
> According to the Impala documentation
> (https://www.cloudera.com/documentation/enterprise/5-7-x/topics/impala_varchar.html),
> Unicode should be supported:
> "All data in CHAR and VARCHAR columns must be in a character encoding that is
> compatible with UTF-8"
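
The discrepancy described in the report can be reproduced outside of either engine. The following is only a minimal Python sketch of the two length semantics, not Impala's or Hive's actual implementation: truncate_bytes and truncate_chars are hypothetical helper names, the sample value is made up, and dropping a partially cut trailing character is an assumption made so the result stays valid UTF-8.

```python
def truncate_bytes(value: str, max_len: int) -> str:
    """Truncate to at most max_len UTF-8 bytes (byte-counted length)."""
    raw = value.encode("utf-8")[:max_len]
    # Assumption for this sketch: if the cut lands inside a multi-byte
    # character, drop the incomplete trailing bytes.
    return raw.decode("utf-8", errors="ignore")


def truncate_chars(value: str, max_len: int) -> str:
    """Truncate to at most max_len Unicode code points (character-counted length)."""
    return value[:max_len]


# Made-up sample: 40 characters, 3 of which (É, Ï, ü) take 2 bytes each in UTF-8.
value = "\u00c9\u00cf\u00fc" + "a" * 37

by_bytes = truncate_bytes(value, 40)
by_chars = truncate_chars(value, 40)

print(len(value), len(value.encode("utf-8")))  # characters vs. bytes in the input
print(len(by_bytes), len(by_chars))            # characters kept under each rule
```

With this sample, the byte-counted variant keeps three characters fewer than the character-counted one, which matches the symptom reported for the first row.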