[jira] [Commented] (IMPALA-5675) Support CHAR/VARCHAR length counted in number of UTF-8 characters, not bytes

ASF subversion and git services (Jira) Wed, 01 Apr 2020 09:16:26 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-5675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072922#comment-17072922
 ]


ASF subversion and git services commented on IMPALA-5675:
---------------------------------------------------------

Commit 2576952655d8e252943379dd4dbcdd0315e457c5 in impala's branch 
refs/heads/master from Attila Bukor
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=2576952 ]

IMPALA-5092 Add support for VARCHAR in Kudu tables

KUDU-1938 added VARCHAR column type support to Kudu.
This commit adds support for Kudu's VARCHAR type to Impala.

The length of a Kudu VARCHAR is applied as a character length, rather
than as a byte length, which is what Impala currently uses.

When writing data to Kudu, the VARCHAR length is not an issue because
Impala only officially supports ASCII characters and those characters are
the same size in bytes and characters. Additionally, extra bytes would be
truncated by the Kudu client if somehow a value was too long.

When reading data from Kudu, it is possible that the value written by
some other application is wider in bytes than Impala expects and can
handle. This can happen due to multi-byte UTF-8 characters. In that
case, we adjust the length in Impala to truncate the extra bytes of the
value. This isn’t a great solution, but other integrations have taken the
same approach, given that Impala doesn’t support UTF-8 values.
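
As a rough illustration of the read-path behaviour described above (not
Impala's actual code; the function name is hypothetical), the Python sketch
below clips a UTF-8 value to a byte limit. It shows why this is only a
stopgap: the cut can fall inside a multi-byte character.

```python
def truncate_to_byte_limit(value: bytes, max_bytes: int) -> bytes:
    """Keep at most max_bytes bytes; may cut a multi-byte character in half."""
    return value[:max_bytes]


# "né" is 2 characters but 3 bytes in UTF-8 ("é" encodes as 0xC3 0xA9).
stored = "né".encode("utf-8")
clipped = truncate_to_byte_limit(stored, 2)

# The cut falls inside "é", so the clipped bytes are no longer valid UTF-8.
# errors="replace" substitutes U+FFFD for the dangling lead byte.
as_text = clipped.decode("utf-8", errors="replace")
```

A character-aware limit would instead count code points, which is the
behaviour IMPALA-5675 asks for.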

IMPALA-5675 tracks adding UTF-8 character length support to VARCHAR
columns; the truncation code is marked with a TODO that references
that Jira.

Testing:
* Performed manual testing of standard DDL and DML interaction
* Manually reproduced a check failure due to multi-byte characters
  and tested that length truncation resolves that issue.
* Added/adjusted the following automated tests:
** AnalyzeDDLTest: CTAS into Kudu with varchar type
** AnalyzeKuduDDLTest: CREATE TABLE in Kudu with VARCHAR type
** kudu_create.test: Create table with VARCHAR column, key, hash
   partition, and range partition
** kudu_describe.test: Describe table with VARCHAR column and key
** kudu_insert.test: Insert with VARCHAR columns including null and
   non-null defaults
** kudu_update.test: Updates with VARCHAR column
** kudu_upsert.test: Upserts with VARCHAR column
** kudu_delete.test: Deletes with VARCHAR columns
** kudu-scan-node.test: Tests basic predicates with VARCHAR columns

Follow on work:
- IMPALA-9580: Add min-max runtime filter support/tests
- IMPALA-9581: Pushdown string predicates
- IMPALA-9583: Automated multibyte truncation tests

Change-Id: I0d4959410fdd882bfa980cb55e8a7837c7823da8
Reviewed-on: http://gerrit.cloudera.org:8080/14197
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Thomas Tauber-Marshall <[email protected]>


> Support CHAR/VARCHAR length counted in number of UTF-8 characters, not bytes
> ----------------------------------------------------------------------------
>
>                 Key: IMPALA-5675
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5675
>             Project: IMPALA
>          Issue Type: Bug
>    Affects Versions: Impala 2.7.0
>         Environment: Cloudera distro 5.10.1
>            Reporter: Branislav Lukáč
>            Priority: Major
>         Attachments: Hive_query.png, Impala_query.png
>
>
> We have created an external table with the following query:
> CREATE EXTERNAL TABLE IF NOT EXISTS SAPNSQ.ZAP_GL_EX_IM_CSV ( GLREQUEST 
> DECIMAL(30), KNUMC STRING, FACCP STRING, FCHAR VARCHAR(20), FCLNT VARCHAR(3), 
> FCUKY STRING, FCURR DOUBLE, FDATS STRING, FDEC DECIMAL(8, 2), FFLTP FLOAT, 
> FINT1 TINYINT, FINT2 SMALLINT, FINT4 BIGINT, FLANG STRING, FPREC DOUBLE, 
> FQUAN DOUBLE, FTIMS STRING, FUNIT STRING, FSSTRING STRING, FCHAR40 
> VARCHAR(40) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" STORED AS 
> TEXTFILE LOCATION 
> "hdfs:///user/nsqhdp/H_CDC_IMPQ/ZAP_GL_EX_IM_CSV/63E55F5943E95122E1000000C0A83051"
>  
> CSV files are already present on specified location 
> hdfs:///user/nsqhdp/H_CDC_IMPQ/ZAP_GL_EX_IM_CSV/63E55F5943E95122E1000000C0A83051
>  
> When we execute Select fchar40 FROM sapnsq.zap_gl_ex_im_csv ORDER BY fchar40 
> with both Hive and Impala, we get different results:
> - Hive (see Hive_query.png)
> - Impala (see Impala_query.png)
> It seems that the Impala engine truncates strings when they contain
> non-ASCII characters.
> If a character is encoded with 2 bytes, Impala counts it as 2 chars
> (instead of 1).
> As a result, FCHAR40 VARCHAR(40) actually returns fewer than 40 characters.
>  
> Example:
> 1st row contains 3 special characters: É, Ï and ü
> Select with Impala truncates the result by 3 characters.
> According to the Impala documentation 
> (https://www.cloudera.com/documentation/enterprise/5-7-x/topics/impala_varchar.html),
>  Unicode should be supported:
> "All data in CHAR and VARCHAR columns must be in a character encoding that is 
> compatible with UTF-8"
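
The arithmetic in the report can be checked with a small Python sketch. The
sample value and the function name are made up for illustration; the sketch
only models a VARCHAR(40) whose limit counts UTF-8 bytes and does not use
any Impala code.

```python
def read_as_byte_limited_varchar(text: str, limit: int) -> str:
    """Model VARCHAR(limit) where limit counts UTF-8 bytes, not characters."""
    clipped = text.encode("utf-8")[:limit]
    # Drop a trailing partial character, if the cut landed inside one.
    return clipped.decode("utf-8", errors="ignore")


# A made-up 40-character value containing É, Ï and ü (2 bytes each in UTF-8).
value = "ÉÏü" + "x" * 37

chars = len(value)                       # 40 characters
utf8_bytes = len(value.encode("utf-8"))  # 43 bytes
returned = read_as_byte_limited_varchar(value, 40)

print(chars, utf8_bytes, len(returned))
```

With three 2-byte characters the value needs 43 bytes, so a 40-byte limit
returns only 37 characters, matching the 3-character loss described in the
report.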



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
