[ https://issues.apache.org/jira/browse/IMPALA-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384389#comment-17384389 ]
ASF subversion and git services commented on IMPALA-2019:
---------------------------------------------------------
Commit 4df03a31ec77b54138aba2805ff5e376463c8e23 in impala's branch
refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=4df03a3 ]
IMPALA-2019(Part-2): Provide UTF-8 support in instr() and locate()
Similar to the previous patch, this patch adds UTF-8 support to the instr()
and locate() builtin functions so that they behave consistently with
Hive's. Both string functions take an optional position argument:
INSTR(STRING str, STRING substr[, BIGINT position[, BIGINT occurrence]])
LOCATE(STRING substr, STRING str[, INT pos])
Their return values are the positions of the matched substring.
In UTF-8 mode (enabled with set UTF8_MODE=true), these positions are
counted in UTF-8 characters rather than bytes.
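The difference between byte positions and character positions can be
illustrated with a minimal Python sketch. This is not Impala's
implementation: instr_bytes and instr_chars are hypothetical helper names,
and the optional position and occurrence arguments are not modelled.

```python
def instr_bytes(s: str, substr: str) -> int:
    """1-based byte position of the first match, or 0 if there is none."""
    return s.encode("utf-8").find(substr.encode("utf-8")) + 1


def instr_chars(s: str, substr: str) -> int:
    """1-based character position of the first match, or 0 if there is none."""
    return s.find(substr) + 1


text = "你好世界"  # four characters, three bytes each in UTF-8
print(instr_bytes(text, "世"))  # 7: the match starts at the seventh byte
print(instr_chars(text, "世"))  # 3: the match is the third character
```

For pure ASCII input the two helpers agree, which is why the distinction
only shows up with non-ASCII data.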
Error handling:
Each byte of a malformed UTF-8 sequence is counted as one character. This
is consistent with Hive, which replaces such bytes with U+FFFD
(REPLACEMENT CHARACTER); e.g. GenericUDFInstr calls Text#toString(),
which performs the replacement. Other error-handling behaviors, such as
ignoring malformed bytes or reporting errors, could be offered later;
IMPALA-10761 will focus on this.
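The counting rule above can be sketched in Python as follows. This is only
an illustration under stated assumptions: count_utf8_chars is a
hypothetical name, and well-formedness is delegated to Python's strict
UTF-8 decoder, whose rules may differ in detail from the checks Impala
performs.

```python
def count_utf8_chars(data: bytes) -> int:
    """Count characters; each byte of a malformed sequence counts as one."""
    count = 0
    i = 0
    while i < len(data):
        lead = data[i]
        # Expected sequence length, taken from the lead byte's high bits.
        if lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            width = 1
        try:
            # Strict decoding rejects truncated, overlong and stray bytes.
            data[i:i + width].decode("utf-8")
        except UnicodeDecodeError:
            width = 1  # malformed: consume one byte as one character
        i += width
        count += 1
    return count


print(count_utf8_chars("héllo".encode("utf-8")))  # 5 characters in 6 bytes
print(count_utf8_chars(b"a\xffb"))                # 3: 0xFF counts as one
print(count_utf8_chars(b"\xe4\xbd"))              # 2: truncated 3-byte sequence
```

Because every loop iteration consumes at least one byte, the function
terminates on arbitrary input and never returns more than len(data).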
Tests:
- Add BE unit tests and e2e tests
- Add random tests to make sure malformed UTF-8 characters do not cause
crashes.
Change-Id: Ic13c3d04649c1aea56c1aaa464799b5e4674f662
Reviewed-on: http://gerrit.cloudera.org:8080/17580
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Proper UTF-8 support in string functions
> ----------------------------------------
>
> Key: IMPALA-2019
> URL: https://issues.apache.org/jira/browse/IMPALA-2019
> Project: IMPALA
> Issue Type: New Feature
> Components: Backend
> Affects Versions: Impala 2.1, Impala 2.2
> Reporter: Andrés Cordero
> Assignee: Quanlong Huang
> Priority: Critical
> Labels: sql-language
>
> As documented here:
> https://impala.apache.org/docs/build/html/topics/impala_string.html
> Impala does not properly handle non-ASCII UTF-8 characters, so string
> functions such as length return results that are inconsistent with Hive.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)