[
https://issues.apache.org/jira/browse/IMPALA-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17902315#comment-17902315
]
ASF subversion and git services commented on IMPALA-12718:
----------------------------------------------------------
Commit 81f2673883f65aa71f682bf9fda6dd73888e75a8 in impala's branch
refs/heads/master from Mihaly Szjatinya
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=81f267388 ]
IMPALA-889: Add trim() function matching ANSI SQL definition
As agreed in the JIRA discussions, this change extends the existing TRIM
functionality with support for the SQL-standard TRIM-FROM syntax:
TRIM({[LEADING / TRAILING / BOTH] | [STRING characters]} FROM expr).
The implementation builds on the existing LTRIM / RTRIM / BTRIM family
of functions prepared earlier in IMPALA-6059 and extended for UTF-8 in
IMPALA-12718. It is also partly based on the abandoned PR
https://gerrit.cloudera.org/#/c/4474 and on the similar EXTRACT-FROM
functionality from
https://github.com/apache/impala/commit/543fa73f3a846f0e4527514c993cb0985912b06c.
Supported syntaxes:
Syntax #1 TRIM(<where> FROM <string>);
Syntax #2 TRIM(<charset> FROM <string>);
Syntax #3 TRIM(<where> <charset> FROM <string>);
"where": Case-insensitive trim direction. Valid options are "leading",
"trailing", and "both". "leading" means trimming characters from the
start; "trailing" means trimming characters from the end; "both" means
trimming characters from both sides. For Syntax #2, since no "where"
is specified, the option "both" is implied by default.
"charset": Case-sensitive characters to be removed. This argument is
treated as a set of characters: the order of the characters doesn't
matter and duplicated instances of the same character are ignored. A
NULL argument implies " " (standard space) by default. An empty
argument ("" or '') makes TRIM return the string unchanged. For
Syntax #1, since no "charset" is specified, " " (standard space) is
trimmed by default.
"string": Case-sensitive target string to trim. This argument can be
NULL.
The UTF8_MODE query option is honored by TRIM-FROM, similarly to
existing TRIM().
UTF8_TRIM-FROM can be used to force UTF8 mode regardless of the query
option.
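
The argument rules above can be illustrated with a minimal Python
sketch. It is not Impala code: the function name trim_from and its
parameter order are hypothetical, it always works on code points
(roughly the UTF-8 mode behaviour), and returning NULL (None) for a
NULL string is an assumption, since the text only says the argument
"can be NULL".

```python
from typing import Optional


def trim_from(string: Optional[str],
              where: str = "both",
              charset: Optional[str] = " ") -> Optional[str]:
    """Hypothetical model of TRIM(<where> <charset> FROM <string>)."""
    if string is None:
        return None              # assumption: NULL in, NULL out
    if charset is None:
        charset = " "            # NULL charset implies a standard space
    if charset == "":
        return string            # empty charset leaves the string untouched
    direction = where.lower()    # "where" is case-insensitive
    if direction == "leading":
        return string.lstrip(charset)
    if direction == "trailing":
        return string.rstrip(charset)
    if direction == "both":
        return string.strip(charset)
    raise ValueError("where must be leading, trailing or both")


# Syntax #1: direction only, trims spaces.
print(repr(trim_from("  abc  ", "LEADING")))
# Syntax #2: charset only, "both" is implied.
print(repr(trim_from("xyabcyx", charset="xy")))
# Syntax #3: direction and charset.
print(repr(trim_from("xyabcyx", "trailing", "yx")))
```

Python's strip family already treats its argument as a set of
characters, which is why order and duplicates in the charset make no
difference in this sketch.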
Design Notes:
1. No backend (BE) changes. Since the existing LTRIM / RTRIM / BTRIM
functions fully cover all needed use cases, no backend logic is
required. This differs from the similar EXTRACT-FROM.
2. Syntax wrapper. TrimFromExpr class was introduced as a syntax
wrapper around FunctionCallExpr, which instantiates one of the regular
LTRIM / RTRIM / BTRIM functions. TrimFromExpr's role is to maintain
the integrity of the "phantom" TRIM-FROM built-in function.
3. No TRIM keyword. Following EXTRACT-FROM, no "TRIM" keyword was
added to the language. Although a keyword would generally make parsing
easier and more robust, it would also restrict the use of the token in
other contexts. However, leading/trailing/both, which were previously
reserved words, are now added as keywords so that they can be used
without escaping.
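
The "syntax wrapper" idea from design note 2 can be sketched roughly
as follows. This is an illustration in Python, not the actual
TrimFromExpr code: the function rewrite_trim_from, the lowercase
target names and the argument layout are all assumptions made for the
example.

```python
from typing import List, Optional, Tuple

# Hypothetical mapping from trim direction to the regular function
# that the wrapper would instantiate.
_TARGET_FUNCTION = {
    "leading": "ltrim",
    "trailing": "rtrim",
    "both": "btrim",
}


def rewrite_trim_from(expr: str,
                      where: Optional[str] = None,
                      charset: Optional[str] = None) -> Tuple[str, List[str]]:
    """Return (function name, argument list) for a TRIM-FROM call."""
    direction = (where or "both").lower()   # no direction implies "both"
    name = _TARGET_FUNCTION[direction]      # KeyError for invalid input
    args = [expr] if charset is None else [expr, charset]
    return name, args


# TRIM(LEADING 'x' FROM col) would map to ltrim(col, 'x') in this model.
print(rewrite_trim_from("col", "LEADING", "'x'"))
# TRIM('x' FROM col) would map to btrim(col, 'x').
print(rewrite_trim_from("col", charset="'x'"))
```

Because the rewrite only selects among functions that already exist,
a model like this needs no new evaluation logic, which is the point of
design note 1.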
Change-Id: I3c4fa6d0d8d0684c4b6d8dac8fd531d205e4f7b4
Reviewed-on: http://gerrit.cloudera.org:8080/21825
Reviewed-by: Csaba Ringhofer <[email protected]>
Tested-by: Csaba Ringhofer <[email protected]>
> trim() functions are lack of utf-8 support
> ------------------------------------------
>
> Key: IMPALA-12718
> URL: https://issues.apache.org/jira/browse/IMPALA-12718
> Project: IMPALA
> Issue Type: Bug
> Reporter: Quanlong Huang
> Assignee: Zihao Ye
> Priority: Critical
> Labels: ramp-up
> Fix For: Impala 4.4.0
>
>
> The following string functions lack UTF-8 support:
> {noformat}
> BTRIM(STRING a, STRING chars_to_trim)
> LTRIM(STRING a, STRING chars_to_trim)
> RTRIM(STRING a, STRING chars_to_trim)
> {noformat}
> Here is an issue reported by our user:
> {noformat}
> [localhost:21050] default> select rtrim('价格,', ',');
> +-----------------------+
> | rtrim('价格,', ',') |
> +-----------------------+
> | 价� |
> +-----------------------+{noformat}
> The result is the same with utf8_mode=true. Note that the comma used in
> the above strings is the Chinese punctuation mark ',', not the English
> (ASCII) mark ','.
> The cause is that the bytes of the Chinese character ',' are used as the
> char set. The UTF-8 encoding of these characters:
> * '价': 0xe4 0xbb 0xb7
> * '格': 0xe6 0xa0 0xbc
> * ',': 0xef 0xbc 0x8c
> Each character is encoded into 3 bytes. The last byte of '格' is 0xbc, which
> also appears in the bytes of ',', so it's removed as well. The result is a
> string of '价' and the first two bytes of '格'. The last character becomes
> malformed UTF-8, so it's replaced with '�'.