Quanlong Huang created IMPALA-12718:
---------------------------------------
Summary: trim() functions lack UTF-8 support
Key: IMPALA-12718
URL: https://issues.apache.org/jira/browse/IMPALA-12718
Project: IMPALA
Issue Type: Bug
Reporter: Quanlong Huang
The following string functions lack UTF-8 support:
{noformat}
BTRIM(STRING a, STRING chars_to_trim)
LTRIM(STRING a, STRING chars_to_trim)
RTRIM(STRING a, STRING chars_to_trim)
{noformat}
Here is an issue reported by our user:
{noformat}
[localhost:21050] default> select rtrim('价格,', ',');
+-----------------------+
| rtrim('价格,', ',') |
+-----------------------+
| 价� |
+-----------------------+{noformat}
The result is the same with utf8_mode=true. Note that the comma used in the
above strings is the Chinese (full-width) punctuation mark ',', not the English
(ASCII) mark ','. The cause is that the bytes of the Chinese character ',' are
treated as a set of single-byte characters to trim. The UTF-8 encodings of
these characters are:
* '价': 0xe4 0xbb 0xb7
* '格': 0xe6 0xa0 0xbc
* ',': 0xef 0xbc 0x8c
Each character is encoded as 3 bytes. The last byte of '格' is 0xbc, which also
appears in the bytes of ',', so it is removed as well. The result is '价'
followed by the first two bytes of '格' (0xe6 0xa0). This trailing sequence is
malformed UTF-8, so it is displayed as '�'.