[ 
https://issues.apache.org/jira/browse/IMPALA-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Zihao reassigned IMPALA-12718:
---------------------------------

    Assignee: Ye Zihao

> trim() functions are lack of utf-8 support
> ------------------------------------------
>
>                 Key: IMPALA-12718
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12718
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Quanlong Huang
>            Assignee: Ye Zihao
>            Priority: Critical
>              Labels: ramp-up
>
> The following string functions lack UTF-8 support:
> {noformat}
> BTRIM(STRING a, STRING chars_to_trim)
> LTRIM(STRING a, STRING chars_to_trim)
> RTRIM(STRING a, STRING chars_to_trim)
> {noformat}
> Here is an issue reported by our user:
> {noformat}
> [localhost:21050] default> select rtrim('价格,', ',');
> +-----------------------+
> | rtrim('价格,', ',') |
> +-----------------------+
> | 价�                   |
> +-----------------------+{noformat}
> The result is the same with utf8_mode=true. Note that the comma in the
> strings above is the Chinese (fullwidth) punctuation mark ',' (U+FF0C), not
> the ASCII comma ','.
> The cause is that chars_to_trim is treated as a set of bytes rather than a
> set of characters, so every byte of ',' is trimmed individually. The UTF-8
> encodings of the characters involved are:
>  * '价': 0xe4 0xbb 0xb7
>  * '格': 0xe6 0xa0 0xbc
>  * ',': 0xef 0xbc 0x8c
> Each character is encoded as 3 bytes. The last byte of '格' is 0xbc, which
> also appears in the encoding of ',', so it is removed as well. The result is
> '价' followed by the first two bytes of '格'. That trailing byte sequence is
> malformed UTF-8, so it is displayed as '�'.
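
The byte-level mechanism described above can be reproduced outside Impala. The following Python sketch is illustrative only: the helper names rtrim_bytes and rtrim_code_points are hypothetical, and neither is Impala's actual implementation or its eventual fix. The first helper treats chars_to_trim as a set of bytes, which is the behavior the report describes; the second treats it as a set of code points, which is one plausible shape for a UTF-8-aware version.

```python
def rtrim_bytes(s: str, chars_to_trim: str) -> bytes:
    """Byte-wise right trim: chars_to_trim is treated as a set of single bytes."""
    data = s.encode("utf-8")
    trim_set = set(chars_to_trim.encode("utf-8"))  # {0xef, 0xbc, 0x8c} for ','
    end = len(data)
    while end > 0 and data[end - 1] in trim_set:
        end -= 1
    return data[:end]


def rtrim_code_points(s: str, chars_to_trim: str) -> str:
    """Code-point-aware right trim: chars_to_trim is a set of whole characters."""
    trim_set = set(chars_to_trim)
    end = len(s)
    while end > 0 and s[end - 1] in trim_set:
        end -= 1
    return s[:end]


# '\uff0c' is the fullwidth comma ',' used in the report.
broken = rtrim_bytes("价格\uff0c", "\uff0c")
print(broken)                                    # b'\xe4\xbb\xb7\xe6\xa0'
print(broken.decode("utf-8", errors="replace"))  # '价' followed by U+FFFD
print(rtrim_code_points("价格\uff0c", "\uff0c"))  # 价格
```

The byte-wise version strips 0x8c, 0xbc, 0xef and then the trailing 0xbc of '格', leaving the truncated sequence 0xe6 0xa0. The code-point version removes only the fullwidth comma.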



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

