Rui Guo created ASTERIXDB-2762:
----------------------------------
Summary: Use code point as the unit in string-related functions
Key: ASTERIXDB-2762
URL: https://issues.apache.org/jira/browse/ASTERIXDB-2762
Project: Apache AsterixDB
Issue Type: New Feature
Reporter: Rui Guo
Currently, the string-related built-in functions use the byte or the Java char
(2 bytes) as their unit. For example, in substr(string, offset, len), offset and
len are counted in Java chars. However, a non-English character such as an emoji
or a Korean character may occupy multiple bytes, so these functions can split
one character into several illegal parts.
Using code point as the unit would be a more natural way to deal with those
multi-byte characters.
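A minimal Java sketch of the difference, not the AsterixDB implementation: the class CodePointSubstr and the helper substrByCodePoint are hypothetical names introduced only for illustration, and the 0-based offset is an assumption taken from the substr example in the table in this message. The sketch relies only on the standard String.offsetByCodePoints and String.codePointCount methods.

```java
class CodePointSubstr {
    // Hypothetical helper: offset and len are counted in code points, not Java chars.
    // Assumes a 0-based offset; throws IndexOutOfBoundsException if the range is invalid.
    static String substrByCodePoint(String s, int offset, int len) {
        int start = s.offsetByCodePoints(0, offset); // char index of the first code point
        int end = s.offsetByCodePoints(start, len);  // char index just past the last one
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        // Regional indicator symbols U, S, C, N (the US flag followed by the CN flag).
        String flags = new String(new int[] {0x1F1FA, 0x1F1F8, 0x1F1E8, 0x1F1F3}, 0, 4);
        String cn = new String(new int[] {0x1F1E8, 0x1F1F3}, 0, 2);

        System.out.println(flags.length());                          // 8 Java chars
        System.out.println(flags.codePointCount(0, flags.length())); // 4 code points
        System.out.println(substrByCodePoint(flags, 2, 2).equals(cn)); // true
    }
}
```

Each of these code points needs a surrogate pair, so a char-based offset that lands on an odd index would cut a pair in half, while the code point offsets always land on a boundary.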
List of functions that need to be updated:
||Function||Expected output||
|substr("🇺🇸🇨🇳", 2, 2);|"🇨🇳"|
|trim("🇺🇸", "🇺");|"🇸"|
|reverse("🇨🇳");|"🇳🇨"|
|length("🇨🇳");|2|
|position("👩‍👩‍👧‍👦🇨🇳", "🇨🇳");|7|
|regexp_contains("🇨🇳", "🇨🇳");|true|
|regexp_like("🇨🇳", "🇨🇳");|true|
|string_to_codepoint("🇺🇸");|[ 127482, 127480 ]|
|codepoint_to_string([ 127482, 127480 ]);|"🇺🇸"|
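Several of the expected outputs above can be reproduced with standard Java String APIs. The following is only a sketch under assumptions: the class CodePointOps and its method names are hypothetical, they are not AsterixDB code, and position is assumed to return a 0-based code point index, or -1 when the pattern is absent.

```java
import java.util.Arrays;

class CodePointOps {
    // Number of code points rather than Java chars.
    static int length(String s) {
        return s.codePointCount(0, s.length());
    }

    // Reverse the order of code points, keeping each surrogate pair intact.
    static String reverse(String s) {
        int[] cps = s.codePoints().toArray();
        for (int i = 0, j = cps.length - 1; i < j; i++, j--) {
            int tmp = cps[i];
            cps[i] = cps[j];
            cps[j] = tmp;
        }
        return new String(cps, 0, cps.length);
    }

    static int[] stringToCodepoint(String s) {
        return s.codePoints().toArray();
    }

    static String codepointToString(int[] cps) {
        return new String(cps, 0, cps.length);
    }

    // Assumed semantics: 0-based code point index of the first match, -1 if none.
    static int position(String s, String pattern) {
        int charIndex = s.indexOf(pattern);
        return charIndex < 0 ? -1 : s.codePointCount(0, charIndex);
    }

    public static void main(String[] args) {
        String us = codepointToString(new int[] {127482, 127480});
        String cn = codepointToString(new int[] {0x1F1E8, 0x1F1F3});
        // Woman, ZWJ, woman, ZWJ, girl, ZWJ, boy: one family emoji, 7 code points.
        String family = codepointToString(
                new int[] {0x1F469, 0x200D, 0x1F469, 0x200D, 0x1F467, 0x200D, 0x1F466});

        System.out.println(length(cn));                            // 2
        System.out.println(Arrays.toString(stringToCodepoint(us))); // [127482, 127480]
        System.out.println(position(family + cn, cn));             // 7
    }
}
```

Note that a code point is still not the same as a user-perceived character: the family emoji counts as 7 code points, which is why the expected position is 7 rather than 1.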
--
This message was sent by Atlassian Jira
(v8.3.4#803005)