Rui Guo created ASTERIXDB-2762:
----------------------------------

             Summary: Use code point as the unit in string-related functions
                 Key: ASTERIXDB-2762
                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-2762
             Project: Apache AsterixDB
          Issue Type: New Feature
            Reporter: Rui Guo


Currently, the string-related built-in functions use either a byte or a Java 
char (2 bytes) as the unit. For example, in substr(string, offset, len), offset 
and len are counted in Java chars. However, a non-English character such as an 
emoji or a Korean character may occupy multiple bytes or multiple Java chars, 
so these functions can split one character into several illegal parts.

Using code point as the unit would be a more natural way to deal with those 
multi-byte characters.
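
To make the difference between the two units concrete, here is a minimal Java 
sketch. It is not the AsterixDB implementation; the class CodePointUnits and 
the helper substrByCodePoint are hypothetical names used only for illustration, 
and the 0-based offset is an assumption taken from the expected outputs listed 
in this issue.

```java
public class CodePointUnits {
    // Hypothetical helper: substring where offset and len are counted in
    // code points (0-based offset), not in Java chars.
    // Throws IndexOutOfBoundsException if the range exceeds the string.
    static String substrByCodePoint(String s, int offset, int len) {
        int begin = s.offsetByCodePoints(0, offset);   // char index of the first code point
        int end = s.offsetByCodePoints(begin, len);    // char index just past the last one
        return s.substring(begin, end);
    }

    public static void main(String[] args) {
        // Regional indicators U, S, C, N (the two flags used in this issue).
        String flags = new String(new int[] {0x1F1FA, 0x1F1F8, 0x1F1E8, 0x1F1F3}, 0, 4);
        String cn = new String(new int[] {0x1F1E8, 0x1F1F3}, 0, 2);

        System.out.println(flags.length());                          // 8 Java chars
        System.out.println(flags.codePointCount(0, flags.length())); // 4 code points

        // Char-based slicing can cut a surrogate pair in half.
        String broken = flags.substring(1, 3);
        System.out.println(Character.isLowSurrogate(broken.charAt(0)));  // true
        System.out.println(Character.isHighSurrogate(broken.charAt(1))); // true

        // Code-point-based slicing keeps every character intact.
        System.out.println(substrByCodePoint(flags, 2, 2).equals(cn));   // true
    }
}
```

Each regional indicator lies outside the Basic Multilingual Plane, so it takes 
two Java chars; counting in code points avoids producing unpaired surrogates.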


List of functions that need to be updated:


||Function||Expected output||
|substr("🇺🇸🇨🇳", 2, 2);|"🇨🇳"|
|trim("🇺🇸", "🇺");|"🇸"|
|reverse("🇨🇳");|"🇳🇨"|
|length("🇨🇳");|2|
|position("👩‍👩‍👧‍👦🇨🇳", "🇨🇳");|7|
|regexp_contains("🇨🇳", "🇨🇳");|true|
|regexp_like("🇨🇳", "🇨🇳");|true|
|string_to_codepoint("🇺🇸");|[ 127482, 127480 ]|
|codepoint_to_string([ 127482, 127480 ]);|"🇺🇸"|
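
Several of the expected outputs above can be reproduced with the following 
Java sketch, which treats a code point as the unit. It is only an illustration 
under stated assumptions: the class CodePointFunctions and its method names are 
hypothetical, they are not the AsterixDB function implementations, and 
position is assumed to be 0-based, as the expected value 7 suggests.

```java
import java.util.Arrays;

public class CodePointFunctions {
    // Hypothetical counterparts of codepoint_to_string / string_to_codepoint.
    static String fromCodePoints(int... cps) {
        return new String(cps, 0, cps.length);
    }

    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    // Length counted in code points instead of Java chars.
    static int length(String s) {
        return s.codePointCount(0, s.length());
    }

    // Reverse the order of code points, keeping surrogate pairs intact.
    static String reverse(String s) {
        int[] cps = toCodePoints(s);
        for (int i = 0, j = cps.length - 1; i < j; i++, j--) {
            int tmp = cps[i];
            cps[i] = cps[j];
            cps[j] = tmp;
        }
        return fromCodePoints(cps);
    }

    // 0-based position of pattern in s, counted in code points; -1 if absent.
    static int position(String s, String pattern) {
        int charIndex = s.indexOf(pattern);
        return charIndex < 0 ? -1 : s.codePointCount(0, charIndex);
    }

    public static void main(String[] args) {
        String us = fromCodePoints(0x1F1FA, 0x1F1F8);
        String cn = fromCodePoints(0x1F1E8, 0x1F1F3);
        // Family emoji: woman, ZWJ, woman, ZWJ, girl, ZWJ, boy (7 code points).
        String family = fromCodePoints(0x1F469, 0x200D, 0x1F469, 0x200D,
                                       0x1F467, 0x200D, 0x1F466);

        System.out.println(length(cn));                           // 2
        System.out.println(position(family + cn, cn));            // 7
        System.out.println(Arrays.toString(toCodePoints(us)));    // [127482, 127480]
        System.out.println(reverse(cn).equals(fromCodePoints(0x1F1F3, 0x1F1E8))); // true
    }
}
```

Note that a code point is still smaller than a user-perceived character: the 
family emoji is a single glyph but counts as 7 code points, which is why the 
expected position is 7.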




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
