[jira] [Commented] (ASTERIXDB-2762) Use code point as the unit in string-related functions

ASF subversion and git services (Jira) Tue, 21 Jul 2020 17:06:14 -0700


    [ 
https://issues.apache.org/jira/browse/ASTERIXDB-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17162401#comment-17162401
 ]


ASF subversion and git services commented on ASTERIXDB-2762:
------------------------------------------------------------

Commit 0ea86a76b4499b27e0bd32f93a3df38f9ade3328 in asterixdb's branch 
refs/heads/master from Rui Guo
[ https://gitbox.apache.org/repos/asf?p=asterixdb.git;h=0ea86a7 ]

[ASTERIXDB-2762] Use code point as the unit in substr()

This commit aims to use code point as the unit in substr().

Currently, Java char (2 bytes) is used as the unit in substr().
However, for non-English characters such as Emoji and Korean,
one character may have multiple bytes and thus can be split into a
few illegal parts if we use Java char as the unit.
A code point is a more natural unit for splitting strings.

Change-Id: I5c38cfd7abcf6f1c1f23a9f74dfd3181531d8c0f
Reviewed-on: https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/7263
Integration-Tests: Jenkins <[email protected]>
Tested-by: Jenkins <[email protected]>
Reviewed-by: Dmitry Lychagin <[email protected]>
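
To illustrate the idea behind the commit, here is a minimal Java sketch of a
substring measured in code points rather than Java chars. It is not the
AsterixDB implementation: the class name CodePointSubstr, the 0-based offset
(taken from the example table in the issue), and the clamping of an overlong
len are all assumptions made for this example.

```java
public final class CodePointSubstr {
    private CodePointSubstr() {}

    /**
     * Returns up to len code points of s, starting at the 0-based code point offset.
     * Assumption of this sketch: a len that runs past the end is clamped.
     */
    public static String substr(String s, int offset, int len) {
        int total = s.codePointCount(0, s.length());
        if (offset < 0 || len < 0 || offset > total) {
            throw new IndexOutOfBoundsException("offset=" + offset + ", len=" + len);
        }
        int count = Math.min(len, total - offset);
        // Translate code point positions into char indexes so that
        // a surrogate pair is never cut in half.
        int begin = s.offsetByCodePoints(0, offset);
        int end = s.offsetByCodePoints(begin, count);
        return s.substring(begin, end);
    }

    public static void main(String[] args) {
        // Four regional indicator symbols: U+1F1FA U+1F1F8 U+1F1E8 U+1F1F3 (8 Java chars).
        String flags = "\uD83C\uDDFA\uD83C\uDDF8\uD83C\uDDE8\uD83C\uDDF3";
        // Char-based substring(1, 3) cuts through two surrogate pairs.
        String broken = flags.substring(1, 3);
        System.out.println(Character.isLowSurrogate(broken.charAt(0)));
        // Code-point-based substr keeps whole characters.
        System.out.println(substr(flags, 2, 2).codePointCount(0, 4));
    }
}
```

Both positions are converted with String.offsetByCodePoints, so the result
always begins and ends on a code point boundary as long as the input is
well-formed UTF-16.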


> Use code point as the unit in string-related functions
> ------------------------------------------------------
>
>                 Key: ASTERIXDB-2762
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-2762
>             Project: Apache AsterixDB
>          Issue Type: New Feature
>            Reporter: Rui Guo
>            Priority: Minor
>
> Currently, we use byte or Java char (2 bytes) as the unit in the 
> string-related built-in functions. For example, in substr(string, offset, 
> len) the offset and len are in the unit of Java char. However, for 
> non-English characters such as Emoji and Korean, one character may have 
> multiple bytes and thus it is possible to split one character into a few 
> illegal parts.
> Using code point as the unit would be a more natural way to deal with those 
> multi-byte characters.
>  
> List of functions that need to be updated:
> ||Function||Expected output||
> |substr("🇺🇸🇨🇳", 2, 2);|"🇨🇳"|
> |trim("🇺🇸", "🇺");|"🇸"|
> |reverse("🇨🇳");|"🇳🇨" |
> |length("🇨🇳");| 2|
> | position("👩‍👩‍👧‍👦🇨🇳", "🇨🇳");| 7|
> |regexp_contains("🇨🇳", "🇨🇳"); regexp_like("🇨🇳", "🇨🇳");|true; true|
> |string_to_codepoint("🇺🇸");| [ 127482, 127480 ]|
> |codepoint_to_string([ 127482, 127480]);|"🇺🇸" |
>  
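
Several of the expected outputs in the table above can be reproduced with
standard Java String APIs. The sketch below covers only a subset (length,
reverse, position, string_to_codepoint, codepoint_to_string). The class name
CodePointOps and its method names are hypothetical, and the 0-based result of
position is an assumption read off the table; none of this is taken from the
AsterixDB code base.

```java
public final class CodePointOps {
    private CodePointOps() {}

    /** Length in code points, not Java chars. */
    public static int length(String s) {
        return s.codePointCount(0, s.length());
    }

    /** StringBuilder.reverse() treats a surrogate pair as a single character. */
    public static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    /** 0-based code point position of the first match, or -1 if absent. */
    public static int position(String s, String target) {
        int charIndex = s.indexOf(target);
        return charIndex < 0 ? -1 : s.codePointCount(0, charIndex);
    }

    public static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    public static String fromCodePoints(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        // U+1F1E8 U+1F1F3: two regional indicator symbols, four Java chars.
        String cn = fromCodePoints(new int[] {0x1F1E8, 0x1F1F3});
        System.out.println(cn.length());  // count in Java chars
        System.out.println(length(cn));   // count in code points
    }
}
```

Note that a code point is still not a user-perceived character: the family
emoji in the position() row is seven code points joined by zero-width joiners,
which is why the expected position there is 7.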



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
