[jira] [Commented] (ASTERIXDB-2762) Use code point as the unit in string-related functions

ASF subversion and git services (Jira) Thu, 23 Jul 2020 10:36:17 -0700


    [ 
https://issues.apache.org/jira/browse/ASTERIXDB-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163811#comment-17163811
 ]


ASF subversion and git services commented on ASTERIXDB-2762:
------------------------------------------------------------

Commit 4ce394b6a1ccce77d3d76052812a90d09e192e11 in asterixdb's branch 
refs/heads/master from Rui Guo
[ https://gitbox.apache.org/repos/asf?p=asterixdb.git;h=4ce394b ]

[ASTERIXDB-2762] Use code point as the unit in trim()

This commit changes trim() to use the code point as its unit.

Currently, trim() uses Java char (a 2-byte UTF-16 code unit) as its
unit. However, for non-English characters such as emoji and Korean,
one character may occupy multiple bytes, so trimming by Java char
can split a character into illegal parts.
The code point is a more natural unit for this.

Change-Id: If14092be9c2a654dba392bb2b773db81c9e47ae6
Reviewed-on: https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/7283
Integration-Tests: Jenkins <[email protected]>
Tested-by: Jenkins <[email protected]>
Reviewed-by: Dmitry Lychagin <[email protected]>
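
The problem described in the commit message can be illustrated with a minimal Java sketch. This is not the AsterixDB implementation; the class name CodePointTrim and both helper methods are hypothetical and exist only for illustration. It contrasts trimming by Java char with trimming by code point, using the trim("🇺🇸", "🇺") case from the issue.

```java
import java.util.Set;
import java.util.stream.Collectors;

public class CodePointTrim {
    // Hypothetical char-based trim: strips leading and trailing Java chars
    // (UTF-16 code units) that occur anywhere in `pattern`.
    public static String trimByChar(String input, String pattern) {
        int start = 0;
        int end = input.length();
        while (start < end && pattern.indexOf(input.charAt(start)) >= 0) {
            start++;
        }
        while (end > start && pattern.indexOf(input.charAt(end - 1)) >= 0) {
            end--;
        }
        return input.substring(start, end);
    }

    // Hypothetical code-point-based trim: strips leading and trailing
    // code points that occur anywhere in `pattern`.
    public static String trimByCodePoint(String input, String pattern) {
        Set<Integer> toStrip = pattern.codePoints().boxed().collect(Collectors.toSet());
        int[] cps = input.codePoints().toArray();
        int start = 0;
        int end = cps.length;
        while (start < end && toStrip.contains(cps[start])) {
            start++;
        }
        while (end > start && toStrip.contains(cps[end - 1])) {
            end--;
        }
        return new String(cps, start, end - start);
    }

    public static void main(String[] args) {
        String us = "\uD83C\uDDFA\uD83C\uDDF8"; // U+1F1FA U+1F1F8, the flag "🇺🇸"
        String u = "\uD83C\uDDFA";              // U+1F1FA, regional indicator "🇺"

        // Both regional indicators share the high surrogate D83C, so the
        // char-based version also eats half of the second code point.
        String broken = trimByChar(us, u);
        System.out.println(Character.isLowSurrogate(broken.charAt(0)));

        // The code-point-based version leaves U+1F1F8 intact.
        System.out.println(trimByCodePoint(us, u).equals("\uD83C\uDDF8"));
    }
}
```

In this sketch the char-based version returns a lone low surrogate, which is not a valid character on its own, while the code-point-based version returns "🇸" as listed in the issue.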


> Use code point as the unit in string-related functions
> ------------------------------------------------------
>
>                 Key: ASTERIXDB-2762
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-2762
>             Project: Apache AsterixDB
>          Issue Type: New Feature
>            Reporter: Rui Guo
>            Priority: Minor
>
> Currently, we use byte or Java char (2 bytes) as the unit in the 
> string-related built-in functions. For example, in substr(string, offset, 
> len), offset and len are measured in Java chars. However, for 
> non-English characters such as emoji and Korean, one character may occupy 
> multiple bytes, so it is possible to split one character into several 
> illegal parts.
> Using the code point as the unit would be a more natural way to handle such 
> multi-byte characters.
>  
> List of functions that need to be updated:
> ||Function||Expected output||
> |substr("🇺🇸🇨🇳", 2, 2);|"🇨🇳"|
> |trim("🇺🇸", "🇺");|"🇸"|
> |reverse("🇨🇳");|"🇳🇨" |
> |length("🇨🇳");| 2|
> | position("👩‍👩‍👧‍👦🇨🇳", "🇨🇳");| 7|
> |regexp_contains("🇨🇳", "🇨🇳"); regexp_like("🇨🇳", "🇨🇳");|true; true|
> |string_to_codepoint("🇺🇸");| [ 127482, 127480 ]|
> |codepoint_to_string([ 127482, 127480]);|"🇺🇸" |
>  
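
Several of the expected outputs in the table above can be reproduced with the standard Java String API. The following is a rough sketch under stated assumptions, not AsterixDB code: the class CodePointUnits and its method names are hypothetical, offsets are taken to be 0-based (as the substr and position rows suggest), and no bounds checking is done.

```java
public class CodePointUnits {
    // Number of code points, not Java chars.
    public static int length(String s) {
        return s.codePointCount(0, s.length());
    }

    // Substring where offset and len are counted in code points (0-based offset assumed).
    public static String substr(String s, int offset, int len) {
        int begin = s.offsetByCodePoints(0, offset);
        int end = s.offsetByCodePoints(begin, len);
        return s.substring(begin, end);
    }

    // StringBuilder.reverse() treats a surrogate pair as a single unit.
    public static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Code point index of the first occurrence of target, or -1 if absent.
    public static int position(String s, String target) {
        int charIndex = s.indexOf(target);
        return charIndex < 0 ? -1 : s.codePointCount(0, charIndex);
    }

    public static int[] stringToCodepoint(String s) {
        return s.codePoints().toArray();
    }

    public static String codepointToString(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String us = "\uD83C\uDDFA\uD83C\uDDF8"; // "🇺🇸" = U+1F1FA U+1F1F8
        String cn = "\uD83C\uDDE8\uD83C\uDDF3"; // "🇨🇳" = U+1F1E8 U+1F1F3
        // Family emoji: woman, ZWJ, woman, ZWJ, girl, ZWJ, boy (7 code points).
        String family = "\uD83D\uDC69\u200D\uD83D\uDC69\u200D\uD83D\uDC67\u200D\uD83D\uDC66";

        System.out.println(length(cn));                      // 2
        System.out.println(substr(us + cn, 2, 2).equals(cn)); // true
        System.out.println(position(family + cn, cn));        // 7
        System.out.println(java.util.Arrays.toString(stringToCodepoint(us))); // [127482, 127480]
    }
}
```

Note that a code point is still smaller than a user-perceived character: each flag is two code points and the family emoji is seven, which is why length("🇨🇳") is 2 and the position row yields 7.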



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
