[jira] [Commented] (ASTERIXDB-2762) Use code point as the unit in string-related functions

ASF subversion and git services (Jira) Tue, 28 Jul 2020 16:35:18 -0700


    [ 
https://issues.apache.org/jira/browse/ASTERIXDB-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166752#comment-17166752
 ]


ASF subversion and git services commented on ASTERIXDB-2762:
------------------------------------------------------------

Commit c961938ea6187350b39f1bf432b6e7b124299524 in asterixdb's branch 
refs/heads/master from Rui Guo
[ https://gitbox.apache.org/repos/asf?p=asterixdb.git;h=c961938 ]

[ASTERIXDB-2762] Fix str_to_codepoint() and codepoint_to_str()

This commit fixes bugs in the two functions.

Previously, the two functions mishandled surrogate-pair characters
(those that take 4 bytes, or 2 Java chars, in UTF-16 instead of
2 bytes, or 1 Java char).
Such a character was reported as a pair of integers (one per Java
char in the encoding) instead of a single code point, which was not
the expected behavior.

Change-Id: I93563b90e8d4f77886e1cb3ed67519fd0968c95d
Reviewed-on: https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/7306
Integration-Tests: Jenkins <[email protected]>
Tested-by: Jenkins <[email protected]>
Reviewed-by: Dmitry Lychagin <[email protected]>
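
For readers unfamiliar with the surrogate-pair issue described in the commit message, the following is a minimal Java sketch of the difference between iterating by Java char and iterating by code point. It is not the AsterixDB implementation; the class name CodePointRoundTrip and its helper methods are hypothetical and use only standard java.lang.String calls.

```java
import java.util.Arrays;

// Hypothetical illustration only; not taken from the AsterixDB code base.
class CodePointRoundTrip {

    // One integer per Unicode code point (the behavior the commit aims for).
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    // Inverse operation: build a string from a list of code points.
    static String fromCodePoints(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    // One integer per UTF-16 code unit (Java char); a surrogate pair
    // shows up here as two integers, which is the behavior being fixed.
    static int[] toCharUnits(String s) {
        return s.chars().toArray();
    }

    public static void main(String[] args) {
        // Regional indicators U and S, i.e. the flag used in the issue's table.
        String flag = fromCodePoints(new int[] {127482, 127480});

        System.out.println(flag.length());                       // 4
        System.out.println(Arrays.toString(toCodePoints(flag))); // [127482, 127480]
        System.out.println(Arrays.toString(toCharUnits(flag)));  // [55356, 56826, 55356, 56824]
    }
}
```

The char-based view yields four integers for two code points, which matches the "integer pair" symptom the commit message describes.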


> Use code point as the unit in string-related functions
> ------------------------------------------------------
>
>                 Key: ASTERIXDB-2762
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-2762
>             Project: Apache AsterixDB
>          Issue Type: New Feature
>            Reporter: Rui Guo
>            Priority: Minor
>
> Currently, we use byte or Java char (2 bytes) as the unit in the 
> string-related built-in functions. For example, in substr(string, offset, 
> len) the offset and len are in units of Java chars. However, for 
> non-English characters such as emoji and Korean, one character may occupy 
> multiple bytes, and thus it is possible to split one character into several 
> invalid parts.
> Using the code point as the unit would be a more natural way to deal with 
> these multi-byte characters.
>  
> List of functions that need to be updated:
> ||Function||Expected output||
> |substr("🇺🇸🇨🇳", 2, 2);|"🇨🇳"|
> |trim("🇺🇸", "🇺");|"🇸"|
> |reverse("🇨🇳");|"🇳🇨" |
> |length("🇨🇳");| 2|
> | position("👩‍👩‍👧‍👦🇨🇳", "🇨🇳");| 7|
> |regexp_contains("🇨🇳", "🇨🇳");|true|
> |regexp_like("🇨🇳", "🇨🇳");|true|
> |string_to_codepoint("🇺🇸");| [ 127482, 127480 ]|
> |codepoint_to_string([ 127482, 127480]);|"🇺🇸" |
>  
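
As a rough illustration of what "code point as the unit" means for a few of the functions in the table above, here is a hedged Java sketch. The class CodePointStrings and its methods are hypothetical stand-ins, not AsterixDB's built-in functions, and they assume 0-based offsets, which is consistent with the expected outputs listed in the table.

```java
// Hypothetical illustration only; not the AsterixDB built-in functions.
class CodePointStrings {

    // Length in code points rather than Java chars.
    static int length(String s) {
        return s.codePointCount(0, s.length());
    }

    // Substring where offset and len are counted in code points (0-based).
    // offsetByCodePoints throws IndexOutOfBoundsException if the range
    // runs past the end of the string; a real function would handle that.
    static String substr(String s, int offset, int len) {
        int start = s.offsetByCodePoints(0, offset);
        int end = s.offsetByCodePoints(start, len);
        return s.substring(start, end);
    }

    // Reverse code point by code point, so surrogate pairs stay intact.
    static String reverse(String s) {
        int[] cps = s.codePoints().toArray();
        StringBuilder sb = new StringBuilder();
        for (int i = cps.length - 1; i >= 0; i--) {
            sb.appendCodePoint(cps[i]);
        }
        return sb.toString();
    }

    // Position of pattern in s, counted in code points; -1 if absent.
    static int position(String s, String pattern) {
        int charIndex = s.indexOf(pattern);
        return charIndex < 0 ? -1 : s.codePointCount(0, charIndex);
    }

    public static void main(String[] args) {
        String us = new String(new int[] {0x1F1FA, 0x1F1F8}, 0, 2); // flag: U + S
        String cn = new String(new int[] {0x1F1E8, 0x1F1F3}, 0, 2); // flag: C + N
        // Family emoji: four people joined by three zero-width joiners.
        String family = new String(
                new int[] {0x1F469, 0x200D, 0x1F469, 0x200D, 0x1F467, 0x200D, 0x1F466}, 0, 7);

        System.out.println(length(cn));                       // 2
        System.out.println(substr(us + cn, 2, 2).equals(cn)); // true
        System.out.println(position(family + cn, cn));        // 7
    }
}
```

Note that a code point is still not a user-perceived character: each flag is two code points and the family emoji is seven, which is why length("🇨🇳") is expected to be 2 rather than 1.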



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
