[ 
https://issues.apache.org/jira/browse/ASTERIXDB-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164112#comment-17164112
 ] 

ASF subversion and git services commented on ASTERIXDB-2762:
------------------------------------------------------------

Commit effc50a30f3d284415eae98b1708f9c6303679a8 in asterixdb's branch 
refs/heads/master from Rui Guo
[ https://gitbox.apache.org/repos/asf?p=asterixdb.git;h=effc50a ]

[ASTERIXDB-2762] Count code points in string length()

This commit makes the string length() built-in function count the number
of code points instead of the number of Java chars in a string.

Change-Id: I3ff25840adc94b4a688c53a06816d5934c6418ad
Reviewed-on: https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/7304
Integration-Tests: Jenkins <jenk...@fulliautomatix.ics.uci.edu>
Tested-by: Jenkins <jenk...@fulliautomatix.ics.uci.edu>
Reviewed-by: Dmitry Lychagin <dmitry.lycha...@couchbase.com>

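To illustrate the difference the commit message describes, here is a minimal Java sketch comparing the two ways of measuring a string. It is not taken from the AsterixDB code base: the class and method names are hypothetical, and it uses only the standard String.length() and String.codePointCount() methods.

```java
// Hypothetical demo class for illustration; not AsterixDB code.
final class CodePointLengthDemo {

    // Counts UTF-16 code units (Java chars). A supplementary character
    // such as an emoji is stored as a surrogate pair and counts twice.
    static int lengthInChars(String s) {
        return s.length();
    }

    // Counts Unicode code points. A surrogate pair counts once.
    static int lengthInCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Two regional indicator symbols, U+1F1E8 and U+1F1F3.
        String flag = new String(new int[] {0x1F1E8, 0x1F1F3}, 0, 2);
        System.out.println(lengthInChars(flag));      // prints 4
        System.out.println(lengthInCodePoints(flag)); // prints 2
    }
}
```

The second result matches the expected output of length() listed in the issue description quoted below.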

> Use code point as the unit in string-related functions
> ------------------------------------------------------
>
>                 Key: ASTERIXDB-2762
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-2762
>             Project: Apache AsterixDB
>          Issue Type: New Feature
>            Reporter: Rui Guo
>            Priority: Minor
>
> Currently, the string-related built-in functions use bytes or Java chars
> (2 bytes each) as the unit. For example, in substr(string, offset, len)
> the offset and len are measured in Java chars. However, for non-English
> characters such as emoji and Korean characters, one character may span
> multiple bytes, so it is possible to split one character into several
> illegal parts.
> Using the code point as the unit would be a more natural way to deal with
> those multi-byte characters.
>
> List of functions that need to be updated:
> ||Function||Expected output||
> |substr("🇺🇸🇨🇳", 2, 2);|"🇨🇳"|
> |trim("🇺🇸", "🇺");|"🇸"|
> |reverse("🇨🇳");|"🇳🇨"|
> |length("🇨🇳");|2|
> |position("👩‍👩‍👧‍👦🇨🇳", "🇨🇳");|7|
> |regexp_contains("🇨🇳", "🇨🇳");|true|
> |regexp_like("🇨🇳", "🇨🇳");|true|
> |string_to_codepoint("🇺🇸");|[ 127482, 127480 ]|
> |codepoint_to_string([ 127482, 127480 ]);|"🇺🇸"|
>

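As a rough illustration of the code-point semantics requested in the table above, the following Java sketch implements a code-point-based substring and the two code point conversions. The class and method names are hypothetical and this is not the AsterixDB implementation; it assumes a 0-based offset, as the substr row suggests, and relies only on standard java.lang.String methods.

```java
import java.util.Arrays;

// Hypothetical demo class for illustration; not AsterixDB code.
final class CodePointSubstrDemo {

    // Substring where offset and len are counted in code points
    // (assumed 0-based). Throws IndexOutOfBoundsException if the
    // requested range runs past the end of the string.
    static String substr(String s, int offset, int len) {
        int begin = s.offsetByCodePoints(0, offset);
        int end = s.offsetByCodePoints(begin, len);
        return s.substring(begin, end);
    }

    // Returns the code points of a string as an int array.
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    // Builds a string from an array of code points.
    static String fromCodePoints(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        // Regional indicators U, S, C, N: two flags, four code points.
        String flags = new String(new int[] {0x1F1FA, 0x1F1F8, 0x1F1E8, 0x1F1F3}, 0, 4);
        String cn = fromCodePoints(new int[] {0x1F1E8, 0x1F1F3});

        System.out.println(substr(flags, 2, 2).equals(cn)); // prints true
        System.out.println(Arrays.toString(toCodePoints(substr(flags, 0, 2))));
        // prints [127482, 127480]
    }
}
```

Because every index is converted with offsetByCodePoints before calling substring, a surrogate pair is never cut in half, which is the failure mode the issue describes.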


--
This message was sent by Atlassian Jira
(v8.3.4#803005)
