[ https://issues.apache.org/jira/browse/ASTERIXDB-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166064#comment-17166064 ]
ASF subversion and git services commented on ASTERIXDB-2762:
------------------------------------------------------------
Commit 120d7eac49ad855eb1ae8a295683c0250aa4fe9e in asterixdb's branch
refs/heads/master from Rui Guo
[ https://gitbox.apache.org/repos/asf?p=asterixdb.git;h=120d7ea ]
[ASTERIXDB-2762] Use code point as unit in position()
Change-Id: Icf1b8b3401599e4332dd09534bdf4787cd9d85d6
Reviewed-on: https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/7305
Integration-Tests: Jenkins <[email protected]>
Tested-by: Jenkins <[email protected]>
Reviewed-by: Dmitry Lychagin <[email protected]>
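
For illustration only, a position() that counts in code points could be
sketched in Java roughly as follows. This is not the code from the commit
above: the class CodePointPosition and its method are hypothetical, and the
0-based result is an assumption inferred from the expected value 7 in the
issue description quoted below.

```java
// Hypothetical sketch, not AsterixDB's implementation.
final class CodePointPosition {

    // Returns the 0-based index, counted in code points, of the first
    // occurrence of pattern in text, or -1 if pattern does not occur.
    // Assumes both strings are well-formed UTF-16, so a match found by
    // indexOf always starts on a code point boundary.
    static int position(String text, String pattern) {
        int charIndex = text.indexOf(pattern); // index in Java chars
        if (charIndex < 0) {
            return -1;
        }
        // Convert the char index into a code point index.
        return text.codePointCount(0, charIndex);
    }

    public static void main(String[] args) {
        // Family emoji: woman, ZWJ, woman, ZWJ, girl, ZWJ, boy (7 code points).
        String family = new String(
                new int[] {0x1F469, 0x200D, 0x1F469, 0x200D, 0x1F467, 0x200D, 0x1F466}, 0, 7);
        // Regional indicators C and N.
        String cn = new String(new int[] {0x1F1E8, 0x1F1F3}, 0, 2);

        System.out.println(position(family + cn, cn));  // prints 7
        System.out.println((family + cn).indexOf(cn));  // prints 11 (Java chars)
    }
}
```

The gap between 7 and 11 comes from the four supplementary characters in
the family emoji, each of which takes two Java chars.
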
> Use code point as the unit in string-related functions
> ------------------------------------------------------
>
> Key: ASTERIXDB-2762
> URL: https://issues.apache.org/jira/browse/ASTERIXDB-2762
> Project: Apache AsterixDB
> Issue Type: New Feature
> Reporter: Rui Guo
> Priority: Minor
>
> Currently, the string-related built-in functions use either bytes or Java
> chars (2 bytes each) as their unit. For example, in substr(string, offset,
> len) the offset and len are counted in Java chars. However, for
> non-English characters such as emoji and Korean, one character may occupy
> multiple bytes, so these functions can split a single character into
> several invalid parts.
> Using the code point as the unit would be a more natural way to handle
> such multi-byte characters.
>
> List of functions that need to be updated:
> ||Function||Expected output||
> |substr("🇺🇸🇨🇳", 2, 2);|"🇨🇳"|
> |trim("🇺🇸", "🇺");|"🇸"|
> |reverse("🇨🇳");|"🇳🇨"|
> |length("🇨🇳");|2|
> |position("👩‍👩‍👧‍👦🇨🇳", "🇨🇳");|7|
> |regexp_contains("🇨🇳", "🇨🇳");|true|
> |regexp_like("🇨🇳", "🇨🇳");|true|
> |string_to_codepoint("🇺🇸");|[ 127482, 127480 ]|
> |codepoint_to_string([ 127482, 127480 ]);|"🇺🇸"|
>
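
Several of the expected outputs in the table above can be reproduced with
code-point-aware calls from the Java standard library. The sketch below is
an illustration under assumptions, not AsterixDB's implementation: the
class CodePointStrings and its method names are hypothetical, and offsets
are taken as 0-based to match the substr row.

```java
import java.util.Arrays;

// Hypothetical helpers, named for illustration only.
final class CodePointStrings {

    // Length in code points rather than Java chars.
    static int length(String s) {
        return s.codePointCount(0, s.length());
    }

    // Substring whose offset and len are counted in code points.
    // Throws IndexOutOfBoundsException if the range exceeds the string.
    static String substr(String s, int offset, int len) {
        int begin = s.offsetByCodePoints(0, offset);
        int end = s.offsetByCodePoints(begin, len);
        return s.substring(begin, end);
    }

    // Reverses the order of code points, keeping surrogate pairs intact.
    static String reverse(String s) {
        int[] cps = s.codePoints().toArray();
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = cps.length - 1; i >= 0; i--) {
            sb.appendCodePoint(cps[i]);
        }
        return sb.toString();
    }

    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    static String fromCodePoints(int[] cps) {
        return new String(cps, 0, cps.length);
    }

    public static void main(String[] args) {
        String us = fromCodePoints(new int[] {127482, 127480}); // regional indicators U, S
        String cn = fromCodePoints(new int[] {0x1F1E8, 0x1F1F3}); // regional indicators C, N

        System.out.println(length(cn));                       // prints 2
        System.out.println(cn.length());                      // prints 4 (Java chars)
        System.out.println(substr(us + cn, 2, 2).equals(cn)); // prints true
        System.out.println(Arrays.toString(toCodePoints(us))); // prints [127482, 127480]
    }
}
```

Each regional indicator lies outside the Basic Multilingual Plane, so it
takes two Java chars. Cutting at a char offset of 1 or 3 would therefore
leave an unpaired surrogate, which is the kind of invalid fragment the
issue describes.
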
--
This message was sent by Atlassian Jira
(v8.3.4#803005)