[ https://issues.apache.org/jira/browse/ASTERIXDB-2762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164112#comment-17164112 ]
ASF subversion and git services commented on ASTERIXDB-2762:
------------------------------------------------------------

Commit effc50a30f3d284415eae98b1708f9c6303679a8 in asterixdb's branch refs/heads/master from Rui Guo
[ https://gitbox.apache.org/repos/asf?p=asterixdb.git;h=effc50a ]

[ASTERIXDB-2762] Count code points in string length()

This commit makes the string length() built-in function count the number of
code points instead of the number of Java chars in a string.

Change-Id: I3ff25840adc94b4a688c53a06816d5934c6418ad
Reviewed-on: https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/7304
Integration-Tests: Jenkins <jenk...@fulliautomatix.ics.uci.edu>
Tested-by: Jenkins <jenk...@fulliautomatix.ics.uci.edu>
Reviewed-by: Dmitry Lychagin <dmitry.lycha...@couchbase.com>

> Use code point as the unit in string-related functions
> ------------------------------------------------------
>
> Key: ASTERIXDB-2762
> URL: https://issues.apache.org/jira/browse/ASTERIXDB-2762
> Project: Apache AsterixDB
> Issue Type: New Feature
> Reporter: Rui Guo
> Priority: Minor
>
> Currently, we use the byte or the Java char (2 bytes) as the unit in the
> string-related built-in functions. For example, in substr(string, offset,
> len) the offset and len are counted in Java chars. However, for
> non-English characters such as emoji and Korean characters, one character
> may occupy multiple bytes, so it is possible to split one character into
> several illegal parts.
> Using the code point as the unit would be a more natural way to deal with
> those multi-byte characters.
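
To illustrate the difference between the two units, here is a minimal Java sketch using only standard `java.lang.String` methods. The class and helper names (`CodePointUnits`, `codePointLength`, `codePointSubstring`) are hypothetical and are not taken from the AsterixDB commit. The 0-based offset is an assumption inferred from the substr example in the issue.

```java
class CodePointUnits {
    // Number of Unicode code points, not UTF-16 code units (Java chars).
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Substring of `len` code points starting at 0-based code point `offset`.
    // Throws IndexOutOfBoundsException if the range runs past the end.
    static String codePointSubstring(String s, int offset, int len) {
        int begin = s.offsetByCodePoints(0, offset);
        int end = s.offsetByCodePoints(begin, len);
        return s.substring(begin, end);
    }

    public static void main(String[] args) {
        // U+1F1FA U+1F1F8 U+1F1E8 U+1F1F3: four regional indicator symbols,
        // each stored as a surrogate pair (two Java chars) in UTF-16.
        int[] cps = {0x1F1FA, 0x1F1F8, 0x1F1E8, 0x1F1F3};
        String flags = new String(cps, 0, cps.length);

        System.out.println(flags.length());          // 8 Java chars
        System.out.println(codePointLength(flags));  // 4 code points

        String tail = codePointSubstring(flags, 2, 2);
        System.out.println(codePointLength(tail));   // 2

        // Cutting by Java char can leave an unpaired surrogate behind.
        char first = flags.substring(0, 1).charAt(0);
        System.out.println(Character.isHighSurrogate(first)); // true
    }
}
```

The last line shows the failure mode the issue describes: a char-based cut can end in the middle of a surrogate pair, which is not a valid character on its own.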
>
> List of functions that need to be updated:
> ||Function||Expected output||
> |substr("🇺🇸🇨🇳", 2, 2);|"🇨🇳"|
> |trim("🇺🇸", "🇺");|"🇸"|
> |reverse("🇨🇳");|"🇳🇨"|
> |length("🇨🇳");|2|
> |position("👩‍👩‍👧‍👦🇨🇳", "🇨🇳");|7|
> |regexp_contains("🇨🇳", "🇨🇳"); regexp_like("🇨🇳", "🇨🇳");|true true|
> |string_to_codepoint("🇺🇸");|[ 127482, 127480 ]|
> |codepoint_to_string([ 127482, 127480 ]);|"🇺🇸"|

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
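
Several of the expected outputs in the table can be reproduced with plain Java. This is a rough sketch only: the class and method names below are hypothetical, the code is not the AsterixDB implementation, and returning -1 when the pattern is absent is an assumption made for the example.

```java
import java.util.Arrays;

class CodePointTable {
    // String -> array of code points (compare string_to_codepoint in the table).
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    // Array of code points -> String (compare codepoint_to_string in the table).
    static String fromCodePoints(int[] cps) {
        return new String(cps, 0, cps.length);
    }

    // Reverses the order of code points, keeping each surrogate pair intact.
    static String reverseByCodePoint(String s) {
        int[] cps = toCodePoints(s);
        for (int i = 0, j = cps.length - 1; i < j; i++, j--) {
            int tmp = cps[i];
            cps[i] = cps[j];
            cps[j] = tmp;
        }
        return fromCodePoints(cps);
    }

    // 0-based position of `pattern` in `s`, counted in code points.
    // Returns -1 when the pattern is absent (an assumption of this sketch).
    static int positionByCodePoint(String s, String pattern) {
        int charIndex = s.indexOf(pattern);
        return charIndex < 0 ? -1 : s.codePointCount(0, charIndex);
    }

    public static void main(String[] args) {
        String us = fromCodePoints(new int[] {127482, 127480});
        System.out.println(Arrays.toString(toCodePoints(us))); // [127482, 127480]

        // Four emoji joined by three zero-width joiners (U+200D): 7 code points.
        String family = fromCodePoints(new int[] {
            0x1F469, 0x200D, 0x1F469, 0x200D, 0x1F467, 0x200D, 0x1F466});
        String cn = fromCodePoints(new int[] {0x1F1E8, 0x1F1F3});
        System.out.println(positionByCodePoint(family + cn, cn)); // 7

        // Reversing by code point swaps the two regional indicators.
        System.out.println(Arrays.toString(toCodePoints(reverseByCodePoint(cn)))); // [127475, 127464]
    }
}
```

The position result of 7 follows from the joined family emoji being seven code points long: four emoji plus three zero-width joiners.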