[ https://issues.apache.org/jira/browse/ARROW-9133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou updated ARROW-9133:
----------------------------------
    Summary: [C++] Add utf8_upper and utf8_lower  (was: [C++] Add utf8_upper and utf_lower)

> [C++] Add utf8_upper and utf8_lower
> -----------------------------------
>
>                 Key: ARROW-9133
>                 URL: https://issues.apache.org/jira/browse/ARROW-9133
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Maarten Breddels
>            Assignee: Maarten Breddels
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 14.5h
>  Remaining Estimate: 0h
>
> This is the equivalent of https://issues.apache.org/jira/browse/ARROW-9100
> for utf8. This will be a good test of unilib vs. utf8proc, both
> performance-wise and API-wise.
> Also, since Unicode strings can grow and shrink, this is a good moment to
> start thinking about a strategy for memory allocation.
> How much can a 'string' (or byte sequence) length actually grow?
> Item 5.18 mentioned that a string can expand by a factor of 3, by which they
> seem to mean 3 codepoints. This can be validated by checking with Python:
> {code:python}
> for i in range(0x100, 0x110000):
>     codepoint = chr(i)
>     try:
>         bytes_before = codepoint.encode()
>     except UnicodeEncodeError:
>         # surrogate codepoints cannot be encoded as UTF-8
>         continue
>     bytes_after = codepoint.upper().encode()
>     if len(bytes_before) != len(bytes_after):
>         print(i, hex(i), codepoint, codepoint.upper(), len(bytes_before),
>               len(bytes_after))
> ....
> 912 0x390 ΐ Ϊ́ 2 6
> ...{code}
> showing that a two-byte codepoint can expand to 3 (2-byte) codepoints (2
> bytes => 6 bytes). The character Ϊ́ has no single precomposed capital
> character, so it is composed of a single base character and two combining
> characters.
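
To make the 2-byte to 6-byte expansion concrete, here is a small sketch that breaks the upper-cased form of U+0390 into its individual codepoints. It relies only on Python's built-in `str.upper()` and the standard `unicodedata` module. The helper name `describe_upper` is made up for this illustration and is not part of Arrow, utf8proc, or unilib.

```python
import unicodedata


def describe_upper(ch):
    """Return (hex codepoint, Unicode name, UTF-8 byte length) for each
    codepoint in the upper-cased form of ch."""
    return [
        (hex(ord(c)), unicodedata.name(c), len(c.encode("utf-8")))
        for c in ch.upper()
    ]


small = "\u0390"  # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
print(len(small.encode("utf-8")), "byte(s) before upper casing")
for cp, name, nbytes in describe_upper(small):
    print(cp, name, nbytes)
```

The result lists one base letter (U+0399) followed by two combining marks (U+0308 and U+0301), each taking 2 bytes in UTF-8. A function that returns a single codepoint per input codepoint cannot represent this mapping.
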
> However, there are other situations, explained in
> [https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt].
> This increase by a factor of 3 is used in CPython
> [https://github.com/python/cpython/blob/25f38d7044a3a47465edd851c4e04f337b2c4b9b/Objects/unicodeobject.c#L10058]
> which is an easy way to avoid having to grow the buffer dynamically.
> However, growing 3x in size seems at odds with the API of both utf8proc:
> [https://github.com/JuliaStrings/utf8proc/blob/08f9999a0698639f15d07b12c0065a4494f2d504/utf8proc.c#L375]
> and unilib:
> [https://github.com/ufal/unilib/blob/d8276e70b7c11c677897f71030de7258cbb1f99e/unilib/unicode.h#L79]
> which can only return a single 32-bit value (thus 1 codepoint, encoding 1
> character). Both libraries seem to ignore the special cases of case mapping
> (neither library uses or downloads SpecialCasing.txt).
> This means that if Arrow wants to support the same features as Python
> regarding upper casing and lower casing (which means really implementing
> the Unicode case mappings), neither library is sufficient.
> There are more edge cases/irregularities, but I propose to start with a
> version of utf8_lower and utf8_upper that ignores the special cases.
>
> PS:
> Another interesting finding is that although upper casing can increase a
> buffer length by a factor of 3, lowercasing a utf8 string will only increase
> the byte length by a factor of 3/2 at maximum.
> {code:python}
> for i in range(0x100, 0x110000):
>     codepoint = chr(i)
>     try:
>         bytes_before = codepoint.encode()
>     except UnicodeEncodeError:
>         continue
>     bytes_after = codepoint.lower().encode()
>     if len(bytes_before) != len(bytes_after):
>         print(i, hex(i), codepoint, codepoint.lower(), len(bytes_before),
>               len(bytes_after))
> 304 0x130 İ i̇ 2 3
> 570 0x23a Ⱥ ⱥ 2 3
> 574 0x23e Ⱦ ⱦ 2 3
> 7838 0x1e9e ẞ ß 3 2
> 8486 0x2126 Ω ω 3 2
> 8490 0x212a K k 3 1
> 8491 0x212b Å å 3 2
> 11362 0x2c62 Ɫ ɫ 3 2
> 11364 0x2c64 Ɽ ɽ 3 2
> 11373 0x2c6d Ɑ ɑ 3 2
> 11374 0x2c6e Ɱ ɱ 3 2
> 11375 0x2c6f Ɐ ɐ 3 2
> 11376 0x2c70 Ɒ ɒ 3 2
> 11390 0x2c7e Ȿ ȿ 3 2
> 11391 0x2c7f Ɀ ɀ 3 2
> 42893 0xa78d Ɥ ɥ 3 2
> 42922 0xa7aa Ɦ ɦ 3 2
> 42923 0xa7ab Ɜ ɜ 3 2
> 42924 0xa7ac Ɡ ɡ 3 2
> 42925 0xa7ad Ɬ ɬ 3 2
> 42926 0xa7ae Ɪ ɪ 3 2
> 42928 0xa7b0 Ʞ ʞ 3 2
> 42929 0xa7b1 Ʇ ʇ 3 2
> 42930 0xa7b2 Ʝ ʝ 3 2
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)