maartenbreddels opened a new pull request #7656:
URL: https://github.com/apache/arrow/pull/7656


   Quite a few issues showed up:
    * utf8proc doesn't store and expose the information if a codepoint is of a 
Numeric type, thus we cannot implement isdigit/isnumeric (and also isalnum) 
correctly for the unicode versions.
    * utf8proc doesn't store and expose extra information about casing (from 
DerivedCoreProperties.txt ), such as that a Roman letter Ⅿ, is a digit, but 
also an upper case letter. For isupper, I have added that information (since 
they are just in a few blocks). For islower, this is quite a list. We could 
have some way to encoded this, but that could be a maintenance burden.
    * It seems utf8proc (incorrectly?) claims some undefined codepoints (e.g. 
https://www.compart.com/en/unicode/U+08BE) are UTF8PROC_CATEGORY_LO (General 
category Letter Other). This has an effect on isalpha, isprintable, isupper, 
islower.
   
   These issues showed up when writing the Python test to compareagainst 
CPython. The test lists all the issues and all the codepoints that give issues.
   
   I wonder what the best way forward is? Ideally, libutf8proc would implement 
this.
   
   Would this be a reason to not merge this PR, or can we live with this (I'm 
ok with it as it is, minus the performance). 
   
   # Performance:
   ```
   IsAlphaAscii_median     13929427 ns     13925663 ns            3 
bytes_per_second=1.11116G/s items_per_second=75.2981M/s
   IsAlphaUnicode_median   35342242 ns     35338347 ns            3 
bytes_per_second=448.378M/s items_per_second=29.6725M/s
   ```
   
   I think the performance is reasonable for ascii, but I think we can do a 
lookup table for the call to `utf8proc_category(..)`, which should speed up the 
unicode version.
   
   # Naming
   I've used the `_ascii` and `_unicode` suffixes, instead of `_utf8`, since 
for semantic, we don't care about the encoding. It should not matter how the 
string is encoded (utf8/16/32).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to