maartenbreddels opened a new pull request #7656:
URL: https://github.com/apache/arrow/pull/7656
Quite a few issues showed up:
* utf8proc doesn't store or expose whether a codepoint has a Numeric type,
thus we cannot implement isdigit/isnumeric (and also isalnum)
correctly for the unicode versions.
* utf8proc doesn't store or expose extra information about casing (from
DerivedCoreProperties.txt), such as that the Roman numeral Ⅿ is numeric, but
also an upper case letter. For isupper, I have added that information (since
these codepoints are in just a few blocks). For islower, this is quite a list.
We could have some way to encode this, but that could be a maintenance burden.
* It seems utf8proc (incorrectly?) claims some undefined codepoints (e.g.
https://www.compart.com/en/unicode/U+08BE) are UTF8PROC_CATEGORY_LO (General
category Letter Other). This has an effect on isalpha, isprintable, isupper,
islower.
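To make the first two points concrete, here is a small sketch of how CPython (the reference the test compares against) classifies three codepoints. The `describe` helper is a hypothetical name used only for illustration; it is not part of this PR or of utf8proc.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect CPython's view of a single codepoint (illustrative helper)."""
    return {
        "category": unicodedata.category(ch),  # general category, e.g. "Nd"
        "isdecimal": ch.isdecimal(),           # Numeric_Type=Decimal
        "isdigit": ch.isdigit(),               # Numeric_Type=Decimal or Digit
        "isnumeric": ch.isnumeric(),           # any Numeric_Type
        "isupper": ch.isupper(),               # derived Uppercase property
    }


# DIGIT FIVE, SUPERSCRIPT TWO, ROMAN NUMERAL ONE THOUSAND
for ch in ["5", "\u00b2", "\u216f"]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), describe(ch))
```

The general category alone does not separate these cases: the numeric tiers and the upper case flag on Ⅿ come from other property files, which is the information utf8proc does not carry.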
These issues showed up when writing the Python test to compare against
CPython. The test lists all the issues and all the codepoints that give issues.
I wonder what the best way forward is? Ideally, libutf8proc would implement
this.
Would this be a reason not to merge this PR, or can we live with this? (I'm
ok with it as it is, apart from the performance.)
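The kind of comparison described above can be sketched roughly as follows. This is not the test from the PR: `category_only_isupper` is a hypothetical stand-in for a predicate built on the general category alone, and `find_mismatches` is an illustrative helper that sweeps every codepoint and records where the candidate disagrees with CPython.

```python
import sys
import unicodedata


def category_only_isupper(ch: str) -> bool:
    # Hypothetical predicate: upper case decided by general category only.
    return unicodedata.category(ch) == "Lu"


def find_mismatches(candidate, reference) -> list:
    """Return every codepoint where candidate and reference disagree."""
    mismatches = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if candidate(ch) != reference(ch):
            mismatches.append(cp)
    return mismatches


mismatches = find_mismatches(category_only_isupper, str.isupper)
print(f"{len(mismatches)} codepoints differ")
for cp in mismatches[:5]:
    print(f"U+{cp:04X} category={unicodedata.category(chr(cp))}")
```

Listing the offending codepoints this way makes it easy to see whether they cluster in a few blocks (cheap to special-case) or are spread widely (a maintenance burden).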
# Performance:
```
IsAlphaAscii_median     13929427 ns   13925663 ns   3 bytes_per_second=1.11116G/s items_per_second=75.2981M/s
IsAlphaUnicode_median   35342242 ns   35338347 ns   3 bytes_per_second=448.378M/s items_per_second=29.6725M/s
```
I think the performance is reasonable for ASCII. For the Unicode version, we
could replace the call to `utf8proc_category(..)` with a lookup table, which
should speed it up.
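As a rough illustration of the lookup-table idea (in Python rather than the C++ of this PR, and using `unicodedata.category` as a stand-in for `utf8proc_category`), the sketch below precomputes one byte per codepoint in the Basic Multilingual Plane and falls back to the category call above it. The names `ALPHA_CATEGORIES`, `TABLE_SIZE`, `IS_ALPHA_TABLE`, `is_alpha_codepoint` and `is_alpha` are all assumptions made for this example.

```python
import unicodedata

# Assumption: "alphabetic" means the five letter categories.
ALPHA_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}

# Assumption: cover only the BMP with the table to keep it at 64 KiB.
TABLE_SIZE = 0x10000

# One byte per codepoint: 1 if alphabetic, 0 otherwise. Built once up front.
IS_ALPHA_TABLE = bytes(
    1 if unicodedata.category(chr(cp)) in ALPHA_CATEGORIES else 0
    for cp in range(TABLE_SIZE)
)


def is_alpha_codepoint(cp: int) -> bool:
    """Table lookup for BMP codepoints, category call for the rest."""
    if cp < TABLE_SIZE:
        return IS_ALPHA_TABLE[cp] == 1
    return unicodedata.category(chr(cp)) in ALPHA_CATEGORIES


def is_alpha(s: str) -> bool:
    """True if s is non-empty and every codepoint is alphabetic."""
    return len(s) > 0 and all(is_alpha_codepoint(ord(ch)) for ch in s)
```

The trade-off is memory against per-codepoint work: the hot path becomes a single indexed load, while rarely used supplementary-plane codepoints still go through the slower call.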
# Naming
I've used the `_ascii` and `_unicode` suffixes, instead of `_utf8`, since
semantically we don't care about the encoding. It should not matter how the
string is encoded (utf8/16/32).