maartenbreddels commented on pull request #7656: URL: https://github.com/apache/arrow/pull/7656#issuecomment-656627248
I'm not sure why `C++ / AMD64 Ubuntu 18.04 C++ ASAN UBSAN (pull_request)` fails: it doesn't see U+A7BA as upper case (added in Unicode 13). Does that build maybe pick up an older version of utf8proc somehow?

It's now as fast as what I have in Vaex:

```
IsAlphaNumericAscii_median      14829307 ns   14828665 ns   3 bytes_per_second=1068.54M/s items_per_second=70.7128M/s
IsAlphaNumericUnicode_median    14178411 ns   14177657 ns   3 bytes_per_second=1117.6M/s items_per_second=73.9598M/s
```

The ASCII version can be made a bit faster (about 8%) by hand-writing `any_of`, but that gain might be case-specific, so I prefer to use the STL.

If we merge this soon, I'd prefer to keep the Python unit tests in, as an in-code reminder of what is missing from utf8proc. The utf8proc maintainers seem willing to look into supporting `islower` better (https://github.com/JuliaStrings/utf8proc/issues/195). I can open an issue to track that.

Personally, I don't mind the difference between Python's and Arrow's `isnumeric`/`isdigit` that much (I guess those are much less used), but it would be nice to be able to match Python exactly.
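For context on the `isnumeric`/`isdigit` point, here is a minimal sketch of how Python's built-in `str` predicates classify a few characters. It only shows the Python side of the comparison and says nothing about what Arrow or utf8proc return. The sample characters and the `classify` helper are assumptions chosen for illustration and do not come from the PR.

```python
# Sample characters chosen for illustration (hypothetical, not from the PR).
samples = {
    "ascii digit 5": "5",
    "superscript two": "\u00b2",
    "vulgar fraction one half": "\u00bd",
    "arabic-indic digit three": "\u0663",
}


def classify(ch):
    """Return (isdecimal, isdigit, isnumeric) for a single character."""
    return (ch.isdecimal(), ch.isdigit(), ch.isnumeric())


# Python treats these as nested sets: decimal implies digit implies numeric.
results = {name: classify(ch) for name, ch in samples.items()}

for name, flags in results.items():
    print(f"{name:26} decimal={flags[0]!s:5} digit={flags[1]!s:5} numeric={flags[2]!s:5}")
```

A kernel that is meant to match Python exactly would need to reproduce all three levels of this classification, which is why characters such as the superscript two and the one-half fraction are useful test cases.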
