maartenbreddels commented on pull request #7656:
URL: https://github.com/apache/arrow/pull/7656#issuecomment-656627248


   I'm not sure why `C++ / AMD64 Ubuntu 18.04 C++ ASAN UBSAN (pull_request)` 
fails: it doesn't see U+A7BA as upper case (added in Unicode 13). Does that 
build maybe pick up an older version of utf8proc somehow?
   
   It's now as fast as what I have in Vaex:
   ```
   IsAlphaNumericAscii_median     14829307 ns     14828665 ns            3 bytes_per_second=1068.54M/s items_per_second=70.7128M/s
   IsAlphaNumericUnicode_median   14178411 ns     14177657 ns            3 bytes_per_second=1117.6M/s items_per_second=73.9598M/s
   ```
   
   The ASCII version can be a bit faster (8%) by hand-writing `any_of`, but that 
might be case-specific, so I prefer using the STL.
   
   If we merge this soon, I'd prefer to keep the Python unit tests in, as an 
in-code reminder of what is missing from utf8proc. They seem to be willing to 
look into supporting `islower` better 
(https://github.com/JuliaStrings/utf8proc/issues/195). I can open an issue to 
track that. I personally don't mind the difference between Python's and 
Arrow's `isnumeric`/`isdigit` that much (I guess those are much less used), 
but it would be nice to be able to match Python exactly.
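
For reference, here is a minimal sketch of how Python itself classifies a few characters with `str.isdecimal`, `str.isdigit` and `str.isnumeric`. It shows only the Python side of the comparison. The helper name `classify` and the sample characters are chosen for illustration and are not taken from the PR's tests, and what the Arrow kernels return for the same inputs would have to be checked separately.

```python
import unicodedata


def classify(ch):
    """Return (isdecimal, isdigit, isnumeric) for a one-character string.

    `classify` is a hypothetical helper for this illustration only.
    """
    return (ch.isdecimal(), ch.isdigit(), ch.isnumeric())


# Sample characters picked for illustration:
#   "5"      ASCII digit
#   U+0663   ARABIC-INDIC DIGIT THREE
#   U+00B2   SUPERSCRIPT TWO
#   U+00BD   VULGAR FRACTION ONE HALF
#   "a"      a letter, as a negative case
samples = ["5", "\u0663", "\u00b2", "\u00bd", "a"]

for ch in samples:
    # Print the code point and general category next to Python's three predicates.
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {classify(ch)}")
```

The three predicates nest in Python: every character accepted by `isdecimal` is accepted by `isdigit`, and every character accepted by `isdigit` is accepted by `isnumeric`. An implementation based only on the general category cannot reproduce that nesting, because SUPERSCRIPT TWO and VULGAR FRACTION ONE HALF share the category `No` while Python's `isdigit` accepts only the former.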
   
   

