maartenbreddels commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-644949978


   It's not *that* slow: it was at 40% of Vaex's performance (single-threaded), so I think there is still a bit more to be gained. But I have added an optimization that tries ASCII conversion first. This gives a speedup of 7x (compared to Vaex) to 10x (in the benchmarks).
   
   Before:
   ```
   Utf8Lower   193873803 ns    193823124 ns            3 bytes_per_second=102.387M/s items_per_second=5.40996M/s
   Utf8Upper   197154929 ns    197093083 ns            4 bytes_per_second=100.688M/s items_per_second=5.32021M/s
   ```
   
   After:
   ```
   Utf8Lower    19508443 ns     19493652 ns           36 bytes_per_second=1018.02M/s items_per_second=53.7906M/s
   Utf8Upper    19846885 ns     19832066 ns           35 bytes_per_second=1000.65M/s items_per_second=52.8728M/s
   ```
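
For readers who want the idea behind "try ASCII conversion first" in concrete form, here is a minimal Python sketch. It is not the code from this PR (which is C++); the function name `utf8_lower` is made up for illustration, and the sketch only assumes that a cheap all-ASCII check can skip full Unicode case mapping.

```python
def utf8_lower(data: bytes) -> bytes:
    """Lowercase UTF-8 encoded bytes, trying an ASCII-only path first.

    Hypothetical helper for illustration; not Arrow's implementation.
    """
    if data.isascii():
        # Fast path: bytes.lower() only maps A-Z to a-z, so no decoding
        # is needed and the byte length cannot change.
        return data.lower()
    # Slow path: decode, apply full Unicode case mapping, re-encode.
    return data.decode("utf-8").lower().encode("utf-8")


print(utf8_lower(b"Hello World"))
print(utf8_lower("ÉCOLE".encode("utf-8")).decode("utf-8"))
```

The fast path pays for one extra scan of the input, which is cheap when most strings are plain ASCII, as they presumably are in the benchmark data.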
   
   There is one loose end: the growth of the string can cause a utf8 array to be promoted to a large_utf8.
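
To illustrate why growth matters, here is a small Python sketch. It relies on the fact that Arrow's utf8 type uses 32-bit offsets while large_utf8 uses 64-bit offsets; the helper names `lowered_size` and `needs_large_utf8` are hypothetical, and the check is only one possible way to detect the overflow, not what this PR does.

```python
INT32_MAX = 2**31 - 1  # largest total byte count addressable by 32-bit offsets


def lowered_size(values):
    """Total UTF-8 byte size of the values after lowercasing (hypothetical helper)."""
    return sum(len(v.lower().encode("utf-8")) for v in values)


def needs_large_utf8(values):
    """True if the lowercased data would no longer fit 32-bit offsets."""
    return lowered_size(values) > INT32_MAX


# Case mapping can change the encoded length: U+023A lowercases to U+2C65.
s = "\u023a"
print(len(s.encode("utf-8")), len(s.lower().encode("utf-8")))
```

An array whose data is close to the 32-bit limit and contains such characters could therefore overflow its offsets after conversion, even though the input fit.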



