maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-644949978
It's not *that* slow: it was at 40% of Vaex's performance (single threaded), so I think there is still a bit more to be gained. But I have added an optimization that tries an ASCII conversion first. This gives a 7x (compared to Vaex) to 10x (in the benchmarks) speedup.

Before:
```
Utf8Lower  193873803 ns  193823124 ns   3 bytes_per_second=102.387M/s items_per_second=5.40996M/s
Utf8Upper  197154929 ns  197093083 ns   4 bytes_per_second=100.688M/s items_per_second=5.32021M/s
```

After:
```
Utf8Lower   19508443 ns   19493652 ns  36 bytes_per_second=1018.02M/s items_per_second=53.7906M/s
Utf8Upper   19846885 ns   19832066 ns  35 bytes_per_second=1000.65M/s items_per_second=52.8728M/s
```

There is one loose end: the growth of the strings can cause a utf8 array to be promoted to a large_utf8.
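To illustrate the idea of trying an ASCII conversion first, here is a minimal Python sketch. It is not the C++ kernel from this PR; the function name `utf8_lower` is hypothetical, and Python's `str.lower()` stands in for the Unicode case mapping, which may differ in detail from what the Arrow kernel does.

```python
def utf8_lower(data: bytes) -> bytes:
    """Lowercase UTF-8 bytes, trying an ASCII-only fast path first.

    Hypothetical helper for illustration, not Arrow's implementation.
    """
    if data.isascii():
        # Fast path: every byte is < 0x80, so a bytewise ASCII lowercase
        # is enough and the output has exactly the same length.
        return data.lower()
    # Slow path: decode, apply Unicode case mapping, and re-encode.
    # The output may be longer or shorter than the input.
    return data.decode("utf-8").lower().encode("utf-8")


if __name__ == "__main__":
    print(utf8_lower(b"Hello World"))

    # U+023A is 2 bytes in UTF-8; its lowercase form U+2C65 is 3 bytes.
    src = "\u023a".encode("utf-8")
    print(len(src), len(utf8_lower(src)))
```

The second example also shows the loose end: non-ASCII case mapping can grow the encoded data. A utf8 array uses 32-bit offsets, so if the total grows beyond what those can address, the result would need the 64-bit offsets of large_utf8.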