maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-645546979
I've added my own UTF-8 encode/decode for now. With lookup tables I now get:

```
Utf8Lower_median   18414820 ns   18408392 ns   3 bytes_per_second=1078.04M/s items_per_second=56.9618M/s
Utf8Upper_median   17004210 ns   17003407 ns   3 bytes_per_second=1.13976G/s items_per_second=61.6686M/s
```

which is faster than the 'ascii' version implemented previously (which got `items_per_second=53M/s`). Benchmark results vary a lot, between `items_per_second=55-66M/s`.

Using utf8proc's encode/decode (inlined), this goes down to `18M/s`. I have to look into why that is, since they do a bit more sanity checking. Ideally, some of this goes upstream.
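For readers who want a concrete picture of the "own encode/decode plus lookup table" idea, here is a minimal sketch. It is not the code from this PR. All names (`BuildLowerTable`, `DecodeUnchecked`, `EncodeUnchecked`, `Utf8LowerSketch`) are hypothetical. The table layout (one entry per BMP codepoint) is an assumption, and only a few ranges are filled in for illustration. The decoder deliberately skips validation and assumes the input is already valid UTF-8.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Assumption: the table covers the Basic Multilingual Plane only.
constexpr uint32_t kTableSize = 0x10000;

// Build a codepoint -> lowercase codepoint table. Only a few ranges are
// filled in here; a real table would be generated from Unicode data.
std::vector<uint32_t> BuildLowerTable() {
  std::vector<uint32_t> table(kTableSize);
  for (uint32_t cp = 0; cp < kTableSize; ++cp) table[cp] = cp;  // identity
  for (uint32_t cp = 'A'; cp <= 'Z'; ++cp) table[cp] = cp + 0x20;  // ASCII
  for (uint32_t cp = 0xC0; cp <= 0xDE; ++cp) {
    if (cp != 0xD7) table[cp] = cp + 0x20;  // Latin-1, skipping U+00D7
  }
  for (uint32_t cp = 0x410; cp <= 0x42F; ++cp) table[cp] = cp + 0x20;  // Cyrillic
  return table;
}

// Decode one codepoint and advance p. No validation: the input must be
// well-formed UTF-8 with the full sequence available.
inline uint32_t DecodeUnchecked(const uint8_t*& p) {
  const uint32_t b0 = p[0];
  uint32_t cp;
  if (b0 < 0x80) {
    cp = b0;
    p += 1;
  } else if (b0 < 0xE0) {
    cp = ((b0 & 0x1F) << 6) | (p[1] & 0x3Fu);
    p += 2;
  } else if (b0 < 0xF0) {
    cp = ((b0 & 0x0F) << 12) | ((p[1] & 0x3Fu) << 6) | (p[2] & 0x3Fu);
    p += 3;
  } else {
    cp = ((b0 & 0x07) << 18) | ((p[1] & 0x3Fu) << 12) |
         ((p[2] & 0x3Fu) << 6) | (p[3] & 0x3Fu);
    p += 4;
  }
  return cp;
}

// Append the UTF-8 encoding of cp to out.
inline void EncodeUnchecked(uint32_t cp, std::string& out) {
  assert(cp <= 0x10FFFF);
  if (cp < 0x80) {
    out.push_back(static_cast<char>(cp));
  } else if (cp < 0x800) {
    out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else if (cp < 0x10000) {
    out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  } else {
    out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
    out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
  }
}

// Lowercase a valid UTF-8 string: decode, table lookup, re-encode.
// Codepoints outside the table are passed through unchanged.
std::string Utf8LowerSketch(const std::string& in,
                            const std::vector<uint32_t>& table) {
  std::string out;
  out.reserve(in.size());
  const uint8_t* p = reinterpret_cast<const uint8_t*>(in.data());
  const uint8_t* end = p + in.size();
  while (p < end) {
    uint32_t cp = DecodeUnchecked(p);
    if (cp < kTableSize) cp = table[cp];
    EncodeUnchecked(cp, out);
  }
  return out;
}
```

The per-codepoint work here is one branchy decode, one array load, and one branchy encode, with no validity checks on the hot path. That is one plausible reason a version like this could outrun a decoder that validates every sequence, though the numbers above were not measured with this sketch.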