maartenbreddels commented on pull request #7449: URL: https://github.com/apache/arrow/pull/7449#issuecomment-647572424
Validating the UTF-8 input made the results slightly slower, but still much better than the initial results. Invalid UTF-8 sequences are now replaced by a '?', as commented in the code. The Unicode replacement character U+FFFD would be more appropriate, but it can make the string grow (3x, since U+FFFD takes three bytes in UTF-8 where '?' takes one). I think we can discuss this separately from this PR.

Recap: Because we cannot use unilib (license issue), and utf8proc gives worse performance (even when inlined), we now have our own UTF-8 encode/decode. Also, calling the upper- and lower-case functions in utf8proc is quite slow, so case conversion is now implemented with a lookup table (also suggested in https://github.com/JuliaStrings/utf8proc/issues/12#issuecomment-645563386) for codepoints up to `0xFFFF`.

Initial performance:
```
Utf8Lower   193873803 ns    193823124 ns            3 bytes_per_second=102.387M/s items_per_second=5.40996M/s
Utf8Upper   197154929 ns    197093083 ns            4 bytes_per_second=100.688M/s items_per_second=5.32021M/s
```

Current performance:
```
Utf8Lower    19677038 ns     19672550 ns           35 bytes_per_second=1008.76M/s items_per_second=53.3015M/s
Utf8Upper    20362432 ns     20360109 ns           34 bytes_per_second=974.698M/s items_per_second=51.5015M/s
```
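For readers who want to see the shape of the lookup-table idea, here is a minimal Python sketch. It is not the C++ code in this PR. It only illustrates the two points above: a precomputed table for codepoints up to `0xFFFF`, and replacing invalid UTF-8 with '?' so the output never grows. All names (`build_case_table`, `convert_case_utf8`, the `qmark` error handler) are made up for this illustration. Using Python's `str.upper`/`str.lower` as the slow path for codepoints above `0xFFFF` is an assumption of the sketch, and Python's decoder may group invalid bytes differently from the PR's own decoder.

```python
import codecs

MAX_BMP = 0xFFFF  # the table covers U+0000..U+FFFF, as in the comment above


def build_case_table(convert):
    """Precompute a codepoint -> codepoint table using a slow converter."""
    table = list(range(MAX_BMP + 1))  # identity mapping by default
    for cp in range(MAX_BMP + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not valid scalar values; leave as-is
        mapped = convert(chr(cp))
        # Keep only one-to-one mappings that stay inside the table range.
        # Expanding mappings such as 'ß' -> 'SS' are left unchanged here.
        if len(mapped) == 1 and ord(mapped) <= MAX_BMP:
            table[cp] = ord(mapped)
    return table


UPPER_TABLE = build_case_table(str.upper)
LOWER_TABLE = build_case_table(str.lower)


def _qmark_handler(err):
    # Replace each invalid byte run reported by the decoder with one '?'
    # and resume decoding right after it.
    return ("?", err.end)


# Hypothetical handler name, registered only for this sketch.
codecs.register_error("qmark", _qmark_handler)


def map_codepoint(cp, table, slow_convert):
    """Table lookup for the BMP, slow path for everything above it."""
    if cp <= MAX_BMP:
        return table[cp]
    mapped = slow_convert(chr(cp))
    return ord(mapped) if len(mapped) == 1 else cp


def convert_case_utf8(data, table, slow_convert):
    """Decode UTF-8 bytes, map each codepoint, and re-encode."""
    text = data.decode("utf-8", errors="qmark")
    out = "".join(chr(map_codepoint(ord(ch), table, slow_convert)) for ch in text)
    return out.encode("utf-8")


if __name__ == "__main__":
    print(convert_case_utf8("héllo".encode("utf-8"), UPPER_TABLE, str.upper))
    print(convert_case_utf8(b"ab\xffc", UPPER_TABLE, str.upper))
```

The table costs one entry per codepoint in the covered range, which is the trade made for replacing a per-character library call with an array index. The size argument for '?' can be checked directly: `len("\ufffd".encode("utf-8"))` is 3, while '?' is a single byte.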