maartenbreddels commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-647572424


   Validating the UTF-8 string made the results slightly slower, but still much 
better than the initial results (roughly a 10x speedup, see the numbers below).
   
   Invalid UTF-8 sequences are now replaced by a '?', as commented in the code. 
The Unicode replacement character U+FFFD would be more appropriate, but it takes 
3 bytes in UTF-8 where '?' takes 1, so it can lead to string length growth (up 
to 3x). I think we can discuss this separately from this PR.
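
To make the length trade-off concrete, here is a minimal sketch of byte-wise replacement. It is not the code in this PR: `SequenceLength` and `ReplaceInvalidUtf8` are hypothetical names, and the validation is deliberately simplified (it checks lead and continuation bytes only and does not reject overlong forms or surrogates).

```cpp
#include <cstddef>
#include <string>

// Length of the UTF-8 sequence introduced by `lead`, or 0 if `lead` cannot
// start a sequence. Simplified: overlong forms and surrogates are not rejected.
inline std::size_t SequenceLength(unsigned char lead) {
  if (lead < 0x80) return 1;
  if ((lead & 0xE0) == 0xC0) return 2;
  if ((lead & 0xF0) == 0xE0) return 3;
  if ((lead & 0xF8) == 0xF0) return 4;
  return 0;  // stray continuation byte or invalid lead byte
}

// Copies `input`, substituting `replacement` for every byte that does not
// belong to a well-formed sequence (hypothetical helper for illustration).
inline std::string ReplaceInvalidUtf8(const std::string& input,
                                      const std::string& replacement) {
  std::string out;
  std::size_t i = 0;
  while (i < input.size()) {
    const std::size_t len = SequenceLength(static_cast<unsigned char>(input[i]));
    bool ok = len != 0 && i + len <= input.size();
    for (std::size_t k = 1; ok && k < len; ++k) {
      // Every trailing byte must look like 10xxxxxx.
      ok = (static_cast<unsigned char>(input[i + k]) & 0xC0) == 0x80;
    }
    if (ok) {
      out.append(input, i, len);
      i += len;
    } else {
      out += replacement;
      i += 1;  // skip one offending byte and resynchronize
    }
  }
  return out;
}
```

With `"?"` as the replacement the output is never longer than the input. With `"\xEF\xBF\xBD"` (U+FFFD encoded as UTF-8) an input consisting only of invalid bytes triples in size, which is the worst case mentioned above.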
   
   Recap:
   Because we cannot use unlib (license issue), and utf8proc gives worse 
performance (even when inlined), we now have our own UTF-8 encode/decode. Calling 
the upper- and lowercase functions in utf8proc is also quite slow, so case 
conversion is now implemented with a lookup table (as also suggested in 
https://github.com/JuliaStrings/utf8proc/issues/12#issuecomment-645563386) for 
codepoints up to `0xFFFF`.
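
The lookup-table idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `BuildUpperTable`, `SlowUpper` and `UpperCodepoint` are hypothetical names, the table is filled only for ASCII and Latin-1 letters to keep the example short, and `SlowUpper` stands in for whatever library call handles codepoints above `0xFFFF`.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical table builder: entry i holds the uppercase form of codepoint i.
// A real table would be filled from a Unicode library once at startup; here
// only ASCII and Latin-1 letters are mapped, for illustration.
inline std::vector<uint32_t> BuildUpperTable() {
  std::vector<uint32_t> table(0x10000);
  for (uint32_t cp = 0; cp < table.size(); ++cp) table[cp] = cp;  // identity
  for (uint32_t cp = 'a'; cp <= 'z'; ++cp) table[cp] = cp - 32;
  // U+00E0..U+00FE map to U+00C0..U+00DE, except U+00F7 (division sign).
  for (uint32_t cp = 0xE0; cp <= 0xFE; ++cp) {
    if (cp != 0xF7) table[cp] = cp - 32;
  }
  return table;
}

// Stand-in for the slower library path used above 0xFFFF (assumption: the
// identity mapping here; a real fallback would call into a Unicode library).
inline uint32_t SlowUpper(uint32_t cp) { return cp; }

// One array index for codepoints up to 0xFFFF, the slow path otherwise.
inline uint32_t UpperCodepoint(uint32_t cp) {
  static const std::vector<uint32_t> kTable = BuildUpperTable();
  return cp <= 0xFFFF ? kTable[cp] : SlowUpper(cp);
}
```

The table costs 256 KiB with 32-bit entries, and in exchange the common case becomes a single bounds check plus an array load instead of a function call into the library.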
   
   Initial performance:
   ```
   Utf8Lower   193873803 ns    193823124 ns            3 bytes_per_second=102.387M/s items_per_second=5.40996M/s
   Utf8Upper   197154929 ns    197093083 ns            4 bytes_per_second=100.688M/s items_per_second=5.32021M/s
   ```
   
   Current performance:
   ```
   Utf8Lower           19677038 ns     19672550 ns           35 bytes_per_second=1008.76M/s items_per_second=53.3015M/s
   Utf8Upper           20362432 ns     20360109 ns           34 bytes_per_second=974.698M/s items_per_second=51.5015M/s
   ```
   
   
   



