[GitHub] [arrow] pitrou commented on pull request #7449: ARROW-9133: [C++] Add utf8_upper and utf8_lower

GitBox Mon, 29 Jun 2020 11:20:57 -0700


pitrou commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651282415



   > Having a benchmark run on non-ascii codepoints (I think we want to do this 
separate from this PR, but important point).
   
   Yes, I think we can defer that to a separate PR.
   
   > The existing decoder based on 
http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ was new to me. Very interesting 
work, but unfortunately led to a performance regression (~50->30 M/s), which 
I'm surprised about actually. Maybe worth comparing again when we have a 
benchmark with non-ascii codepoints.
   
   Yes, agreed. The main point of this state-machine-based decoder is that it's 
branchless, so it should perform roughly as well on non-ASCII data, where a 
branch-based decoder would suffer from unpredictable branches. On pure ASCII 
data, a branch-based decoder may be faster since its branches will always be 
predicted correctly.
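
   To make the trade-off concrete, here is a minimal sketch on a much smaller 
problem than full decoding: finding the length of a UTF-8 sequence from its 
lead byte. This is neither the DFA decoder linked above nor the code in this 
PR; the names `SeqLengthBranchy`, `SeqLengthTable` and `CountCodepoints` are 
hypothetical, and the input is assumed to be valid UTF-8 (nothing is 
validated).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Branch-based: a chain of comparisons on the lead byte. On pure ASCII input
// the first test always succeeds, so the branch predictor is always right.
inline int SeqLengthBranchy(uint8_t lead) {
  if (lead < 0x80) return 1;  // ASCII
  if (lead < 0xC0) return 1;  // stray continuation byte, step over it
  if (lead < 0xE0) return 2;  // two-byte sequence
  if (lead < 0xF0) return 3;  // three-byte sequence
  return 4;                   // four-byte sequence (not validated)
}

// Table-based: a single load indexed by the high nibble of the lead byte.
// There is no data-dependent branch, so the cost does not depend on how
// ASCII and non-ASCII bytes are mixed in the input.
inline int SeqLengthTable(uint8_t lead) {
  static constexpr uint8_t kLengths[16] = {1, 1, 1, 1, 1, 1, 1, 1,
                                           1, 1, 1, 1, 2, 2, 3, 4};
  return kLengths[lead >> 4];
}

// Counts code points by hopping from lead byte to lead byte, using whichever
// length function is passed in. Assumes valid UTF-8.
template <typename LengthFn>
size_t CountCodepoints(const std::string& s, LengthFn length_of) {
  size_t count = 0;
  for (size_t i = 0; i < s.size();
       i += static_cast<size_t>(length_of(static_cast<uint8_t>(s[i])))) {
    ++count;
  }
  return count;
}
```

   Both functions return the same length for every byte value; only the way 
they get there differs. Which one is faster on a given input is exactly what a 
benchmark with non-ASCII codepoints would have to show.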
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
