maartenbreddels commented on pull request #7449:
URL: https://github.com/apache/arrow/pull/7449#issuecomment-651165626


   @pitrou many thanks for the review. I've implemented all your suggestions 
except:
    * Raising an error on invalid utf8 data (see comment)
    * Having a benchmark run on non-ascii codepoints (I think we want to do 
this separately from this PR, but it is an important point).
   
   Btw, I wasn't aware of the existing utf8 code already in Arrow. The existing 
decoder based on http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ was new to me. 
Very interesting work, but unfortunately using it led to a performance 
regression (~50->30 M/s), which I'm actually surprised about. It may be worth 
comparing again when we have a benchmark with non-ascii codepoints.
   
   @wesm I hope this is ready to go 👍 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

