tustvold commented on a change in pull request #1082:
URL: https://github.com/apache/arrow-rs/pull/1082#discussion_r785424844



##########
File path: parquet/src/arrow/array_reader/byte_array.rs
##########
@@ -192,7 +211,16 @@ impl<I: OffsetSizeTrait + ScalarValue> OffsetBuffer<I> {
         self.offsets.len() - 1
     }
 
-    fn try_push(&mut self, data: &[u8]) -> Result<()> {
+    fn try_push(&mut self, data: &[u8], validate_utf8: bool) -> Result<()> {
+        if validate_utf8 {
+            if let Err(e) = std::str::from_utf8(data) {

Review comment:
       So I did some experimentation:
   
   It is **significantly** faster to check on push only that the first byte of 
each value is a valid start of a UTF-8 code point, and then to run UTF-8 
validation over the larger buffer in one go. This reduces the performance hit 
on PLAIN encoded strings from ~2x to ~1.1x. I have modified the code to do this.
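
   A minimal sketch of the approach described above, not the code in this PR: 
`Utf8Buffer`, its fields, and `validate` are hypothetical names used only for 
illustration, and errors are plain `String`s rather than the crate's `Result`. 
The reasoning is that if the concatenated buffer is valid UTF-8 and no value 
starts on a continuation byte, then every offset falls on a code point 
boundary, so each individual value is also valid UTF-8.

```rust
/// Hypothetical, simplified stand-in for an offset buffer (not the PR's type).
struct Utf8Buffer {
    /// Start offset of each value, plus one trailing end offset.
    offsets: Vec<usize>,
    /// All values concatenated back to back.
    values: Vec<u8>,
}

impl Utf8Buffer {
    fn new() -> Self {
        Self { offsets: vec![0], values: Vec::new() }
    }

    /// Cheap per-value check: only reject a value whose first byte is a
    /// UTF-8 continuation byte (bit pattern 0b10xxxxxx), i.e. a value that
    /// would start in the middle of a code point.
    fn try_push(&mut self, data: &[u8]) -> Result<(), String> {
        if let Some(&b) = data.first() {
            if (b & 0xC0) == 0x80 {
                return Err(format!("value starts mid code point (byte {:#04x})", b));
            }
        }
        self.values.extend_from_slice(data);
        self.offsets.push(self.values.len());
        Ok(())
    }

    /// Full validation, run once over the whole concatenated buffer
    /// instead of once per pushed value.
    fn validate(&self) -> Result<(), String> {
        std::str::from_utf8(&self.values)
            .map(|_| ())
            .map_err(|e| e.to_string())
    }
}
```

   In this sketch a caller would push every value of a batch and then call 
`validate` once; a value that ends mid code point is caught either by the 
next push or by the final whole-buffer validation.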
   
   With this optimisation applied, switching to simdutf8 gave only a minor 
~6% improvement on PLAIN encoded strings, and no appreciable difference on 
RLE encoded strings. This may be down to my machine, or to the lack of 
non-ASCII characters in the input, but I'm going to leave simdutf8 out for now. 





