tustvold commented on a change in pull request #1082: URL: https://github.com/apache/arrow-rs/pull/1082#discussion_r785424844
##########
File path: parquet/src/arrow/array_reader/byte_array.rs
##########
@@ -192,7 +211,16 @@ impl<I: OffsetSizeTrait + ScalarValue> OffsetBuffer<I> {
         self.offsets.len() - 1
     }
 
-    fn try_push(&mut self, data: &[u8]) -> Result<()> {
+    fn try_push(&mut self, data: &[u8], validate_utf8: bool) -> Result<()> {
+        if validate_utf8 {
+            if let Err(e) = std::str::from_utf8(data) {

Review comment:
   So I did some experimentation:

   It is **significantly** faster to verify on push only that the first byte is a valid start of a UTF-8 code point, and then run UTF-8 validation over the larger buffer in one go. This reduces the performance hit on PLAIN encoded strings from ~2x to ~1.1x. I have modified the code to do this.

   With this optimisation applied, switching to simdutf8 gave only a very minor ~6% improvement on PLAIN encoded strings, which shrank to no appreciable difference with RLE encoded strings. This may be down to my machine, or to the lack of non-ASCII characters in the input, but I'm going to leave simdutf8 out for now.
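
   For illustration, a minimal sketch of the two-step check described above. This is not the code from the PR: `SketchBuffer`, its fields, `check_valid_utf8`, and the `String` error type are hypothetical stand-ins. The idea is that if every pushed value starts on a code point boundary and the concatenated buffer is valid UTF-8, then each individual value must be valid UTF-8 too, so the expensive validation only needs to run once over the whole buffer.

```rust
/// Hypothetical stand-in for an offset buffer; names are illustrative only.
pub struct SketchBuffer {
    values: Vec<u8>,
    offsets: Vec<usize>,
}

impl SketchBuffer {
    pub fn new() -> Self {
        // Offsets always start with 0 so that `len()` is `offsets.len() - 1`.
        Self { values: Vec::new(), offsets: vec![0] }
    }

    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Cheap per-value check: reject values whose first byte is a UTF-8
    /// continuation byte (bit pattern 0b10xxxxxx). Full validation is deferred.
    pub fn try_push(&mut self, data: &[u8], validate_utf8: bool) -> Result<(), String> {
        if validate_utf8 {
            if let Some(&b) = data.first() {
                if (b & 0xC0) == 0x80 {
                    return Err("value starts with a UTF-8 continuation byte".to_string());
                }
            }
        }
        self.values.extend_from_slice(data);
        self.offsets.push(self.values.len());
        Ok(())
    }

    /// Expensive check, run once over the whole values buffer.
    pub fn check_valid_utf8(&self) -> Result<(), String> {
        std::str::from_utf8(&self.values)
            .map(|_| ())
            .map_err(|e| format!("invalid UTF-8: {}", e))
    }
}
```

   Note that the per-push check alone is not sufficient: a value starting with an invalid lead byte such as 0xFF passes it and is only caught by the whole-buffer validation. Conversely, the whole-buffer validation alone would accept a multi-byte character split across two values, which is what the first-byte check rules out.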