tustvold commented on a change in pull request #1082: URL: https://github.com/apache/arrow-rs/pull/1082#discussion_r785454161
##########
File path: parquet/src/arrow/array_reader/byte_array.rs
##########
@@ -192,7 +211,16 @@ impl<I: OffsetSizeTrait + ScalarValue> OffsetBuffer<I> {
         self.offsets.len() - 1
     }
-    fn try_push(&mut self, data: &[u8]) -> Result<()> {
+    fn try_push(&mut self, data: &[u8], validate_utf8: bool) -> Result<()> {
+        if validate_utf8 {
+            if let Err(e) = std::str::from_utf8(data) {

Review comment:
   I _think_ this should be sufficient, but I'm not an expert on UTF-8. My reasoning is that when you slice a `str`, all it validates is that the start and end offsets pass `std::str::is_char_boundary` - [here](https://doc.rust-lang.org/std/primitive.str.html#panics-3). Assuming that the standard library is correct, and that the only invariant of `str` is that the bytes are valid UTF-8 as a whole, I think this is no different?
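
   To make the reasoning above concrete, here is a minimal sketch. It assumes a hypothetical `Offsets` type (a simplified stand-in, not the actual `OffsetBuffer` from this PR) that appends values to one byte vector and records where each value ends. It checks that validating each value on its own leaves the concatenated bytes valid UTF-8, with every recorded offset on a char boundary, and that half of a multi-byte character is rejected when validated alone.

```rust
/// Hypothetical stand-in for the reader's offset buffer: values are
/// appended to one byte vector and `offsets` records where each one ends.
struct Offsets {
    values: Vec<u8>,
    offsets: Vec<usize>,
}

impl Offsets {
    fn new() -> Self {
        Self { values: Vec::new(), offsets: vec![0] }
    }

    /// Append `data`, validating it on its own first (as in the diff above).
    fn try_push(&mut self, data: &[u8]) -> Result<(), std::str::Utf8Error> {
        std::str::from_utf8(data)?;
        self.values.extend_from_slice(data);
        self.offsets.push(self.values.len());
        Ok(())
    }
}

fn main() {
    let mut buf = Offsets::new();
    for v in ["héllo", "", "日本語", "🦀"] {
        buf.try_push(v.as_bytes()).unwrap();
    }

    // The concatenation of individually valid values is valid as a whole.
    let whole = std::str::from_utf8(&buf.values).unwrap();

    // Every recorded offset is a char boundary, so slicing cannot panic.
    assert!(buf.offsets.iter().all(|&o| whole.is_char_boundary(o)));
    assert_eq!(&whole[buf.offsets[2]..buf.offsets[3]], "日本語");

    // Half of a multi-byte character is rejected when validated alone.
    let e_acute = "é".as_bytes(); // two bytes: 0xC3 0xA9
    assert!(buf.try_push(&e_acute[..1]).is_err());
}
```

   Per-value validation is stricter than validating the whole buffer once: a value that ends in the middle of a character fails here, even if the bytes of the next value would happen to complete it.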