alamb commented on issue #6058:
URL: https://github.com/apache/arrow-rs/issues/6058#issuecomment-2761665362
> [@alamb](https://github.com/alamb) [@XiangpengHao](https://github.com/XiangpengHao) Is utf8 validation in parquet reader necessary? I found a large proportion of `parquet::arrow::buffer::offset_buffer::OffsetBuffer<I>::check_valid_utf8` when profiling datafusion-comet native scan.

I think it depends on how much you trust your input files to be valid. If you trust the files to contain only valid UTF-8 data, then disabling UTF-8 validation is certainly an option.

However, I think disabling this check would be somewhat cheating on benchmarks, as real systems should be validating all user-supplied input for safety.

Here is a ticket describing the - https://github.com/apache/arrow-rs/issues/6701

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
