parquet checksum coverage

Steve Loughran Mon, 14 Nov 2022 03:39:30 -0800

hi

I am busy dealing with a bug where the Azure abfs connector can get the
prefetch data blocks of one thread/task overwritten by those of another
task whose input stream was closed while a prefetch was in progress.
https://issues.apache.org/jira/browse/HADOOP-18521


I have not been able to trigger any failures reading parquet data,
presumably because its seek-heavy read patterns don't benefit from
prefetching much.

Parquet also stores CRC checksums of the pages of data it writes, which I
need a bit of help understanding.


   1. What data in a parquet file is covered by CRC checks, and are there
   any blocks of data (footers, summaries etc) which aren't checksummed?
   2. I see that verification, as set
   by "parquet.page.verify-checksum.enabled", is false by default. Why isn't it
   on? Is there a significant performance hit?
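
For context on question 2, here is a minimal sketch of what a per-page CRC check amounts to. It is not Parquet's own code: the names PageCrcSketch, checksum and verify are hypothetical, and it assumes the stored value is a standard CRC32 (as computed by java.util.zip.CRC32) over the page bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical illustration only; not taken from the Parquet codebase.
public class PageCrcSketch {

    // Compute a CRC32 over the page bytes, truncated to the 32-bit int
    // that a checksum field in a page header would hold (assumption).
    public static int checksum(byte[] pageBytes) {
        CRC32 crc = new CRC32();
        crc.update(pageBytes, 0, pageBytes.length);
        return (int) crc.getValue();
    }

    // Recompute the checksum at read time and compare it with the stored one.
    public static boolean verify(byte[] pageBytes, int storedCrc) {
        return checksum(pageBytes) == storedCrc;
    }

    public static void main(String[] args) {
        byte[] page = "example page payload".getBytes(StandardCharsets.UTF_8);
        int stored = checksum(page);

        // Simulate a block being overwritten: flip a single bit.
        byte[] corrupted = page.clone();
        corrupted[3] ^= 0x01;

        System.out.println("intact page verifies:    " + verify(page, stored));
        System.out.println("corrupted page verifies: " + verify(corrupted, stored));
    }
}
```

The read-side cost of a check like this is one extra pass over each page's bytes. Assuming a Hadoop Configuration object named conf, the option above could be switched on with conf.setBoolean("parquet.page.verify-checksum.enabled", true).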


Thanks

steve
