hi

I am busy dealing with a bug where the Azure abfs connector can get the
prefetch data blocks of one thread/task overwritten by those of another
task whose input stream was closed while a prefetch was in progress.
https://issues.apache.org/jira/browse/HADOOP-18521

I have not been able to trigger any failures reading parquet data,
presumably because its seek-heavy read patterns don't benefit much from
prefetching.

Parquet also stores CRC checksums of the pages of data written, which I
need a bit of help understanding.


   1. What data in a parquet file is covered by CRC checks, and are there
   any blocks of data (footers, summaries etc) which aren't checksummed?
   2. I see that verification, as set
   by "parquet.page.verify-checksum.enabled", is false by default. Why isn't it
   on? Is there a significant performance hit?


Thanks

steve
