Hi,

I am busy dealing with a bug where the Azure abfs connector can get the prefetched data blocks of one thread/task overwritten by those of another task whose input stream was closed while a prefetch was in progress: https://issues.apache.org/jira/browse/HADOOP-18521

I have not been able to trigger any failures reading parquet data, presumably because its seek-heavy read patterns don't benefit much from prefetching. Parquet also stores CRC checksums of the pages of data written, which I need a bit of help understanding.

1. What data in a parquet file is covered by CRC checks, and are there any blocks of data (footers, summaries etc.) which aren't checksummed?
2. I see that verification, as set by "parquet.page.verify-checksum.enabled", is false by default. Why isn't it on? Is there a significant performance hit?

Thanks,
Steve