I'm not aware of any readers or writers using the CRC field. I think it would be great to clean up the spec and make it clearer. Want to submit a PR to parquet-format for this?
Thanks!

On Thu, Feb 21, 2019 at 6:48 AM Boudewijn Braams <[email protected]> wrote:

> Hi all,
>
> Although a page-level CRC field is defined in the Thrift specification,
> currently neither parquet-cpp nor parquet-mr seems to leverage it.
>
> Having these checksums will allow us to do localized detection of
> corruptions and provides a means for reasoning about where in the
> write/read path a corruption may have been introduced in a production
> system. Admittedly these checksums can only help us detect corruptions
> produced in specific situations and will not capture all forms of data
> corruption. However, with respect to aiding data corruption investigations
> in production they can still prove useful (even if writing/validating the
> checksum is to be opt-in).
>
> The comment in the Thrift specification (
> https://github.com/apache/parquet-format/blob/2b38663a28ccd4156319c0bf7ae4e6280e0c6e2d/src/main/thrift/parquet.thrift#L607
> )
> reads ‘32bit crc for the data below’, which is somewhat ambiguous as to what
> exactly constitutes the ‘data’ that the checksum should be calculated on
> (does it include the page header, is it on compressed or uncompressed
> data?).
>
> Is anybody aware of systems that are actually already leveraging the CRC
> field? And if not, should we have a discussion on refining the spec to
> remove the ambiguity?
>
> Thank you,
> Boudewijn
>

--
Ryan Blue
Software Engineer
Netflix
