Hi all,

Although a page-level CRC field is defined in the Thrift specification, neither parquet-cpp nor parquet-mr currently seems to leverage it.
Having these checksums would allow us to do localized detection of corruption and would provide a means of reasoning about where in the write/read path a corruption may have been introduced in a production system. Admittedly, these checksums can only help us detect corruption produced in specific situations and will not capture all forms of data corruption. However, they can still prove useful for data corruption investigations in production, even if writing/validating the checksum is opt-in.

The comment in the Thrift specification (https://github.com/apache/parquet-format/blob/2b38663a28ccd4156319c0bf7ae4e6280e0c6e2d/src/main/thrift/parquet.thrift#L607) reads ‘32bit crc for the data below’, which is somewhat ambiguous as to what exactly constitutes the ‘data’ that the checksum should be calculated on: does it include the page header, and is it computed on compressed or uncompressed data?

Is anybody aware of systems that are already leveraging the CRC field? And if not, should we have a discussion on refining the spec to remove the ambiguity?

Thank you,
Boudewijn
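
P.S. To make the ambiguity concrete, here is a minimal sketch of how the stored value would differ depending on which reading of ‘the data below’ a writer picks. Everything in it is an assumption made for illustration: the header bytes are a hypothetical stand-in for a Thrift-serialized page header, zlib stands in for whatever codec the column actually uses, and the standard CRC-32 from Python's zlib module is assumed because the spec comment does not name a polynomial.

```python
import zlib

# Hypothetical stand-ins (not real Parquet bytes): a real page carries a
# Thrift-serialized PageHeader and codec-specific compressed data.
header = b"hypothetical-serialized-page-header"
uncompressed_page = b"some repetitive page data " * 64
compressed_page = zlib.compress(uncompressed_page)  # zlib used only as an example codec


def crc32(data: bytes) -> int:
    """Return the standard CRC-32 of data as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


# Three plausible readings of "the data below", each giving a different checksum.
candidates = {
    "uncompressed page": crc32(uncompressed_page),
    "compressed page": crc32(compressed_page),
    "header + compressed page": crc32(header + compressed_page),
}

for name, value in candidates.items():
    print(f"{name}: 0x{value:08x}")
```

A reader that validates against a different interpretation than the writer used would flag every page as corrupt, which is why the covered byte range probably needs to be pinned down before anyone starts populating the field.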
