Hi all,

Although a page-level CRC field is defined in the Thrift specification,
neither parquet-cpp nor parquet-mr currently seems to leverage it.

Having these checksums would allow us to do localized detection of
corruption and would provide a means of reasoning about where in the
write/read path a corruption was introduced in a production system.
Admittedly, these checksums can only help us detect corruption produced
in specific situations and will not capture all forms of data corruption.
However, they can still prove useful in aiding data corruption
investigations in production (even if writing/validating the checksum is
to be opt-in).

The comment in the Thrift specification (
https://github.com/apache/parquet-format/blob/2b38663a28ccd4156319c0bf7ae4e6280e0c6e2d/src/main/thrift/parquet.thrift#L607)
reads ‘32bit crc for the data below’, which is ambiguous as to what
exactly constitutes the ‘data’ that the checksum should be calculated over
(does it include the page header? Is it computed on the compressed or the
uncompressed data?).
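
To make the ambiguity concrete, here is a minimal Python sketch of three
plausible readings of ‘the data below’. Everything in it is an assumption
for illustration only: the header bytes are a made-up placeholder rather
than a real Thrift-serialized PageHeader, zlib stands in for whichever page
codec is in use, and the CRC variant is the one implemented by zlib.crc32
(the spec comment does not name a polynomial).

```python
import zlib

# Hypothetical stand-ins: a real page has a Thrift-serialized PageHeader
# followed by the (optionally compressed) page body.
header_bytes = b"hypothetical-serialized-page-header"
uncompressed_body = b"example page values " * 8
compressed_body = zlib.compress(uncompressed_body)  # zlib as a stand-in codec


def crc32(data: bytes) -> int:
    """CRC-32 as implemented by zlib, masked to an unsigned 32-bit value."""
    return zlib.crc32(data) & 0xFFFFFFFF


# Three readings of "the data below"; each yields a different checksum,
# so a writer and a reader must agree on exactly one of them.
candidates = {
    "uncompressed body": crc32(uncompressed_body),
    "compressed body": crc32(compressed_body),
    "header + compressed body": crc32(header_bytes + compressed_body),
}

for name, value in candidates.items():
    print(f"{name}: {value:#010x}")
```

A writer using one reading and a reader using another would flag every page
as corrupt, which is why the spec needs to pin this down before the field
can be used across implementations.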

Is anybody aware of systems that are actually already leveraging the CRC
field? And if not, should we have a discussion on refining the spec to
remove the ambiguity?

Thank you,
Boudewijn
