I'm not aware of any readers or writers using the CRC field. I think it
would be great to clean up the spec and make it clearer. Want to submit
a PR to parquet-format for this?

Thanks!

On Thu, Feb 21, 2019 at 6:48 AM Boudewijn Braams <
[email protected]> wrote:

> Hi all,
>
> Although a page-level CRC field is defined in the Thrift specification,
> currently neither parquet-cpp nor parquet-mr seems to leverage it.
>
> Having these checksums would allow us to do localized detection of
> corruption and would provide a means of reasoning about where in the
> write/read path a corruption may have been introduced in a production
> system. Admittedly, these checksums can only help us detect corruption
> produced in specific situations and will not capture all forms of data
> corruption. However, they can still prove useful in aiding data
> corruption investigations in production (even if writing/validating the
> checksum is opt-in).
>
> The comment in the Thrift specification
> (https://github.com/apache/parquet-format/blob/2b38663a28ccd4156319c0bf7ae4e6280e0c6e2d/src/main/thrift/parquet.thrift#L607)
> reads ‘32bit crc for the data below’, which is somewhat ambiguous as to
> what exactly constitutes the ‘data’ that the checksum should be
> calculated on (does it include the page header? Is it computed on
> compressed or uncompressed data?).
>
> Is anybody aware of systems that are actually already leveraging the CRC
> field? And if not, should we have a discussion on refining the spec to
> remove the ambiguity?
>
> Thank you,
> Boudewijn
>
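
To make the ambiguity raised above concrete, here is a minimal sketch (not
taken from parquet-cpp or parquet-mr) of three readings of "the data below".
Everything in it is an assumption for illustration: the header bytes are a
hypothetical stand-in for a serialized page header, zlib stands in for
whatever page codec is in use, and the function name is made up. A writer and
a reader that pick different readings would disagree on the checksum even for
an uncorrupted page.

```python
import zlib


def page_crc_candidates(header: bytes, page: bytes) -> dict:
    """Return CRC-32 values for three possible readings of 'the data below'.

    `header` is a hypothetical stand-in for the serialized page header and
    zlib is only a stand-in for the page compression codec.
    """
    compressed = zlib.compress(page)
    return {
        # Reading 1: checksum over the uncompressed page bytes.
        "uncompressed_data": zlib.crc32(page),
        # Reading 2: checksum over the page bytes as written (compressed).
        "compressed_data": zlib.crc32(compressed),
        # Reading 3: checksum over the header followed by the compressed bytes.
        "header_plus_compressed": zlib.crc32(header + compressed),
    }


# Illustrative inputs only; these are not real Parquet structures.
header = b"hypothetical-serialized-page-header"
page = b"example page values " * 64

candidates = page_crc_candidates(header, page)
for name, crc in candidates.items():
    print(f"{name}: {crc:#010x}")
```

Whichever reading the spec settles on, the point of the sketch is only that
the choice has to be written down for checksums to be interoperable.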


-- 
Ryan Blue
Software Engineer
Netflix
