Re: Clarification on CRC checksum field

Boudewijn Braams Sun, 24 Feb 2019 13:24:49 -0800

Done!

On Thu, Feb 21, 2019 at 8:28 PM Ryan Blue <[email protected]> wrote:


> I'm not aware of any readers or writers using the CRC field. I think it
> would be great to clean up the spec and make it more clear. Want to submit
> a PR to parquet-format for this?
>
> Thanks!
>
> On Thu, Feb 21, 2019 at 6:48 AM Boudewijn Braams <
> [email protected]> wrote:
>
> > Hi all,
> >
> > Although a page-level CRC field is defined in the Thrift specification,
> > currently neither parquet-cpp nor parquet-mr seems to leverage it.
> >
> > Having these checksums would allow localized detection of corruption
> > and provide a means of reasoning about where in the write/read path a
> > corruption was introduced in a production system. Admittedly, these
> > checksums can only detect corruption produced in specific situations
> > and will not capture all forms of data corruption. However, they can
> > still prove useful for data corruption investigations in production
> > (even if writing/validating the checksum is opt-in).
> >
> > The comment in the Thrift specification
> > (https://github.com/apache/parquet-format/blob/2b38663a28ccd4156319c0bf7ae4e6280e0c6e2d/src/main/thrift/parquet.thrift#L607)
> > reads ‘32bit crc for the data below’, which is ambiguous as to what
> > exactly constitutes the ‘data’ that the checksum should be calculated on
> > (does it include the page header, and is it computed on compressed or
> > uncompressed data?).
> >
> > Is anybody aware of systems that are actually already leveraging the CRC
> > field? And if not, should we have a discussion on refining the spec to
> > remove the ambiguity?
> >
> > Thank you,
> > Boudewijn
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
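
To make the ambiguity raised in the thread concrete, here is a minimal sketch of the four readings of "the data below". It is only an illustration, not what any Parquet reader or writer does: the function name, the header bytes, and the use of zlib as the page compression codec are all assumptions made for the example, and CRC-32 as computed by Python's zlib.crc32 is assumed as the checksum variant.

```python
import zlib


def candidate_checksums(header: bytes, page: bytes) -> dict[str, int]:
    """Return the CRC-32 for each plausible reading of 'the data below'.

    `header` is a hypothetical stand-in for a serialized page header and
    `page` for the uncompressed page bytes; zlib is only an assumed codec.
    """
    compressed = zlib.compress(page)
    return {
        "uncompressed data": zlib.crc32(page),
        "compressed data": zlib.crc32(compressed),
        "header + uncompressed data": zlib.crc32(header + page),
        "header + compressed data": zlib.crc32(header + compressed),
    }


if __name__ == "__main__":
    # Made-up inputs; a real page header would be Thrift-encoded.
    results = candidate_checksums(b"hypothetical-header", b"example page values " * 64)
    for name, crc in results.items():
        print(f"{name:28s} 0x{crc:08x}")
```

Because each reading feeds different bytes into the checksum, the four values will in general differ, so a writer and a reader that pick different readings would report corruption on intact pages.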
