Done!

On Thu, Feb 21, 2019 at 8:28 PM Ryan Blue <[email protected]> wrote:
> I'm not aware of any readers or writers using the CRC field. I think it
> would be great to clean up the spec and make it more clear. Want to submit
> a PR to parquet-format for this?
>
> Thanks!
>
> On Thu, Feb 21, 2019 at 6:48 AM Boudewijn Braams <[email protected]> wrote:
>
> > Hi all,
> >
> > Although a page-level CRC field is defined in the Thrift specification,
> > currently neither parquet-cpp nor parquet-mr seems to leverage it.
> >
> > Having these checksums will allow us to do localized detection of
> > corruptions and provides a means for reasoning about where in the
> > write/read path a corruption may have been introduced in a production
> > system. Admittedly these checksums can only help us detect corruptions
> > produced in specific situations and will not capture all forms of data
> > corruptions. However, with respect to aiding data corruption
> > investigations in production they can still prove useful (even if
> > writing/validating the checksum is to be opt-in).
> >
> > The comment in the Thrift specification (
> > https://github.com/apache/parquet-format/blob/2b38663a28ccd4156319c0bf7ae4e6280e0c6e2d/src/main/thrift/parquet.thrift#L607
> > )
> > reads ‘32bit crc for the data below’, which is somewhat ambiguous as to
> > what exactly constitutes the ‘data’ that the checksum should be
> > calculated on (does it include the page header, is it on compressed or
> > uncompressed data?).
> >
> > Is anybody aware of systems that are actually already leveraging the CRC
> > field? And if not, should we have a discussion on refining the spec to
> > remove the ambiguity?
> >
> > Thank you,
> > Boudewijn
>
> --
> Ryan Blue
> Software Engineer
> Netflix
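
For illustration only, here is a minimal Python sketch of why the ambiguity raised in the quoted message matters. It does not reflect what the spec defines or what parquet-cpp or parquet-mr do. The byte strings are made-up placeholders rather than real Parquet pages, zlib merely stands in for whatever page codec is used, and the helper name `crc32` is hypothetical. The point is only that the same page yields a different 32-bit checksum depending on which bytes count as "the data".

```python
import zlib

# Hypothetical stand-ins, NOT real Parquet bytes.
page_header = b"\x15\x00\x15\x10\x15\x10"            # placeholder for a serialized page header
uncompressed_page = b"parquet page values " * 8      # placeholder page payload
compressed_page = zlib.compress(uncompressed_page)   # zlib stands in for the page codec


def crc32(data: bytes) -> int:
    """Standard CRC-32 as computed by zlib, as an unsigned 32-bit value."""
    return zlib.crc32(data) & 0xFFFFFFFF


# Three plausible readings of "32bit crc for the data below".
candidates = {
    "compressed data only": crc32(compressed_page),
    "uncompressed data only": crc32(uncompressed_page),
    "header + compressed data": crc32(page_header + compressed_page),
}

for name, value in candidates.items():
    print(f"{name}: 0x{value:08x}")

# Flipping a single bit changes the checksum, which is the kind of
# localized corruption detection the quoted message is after.
corrupted = bytearray(compressed_page)
corrupted[0] ^= 0x01
print("corruption detected:", crc32(bytes(corrupted)) != crc32(compressed_page))
```

A writer and a reader that pick different readings would disagree on every page, so whichever interpretation is chosen needs to be stated explicitly in the spec.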
