[
https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Boudewijn Braams updated PARQUET-1539:
--------------------------------------
Description:
Although a page-level CRC field is defined in the Thrift specification,
currently neither parquet-cpp nor parquet-mr leverage it. Moreover, the
[comment|https://github.com/apache/parquet-format/blob/2b38663a28ccd4156319c0bf7ae4e6280e0c6e2d/src/main/thrift/parquet.thrift#L607]
in the Thrift specification reads ‘32bit crc for the data below’, which is
somewhat ambiguous as to what exactly constitutes the ‘data’ that the checksum
should be calculated on. To ensure backward- and cross-compatibility of Parquet
readers/writers that want to leverage the CRC checksums, the format should
specify exactly how and on what data the checksum should be calculated.
h2. Alternatives
There are two main choices to be made here:
# Whether to include the page header itself in the checksum calculation
# Whether to calculate the checksum on uncompressed or compressed data
h3. Algorithm
The CRC field holds a 32-bit value. There are many variants of the original
CRC32 algorithm, each producing a different checksum for the same input, so the
format must pin down exactly one of them. For ease of implementation we propose
to use the standard CRC32 algorithm.
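For concreteness, a minimal sketch of the computation, assuming ‘standard
CRC32’ means the gzip/zlib variant (which is what java.util.zip.CRC32
implements); the class name is illustrative:
{code:java}
import java.util.zip.CRC32;

// Minimal sketch, assuming the "standard" CRC32 is the gzip/zlib
// variant, which is what java.util.zip.CRC32 implements.
public class Crc32Demo {
    public static void main(String[] args) {
        CRC32 crc = new CRC32();
        crc.update("123456789".getBytes());
        // The well-known check value of this variant for "123456789"
        // is 0xCBF43926, which distinguishes it from other variants.
        System.out.printf("crc32 = 0x%08X%n", crc.getValue());
    }
}
{code}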
h3. Including page header
The page header itself could be included in the checksum calculation using an
approach similar to what TCP does, whereby the checksum field itself is zeroed
out before calculating the checksum that will be inserted there. Evidently,
including the page header is better in the sense that it increases the data
covered by the checksum. However, from an implementation perspective, not
including it is likely easier. Furthermore, given the relatively small size of
the page header compared to the page itself, simply not including it will
likely be good enough.
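To make the trade-off concrete, a hypothetical sketch of the TCP-style
alternative; HeaderCodec and serializeWithZeroedCrc are stand-ins for Thrift
serialization of the PageHeader, not an existing API:
{code:java}
import java.util.zip.CRC32;

// Hypothetical sketch of the TCP-style alternative (not what is
// proposed below). HeaderCodec stands in for Thrift serialization
// of the PageHeader struct; it is not an existing API.
public class HeaderInclusiveCrc {
    interface HeaderCodec {
        // Header bytes serialized with the crc field zeroed out
        byte[] serializeWithZeroedCrc();
    }

    static long compute(HeaderCodec header, byte[] pageData) {
        CRC32 crc = new CRC32();
        crc.update(header.serializeWithZeroedCrc()); // header with crc = 0
        crc.update(pageData);                        // followed by page data
        return crc.getValue();
    }
}
{code}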
h3. Compressed vs uncompressed
*Compressed*
Pros
* Inherently faster, less data to operate on
* Potentially better triaging when determining where a corruption may have
been introduced, as the checksum is calculated at a later stage
Cons
* We have to trust both the encoding stage and the compression stage
*Uncompressed*
Pros
* We only have to trust the encoding stage
* Possibly able to detect more corruptions: since the data is checksummed at
the earliest possible moment, the checksum is sensitive to corruption
introduced at any later stage
Cons
* Inherently slower: more data to operate on, and verification always requires
decompressing first
* Potentially harder triaging, as there are more stages in which the corruption
could have been introduced
h2. Proposal
The checksum will be calculated using the standard CRC32 algorithm. It will be
computed over the *page data only, not including the page header* itself
(simpler implementation), and over the *compressed* data (inherently faster,
likely better triaging).
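A sketch of what the proposed scheme would look like on the write and read
paths; the helper names are assumptions for illustration, not parquet-mr API:
{code:java}
import java.util.zip.CRC32;

// Sketch of the proposed scheme: a standard CRC32 over the compressed
// page bytes only, excluding the page header. Names are illustrative.
public class PageCrc {

    // Writer side: the value to store in the page header's crc field.
    static int computeCrc(byte[] compressedPageData) {
        CRC32 crc = new CRC32();
        crc.update(compressedPageData);
        return (int) crc.getValue(); // the Thrift crc field is a 32-bit i32
    }

    // Reader side: verify against the stored value before decompressing,
    // so corruption is caught as early as possible on the read path.
    static void verify(int storedCrc, byte[] compressedPageData) {
        if (computeCrc(compressedPageData) != storedCrc) {
            throw new IllegalStateException("Page CRC mismatch, data may be corrupted");
        }
    }
}
{code}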
> Clarify CRC checksum in page header
> -----------------------------------
>
> Key: PARQUET-1539
> URL: https://issues.apache.org/jira/browse/PARQUET-1539
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Boudewijn Braams
> Priority: Major
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)