[
https://issues.apache.org/jira/browse/PARQUET-1539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17646595#comment-17646595
]
ASF GitHub Bot commented on PARQUET-1539:
-----------------------------------------
mapleFU commented on PR #126:
URL: https://github.com/apache/parquet-format/pull/126#issuecomment-1348405417
So, should we update `parquet-format`, or keep it as is and not
implement CRC in the Parquet C++ implementation? @pitrou
> Clarify CRC checksum in page header
> -----------------------------------
>
> Key: PARQUET-1539
> URL: https://issues.apache.org/jira/browse/PARQUET-1539
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-format
> Reporter: Boudewijn Braams
> Assignee: Boudewijn Braams
> Priority: Major
> Labels: pull-request-available
> Fix For: format-2.7.0
>
>
> Although a page-level CRC field is defined in the Thrift specification,
> currently neither parquet-cpp nor parquet-mr leverages it. Moreover, the
> [comment|https://github.com/apache/parquet-format/blob/2b38663a28ccd4156319c0bf7ae4e6280e0c6e2d/src/main/thrift/parquet.thrift#L607]
> in the Thrift specification reads ‘32bit crc for the data below’, which is
> ambiguous as to what exactly constitutes the ‘data’ that the checksum
> should be calculated on. To ensure backward- and cross-compatibility of
> Parquet readers/writers that want to leverage the CRC checksums, the
> format should specify exactly how and on what data the checksum should be
> calculated.
> h2. Alternatives
> There are three main choices to be made here:
> # Which variant of CRC32 to use
> # Whether to include the page header itself in the checksum calculation
> # Whether to calculate the checksum on uncompressed or compressed data
> h3. Algorithm
> The CRC field holds a 32-bit value. There are many different variants of the
> original CRC32 algorithm, each producing different values for the same input.
> For ease of implementation we propose to use the standard CRC32 algorithm.
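
To illustrate why the choice of variant matters, here is a minimal Python
sketch. It assumes that "standard CRC32" means the variant used by gzip and
zlib, and it uses CRC-32C (Castagnoli) as one example of a different variant.
The helper name crc32_bitwise is hypothetical and chosen for illustration; it
is not taken from parquet-cpp or parquet-mr.

```python
import zlib

# Reflected polynomials for two common 32-bit CRC variants.
CRC32_POLY = 0xEDB88320   # standard CRC-32 as used by gzip/zlib (assumed meaning of "standard")
CRC32C_POLY = 0x82F63B78  # CRC-32C (Castagnoli), shown only for comparison


def crc32_bitwise(data: bytes, poly: int) -> int:
    """Bit-at-a-time reflected CRC with initial value and final XOR of 0xFFFFFFFF."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right by one bit; apply the polynomial if the dropped bit was set.
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


data = b"123456789"
standard = crc32_bitwise(data, CRC32_POLY)
castagnoli = crc32_bitwise(data, CRC32C_POLY)

# The same input yields different checksums under the two variants, so a
# reader and a writer must agree on the variant.
print(hex(standard), hex(castagnoli), standard == zlib.crc32(data))
```

A reader that verifies with one variant while the writer used another would
reject every page, which is why the format has to name the variant explicitly.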
> h3. Including page header
> The page header itself could be included in the checksum calculation using an
> approach similar to what TCP does, whereby the checksum field itself is
> zeroed out before calculating the checksum that will be inserted there.
> Evidently, including the page header is better in the sense that it increases
> the data covered by the checksum. However, from an implementation
> perspective, not including it is likely easier. Furthermore, given the
> relatively small size of the page header compared to the page itself, simply
> not including it will likely be good enough.
> h3. Compressed vs uncompressed
> *Compressed*
> Pros
> * Inherently faster, less data to operate on
> * Potentially better triaging when determining where a corruption may have
> been introduced, as the checksum is calculated at a later stage
> Cons
> * We have to trust both the encoding stage and the compression stage
> *Uncompressed*
> Pros
> * We only have to trust the encoding stage
> * Possibly able to detect more corruptions: because data is checksummed at
> the earliest possible moment, the checksum is sensitive to corruption
> introduced at any later stage
> Cons
> * Inherently slower, more data to operate on, always need to decompress first
> * Potentially harder triaging, more stages in which corruption could have
> been introduced
> h2. Proposal
> The checksum will be calculated using the *standard CRC32 algorithm*, whereby
> the checksum is to be calculated on the *data only, not including the page
> header* itself (simple implementation) and the checksum will be calculated on
> *compressed data* (inherently faster, likely better triaging).
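
A minimal Python sketch of the proposal as described above: the checksum
covers only the compressed page bytes, not the page header. The functions
write_page and read_page are hypothetical names, zlib.compress stands in for
whatever page codec is in use, and zlib.crc32 is assumed to be the intended
"standard CRC32". This is not the parquet-cpp or parquet-mr implementation.

```python
import zlib
from typing import Tuple


def write_page(page_data: bytes) -> Tuple[int, bytes]:
    """Compress the encoded page, then checksum the compressed bytes only."""
    compressed = zlib.compress(page_data)  # stand-in for the page codec
    crc = zlib.crc32(compressed)           # header bytes are not included
    return crc, compressed


def read_page(crc: int, compressed: bytes) -> bytes:
    """Verify the checksum before decompressing, so corruption is caught early."""
    if zlib.crc32(compressed) != crc:
        raise ValueError("page checksum mismatch")
    return zlib.decompress(compressed)


original = b"example page values " * 50
crc, compressed = write_page(original)
roundtrip = read_page(crc, compressed)

# Flip one bit in the compressed bytes to simulate corruption in storage.
corrupted = bytes([compressed[0] ^ 0x01]) + compressed[1:]
try:
    read_page(crc, corrupted)
    detected = False
except ValueError:
    detected = True
```

Because verification runs on the compressed bytes, a reader can reject a
corrupted page without paying for decompression first.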