[parquet-format] branch master updated: PARQUET-2218: [Format] Clarify CRC computation (#188)

apitrou Tue, 03 Jan 2023 06:28:30 -0800

This is an automated email from the ASF dual-hosted git repository.

apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git



The following commit(s) were added to refs/heads/master by this push:
     new 613a1cf  PARQUET-2218: [Format] Clarify CRC computation (#188)
613a1cf is described below

commit 613a1cf4475c662457a0fd81a894ce4709799e3b
Author: Antoine Pitrou <[email protected]>
AuthorDate: Tue Jan 3 15:28:17 2023 +0100

    PARQUET-2218: [Format] Clarify CRC computation (#188)
    
    When trying to implement CRC computation in Parquet C++, we found the wording to be ambiguous.
    
    Clarify that CRC computation happens on the exact binary serialization (instead of a long-winded and confusing elaboration about v1 and v2 data page layout).
    
    Also, clarify that CRC computation can apply to all page kinds, not only data pages
    (for reference, parquet-mr currently support checksumming v1 data pages as well as dictionary pages).
    
    Also, see discussion on https://github.com/apache/parquet-format/pull/126#issuecomment-1348081137 and below.
---
 README.md                      |  8 ++++----
 src/main/thrift/parquet.thrift | 39 +++++++++++++++------------------------
 2 files changed, 19 insertions(+), 28 deletions(-)

diff --git a/README.md b/README.md
index 99b0546..d0f654f 100644
--- a/README.md
+++ b/README.md
@@ -239,10 +239,10 @@ skip pages more efficiently. See [PageIndex.md](PageIndex.md) for details and
 the reasoning behind adding these to the format.
 
 ## Checksumming
-Data pages can be individually checksummed.  This allows disabling of checksums at the
-HDFS file level, to better support single row lookups. Data page checksums are calculated
-using the standard CRC32 algorithm on the compressed data of a page (not including the
-page header itself).
+Pages of all kinds can be individually checksummed. This allows disabling of checksums
+at the HDFS file level, to better support single row lookups. Checksums are calculated
+using the standard CRC32 algorithm - as used in e.g. GZip - on the serialized binary
+representation of a page (not including the page header itself).
 
 ## Error recovery
 If the file metadata is corrupt, the file is lost.  If the column metadata is corrupt,
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 8c4ddd0..54beb47 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -639,32 +639,23 @@ struct PageHeader {
   /** Compressed (and potentially encrypted) page size in bytes, not including this header **/
   3: required i32 compressed_page_size
 
-  /** The 32bit CRC for the page, to be be calculated as follows:
-   * - Using the standard CRC32 algorithm
-   * - On the data only, i.e. this header should not be included. 'Data'
-   *   hereby refers to the concatenation of the repetition levels, the
-   *   definition levels and the column value, in this exact order.
-   * - On the encoded versions of the repetition levels, definition levels and
-   *   column values
-   * - On the compressed versions of the repetition levels, definition levels
-   *   and column values where possible;
-   *   - For v1 data pages, the repetition levels, definition levels and column
-   *     values are always compressed together. If a compression scheme is
-   *     specified, the CRC shall be calculated on the compressed version of
-   *     this concatenation. If no compression scheme is specified, the CRC
-   *     shall be calculated on the uncompressed version of this concatenation.
-   *   - For v2 data pages, the repetition levels and definition levels are
-   *     handled separately from the data and are never compressed (only
-   *     encoded). If a compression scheme is specified, the CRC shall be
-   *     calculated on the concatenation of the uncompressed repetition levels,
-   *     uncompressed definition levels and the compressed column values.
-   *     If no compression scheme is specified, the CRC shall be calculated on
-   *     the uncompressed concatenation.
-   * - In encrypted columns, CRC is calculated after page encryption; the
-   *   encryption itself is performed after page compression (if compressed)
+  /** The 32-bit CRC checksum for the page, to be be calculated as follows:
+   *
+   * - The standard CRC32 algorithm is used (with polynomial 0x04C11DB7,
+   *   the same as in e.g. GZip).
+   * - All page types can have a CRC (v1 and v2 data pages, dictionary pages,
+   *   etc.).
+   * - The CRC is computed on the serialization binary representation of the page
+   *   (as written to disk), excluding the page header. For example, for v1
+   *   data pages, the CRC is computed on the concatenation of repetition levels,
+   *   definition levels and column values (optionally compressed, optionally
+   *   encrypted).
+   * - The CRC computation therefore takes place after any compression
+   *   and encryption steps, if any.
+   *
    * If enabled, this allows for disabling checksumming in HDFS if only a few
    * pages need to be read.
-   **/
+   */
   4: optional i32 crc
 
   // Headers for page specific data.  One only will be set.
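
As an illustration of the clarified wording above, here is a minimal Python sketch of
the checksum. It is not taken from the commit or from any Parquet implementation.
The helper names page_crc and verify_page are hypothetical, zlib.compress merely stands
in for whatever page codec is configured, and converting the result to a signed 32-bit
value is an assumption made here because the crc field is declared as a Thrift i32.

```python
import struct
import zlib


def page_crc(page_bytes: bytes) -> int:
    """CRC32 (the GZip variant) over the page bytes as written to disk.

    `page_bytes` must exclude the page header and must already be
    compressed and/or encrypted if those steps apply.
    """
    unsigned = zlib.crc32(page_bytes) & 0xFFFFFFFF
    # Assumption: the value is stored in an i32 field, so reinterpret the
    # unsigned 32-bit checksum as a signed 32-bit integer.
    return struct.unpack("<i", struct.pack("<I", unsigned))[0]


def verify_page(page_bytes: bytes, header_crc: int) -> bool:
    """Return True if the stored CRC matches the bytes read from disk."""
    return page_crc(page_bytes) == header_crc


# The checksum is taken after compression, on the exact bytes that are stored.
values = b"example column values"
on_disk = zlib.compress(values)  # stand-in for the configured page codec
stored_crc = page_crc(on_disk)

print(verify_page(on_disk, stored_crc))  # True
print(verify_page(values, stored_crc))   # False: uncompressed bytes do not match
```

A reader would run verify_page on the bytes that follow the page header, before
decompressing or decrypting them.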
