This is an automated email from the ASF dual-hosted git repository.
apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
new 613a1cf PARQUET-2218: [Format] Clarify CRC computation (#188)
613a1cf is described below
commit 613a1cf4475c662457a0fd81a894ce4709799e3b
Author: Antoine Pitrou <[email protected]>
AuthorDate: Tue Jan 3 15:28:17 2023 +0100
PARQUET-2218: [Format] Clarify CRC computation (#188)
When trying to implement CRC computation in Parquet C++, we found the
wording to be ambiguous.
Clarify that CRC computation happens on the exact binary serialization
(instead of a long-winded and confusing elaboration about v1 and v2 data page
layout).
Also, clarify that CRC computation can apply to all page kinds, not only
data pages (for reference, parquet-mr currently supports checksumming v1
data pages as well as dictionary pages).
Also, see discussion on
https://github.com/apache/parquet-format/pull/126#issuecomment-1348081137 and
below.
---
README.md | 8 ++++----
src/main/thrift/parquet.thrift | 39 +++++++++++++++------------------------
2 files changed, 19 insertions(+), 28 deletions(-)
diff --git a/README.md b/README.md
index 99b0546..d0f654f 100644
--- a/README.md
+++ b/README.md
@@ -239,10 +239,10 @@ skip pages more efficiently. See [PageIndex.md](PageIndex.md) for details and
the reasoning behind adding these to the format.
## Checksumming
-Data pages can be individually checksummed. This allows disabling of checksums at the
-HDFS file level, to better support single row lookups. Data page checksums are calculated
-using the standard CRC32 algorithm on the compressed data of a page (not including the
-page header itself).
+Pages of all kinds can be individually checksummed. This allows disabling of checksums
+at the HDFS file level, to better support single row lookups. Checksums are calculated
+using the standard CRC32 algorithm - as used in e.g. GZip - on the serialized binary
+representation of a page (not including the page header itself).
## Error recovery
If the file metadata is corrupt, the file is lost. If the column metadata is corrupt,
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 8c4ddd0..54beb47 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -639,32 +639,23 @@ struct PageHeader {
/** Compressed (and potentially encrypted) page size in bytes, not including this header **/
3: required i32 compressed_page_size
- /** The 32bit CRC for the page, to be be calculated as follows:
- * - Using the standard CRC32 algorithm
- * - On the data only, i.e. this header should not be included. 'Data'
- * hereby refers to the concatenation of the repetition levels, the
- * definition levels and the column value, in this exact order.
- * - On the encoded versions of the repetition levels, definition levels and
- * column values
- * - On the compressed versions of the repetition levels, definition levels
- * and column values where possible;
- * - For v1 data pages, the repetition levels, definition levels and column
- * values are always compressed together. If a compression scheme is
- * specified, the CRC shall be calculated on the compressed version of
- * this concatenation. If no compression scheme is specified, the CRC
- * shall be calculated on the uncompressed version of this concatenation.
- * - For v2 data pages, the repetition levels and definition levels are
- * handled separately from the data and are never compressed (only
- * encoded). If a compression scheme is specified, the CRC shall be
- * calculated on the concatenation of the uncompressed repetition levels,
- * uncompressed definition levels and the compressed column values.
- * If no compression scheme is specified, the CRC shall be calculated on
- * the uncompressed concatenation.
- * - In encrypted columns, CRC is calculated after page encryption; the
- * encryption itself is performed after page compression (if compressed)
+ /** The 32-bit CRC checksum for the page, to be calculated as follows:
+ *
+ * - The standard CRC32 algorithm is used (with polynomial 0x04C11DB7,
+ * the same as in e.g. GZip).
+ * - All page types can have a CRC (v1 and v2 data pages, dictionary pages,
+ * etc.).
+ * - The CRC is computed on the serialization binary representation of the page
+ * (as written to disk), excluding the page header. For example, for v1
+ * data pages, the CRC is computed on the concatenation of repetition levels,
+ * definition levels and column values (optionally compressed, optionally
+ * encrypted).
+ * - The CRC computation therefore takes place after any compression
+ * and encryption steps, if any.
+ *
* If enabled, this allows for disabling checksumming in HDFS if only a few
* pages need to be read.
- **/
+ */
4: optional i32 crc
// Headers for page specific data. One only will be set.
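
For illustration only, the clarified rule can be sketched in Python. This is not taken from parquet-mr or Parquet C++; the function names (`page_crc`, `verify_page`, `write_page_body`) are hypothetical, and `zlib.compress` merely stands in for whatever codec a writer would actually use. The sketch relies on two facts: `zlib.crc32` implements the same standard CRC32 as GZip (polynomial 0x04C11DB7), and the Thrift `crc` field is a signed `i32`, so the unsigned checksum has to be reinterpreted as a signed value before it is stored.

```python
import struct
import zlib


def page_crc(page_bytes: bytes) -> int:
    """CRC32 of the page bytes as written to disk (page header excluded),
    reinterpreted as a signed 32-bit integer to fit a Thrift i32 field."""
    unsigned = zlib.crc32(page_bytes) & 0xFFFFFFFF
    return struct.unpack("<i", struct.pack("<I", unsigned))[0]


def verify_page(page_bytes: bytes, stored_crc: int) -> bool:
    """Recompute the checksum over the bytes read from disk and compare it
    with the value carried in the page header."""
    return page_crc(page_bytes) == stored_crc


def write_page_body(serialized_page: bytes, compress: bool):
    """Hypothetical writer step: compress first (if enabled), then checksum
    the exact bytes that will be written after the page header."""
    body = zlib.compress(serialized_page) if compress else serialized_page
    return body, page_crc(body)


if __name__ == "__main__":
    body, crc = write_page_body(b"rep-levels|def-levels|values", compress=True)
    print(verify_page(body, crc))
```

Because the checksum covers the bytes exactly as stored, a reader can verify a page before decompressing or decrypting it, and the same code path works for v1 data pages, v2 data pages and dictionary pages.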