This is an automated email from the ASF dual-hosted git repository.
gabor pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
new db23fe3 PARQUET-1539: Clarify CRC checksum in page header (#126)
db23fe3 is described below
commit db23fe3b7a141ee6b0903af089cbc2bc22a43f97
Author: Boudewijn Braams <[email protected]>
AuthorDate: Tue Mar 5 14:26:48 2019 +0100
PARQUET-1539: Clarify CRC checksum in page header (#126)
---
README.md | 4 +++-
src/main/thrift/parquet.thrift | 25 +++++++++++++++++++++++--
2 files changed, 26 insertions(+), 3 deletions(-)
diff --git a/README.md b/README.md
index c759be9..01193ae 100644
--- a/README.md
+++ b/README.md
@@ -195,7 +195,9 @@ the reasoning behind adding these to the format.
## Checksumming
Data pages can be individually checksummed. This allows disabling of checksums at the
-HDFS file level, to better support single row lookups.
+HDFS file level, to better support single row lookups. Data page checksums are calculated
+using the standard CRC32 algorithm on the compressed data of a page (not including the
+page header itself).
## Error recovery
If the file metadata is corrupt, the file is lost. If the column metadata is corrupt,
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 7a29b80..4272cc3 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -604,8 +604,29 @@ struct PageHeader {
/** Compressed page size in bytes (not including this header) **/
3: required i32 compressed_page_size
- /** 32bit crc for the data below. This allows for disabling checksumming in HDFS
- * if only a few pages needs to be read
+ /** The 32bit CRC for the page, to be calculated as follows:
+ * - Using the standard CRC32 algorithm
+ * - On the data only, i.e. this header should not be included. 'Data'
+ * hereby refers to the concatenation of the repetition levels, the
+ * definition levels and the column value, in this exact order.
+ * - On the encoded versions of the repetition levels, definition levels and
+ * column values
+ * - On the compressed versions of the repetition levels, definition levels
+ * and column values where possible;
+ * - For v1 data pages, the repetition levels, definition levels and column
+ * values are always compressed together. If a compression scheme is
+ * specified, the CRC shall be calculated on the compressed version of
+ * this concatenation. If no compression scheme is specified, the CRC
+ * shall be calculated on the uncompressed version of this concatenation.
+ * - For v2 data pages, the repetition levels and definition levels are
+ * handled separately from the data and are never compressed (only
+ * encoded). If a compression scheme is specified, the CRC shall be
+ * calculated on the concatenation of the uncompressed repetition levels,
+ * uncompressed definition levels and the compressed column values.
+ * If no compression scheme is specified, the CRC shall be calculated on
+ * the uncompressed concatenation.
+ * If enabled, this allows for disabling checksumming in HDFS if only a few
+ * pages need to be read.
**/
4: optional i32 crc
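
For readers who want to see the rule from the new comment in concrete form, here is a minimal Python sketch. It is not taken from any Parquet implementation: the function names (`page_crc_v1`, `page_crc_v2`, `to_thrift_i32`) are hypothetical, and it assumes that "the standard CRC32 algorithm" is the one exposed by Python's `zlib.crc32`. The signed conversion is likewise an assumption about how an unsigned 32-bit checksum would be fitted into the `i32 crc` field.

```python
import zlib


def page_crc_v1(page_body: bytes) -> int:
    """CRC32 of a v1 data page body (hypothetical helper).

    page_body is the concatenation of repetition levels, definition levels
    and column values exactly as stored after the header: compressed as a
    whole if a codec is specified, uncompressed otherwise.
    """
    return zlib.crc32(page_body) & 0xFFFFFFFF


def page_crc_v2(rep_levels: bytes, def_levels: bytes, values: bytes) -> int:
    """CRC32 of a v2 data page body (hypothetical helper).

    rep_levels and def_levels are encoded but never compressed; values is
    compressed only if a codec is specified. The three parts are fed to the
    CRC in this exact order, which is equivalent to hashing their
    concatenation.
    """
    crc = zlib.crc32(rep_levels)
    crc = zlib.crc32(def_levels, crc)
    crc = zlib.crc32(values, crc)
    return crc & 0xFFFFFFFF


def to_thrift_i32(crc: int) -> int:
    """Reinterpret an unsigned 32-bit CRC as a signed 32-bit integer.

    Assumption: the checksum is stored bit-for-bit in the signed i32 field.
    """
    return crc - (1 << 32) if crc >= (1 << 31) else crc
```

In both cases the page header bytes are excluded from the checksum. A reader would compute the same value over the bytes it reads after the header and compare it with the stored `crc` before decompressing or decoding the page.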