This is an automated email from the ASF dual-hosted git repository.
apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-testing.git
The following commit(s) were added to refs/heads/master by this push:
new 3510fa8 ARROW-17904: [Parquet] Add data files with checksums on datapage v1 (#29)
3510fa8 is described below
commit 3510fa8d34ba577f36f399d4642f9e1ccdf18b30
Author: mwish <[email protected]>
AuthorDate: Mon Dec 5 23:20:57 2022 +0800
ARROW-17904: [Parquet] Add data files with checksums on datapage v1 (#29)
---
data/README.md | 33 +++++++++++++++++++++
data/datapage_v1-corrupt-checksum.parquet | Bin 0 -> 41421 bytes
.../datapage_v1-snappy-compressed-checksum.parquet | Bin 0 -> 3380 bytes
data/datapage_v1-uncompressed-checksum.parquet | Bin 0 -> 41421 bytes
4 files changed, 33 insertions(+)
diff --git a/data/README.md b/data/README.md
index 34d60ec..398a88c 100644
--- a/data/README.md
+++ b/data/README.md
@@ -32,6 +32,9 @@
| alltypes_tiny_pages.parquet | small page sizes with dictionary encoding with page index from [impala](https://github.com/apache/impala/tree/master/testdata/data/alltypes_tiny_pages.parquet). |
| alltypes_tiny_pages_plain.parquet | small page sizes with plain encoding with page index [impala](https://github.com/apache/impala/tree/master/testdata/data/alltypes_tiny_pages.parquet). |
| rle_boolean_encoding.parquet | option boolean columns with RLE encoding |
+| datapage_v1-uncompressed-checksum.parquet | uncompressed INT32 columns in v1 data pages with a matching CRC |
+| datapage_v1-snappy-compressed-checksum.parquet | compressed INT32 columns in v1 data pages with a matching CRC |
+| datapage_v1-corrupt-checksum.parquet | uncompressed INT32 columns in v1 data pages with a mismatching CRC |
TODO: Document what each file is in the table above.
@@ -65,3 +68,33 @@ A sample that reads and checks these files can be found at the following tests:
cpp/src/parquet/encryption-read-configurations-test.cc
cpp/src/parquet/test-encryption-util.h
```
+
+## Checksum Files
+
+The schema for the `datapage_v1-*-checksum.parquet` test files is:
+```
+message m {
+ required int32 a;
+ required int32 b;
+}
+```
+
+The detailed structure for these files is as follows:
+
+* `data/datapage_v1-uncompressed-checksum.parquet`:
+ ```
+ [ Column "a" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [correct crc] | Uncompressed Contents ]]
+ [ Column "b" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [correct crc] | Uncompressed Contents ]]
+ ```
+
+* `data/datapage_v1-snappy-compressed-checksum.parquet`:
+ ```
+ [ Column "a" [ Page 0 [correct crc] | Snappy Contents ][ Page 1 [correct crc] | Snappy Contents ]]
+ [ Column "b" [ Page 0 [correct crc] | Snappy Contents ][ Page 1 [correct crc] | Snappy Contents ]]
+ ```
+
+* `data/datapage_v1-corrupt-checksum.parquet`:
+ ```
+ [ Column "a" [ Page 0 [bad crc] | Uncompressed Contents ][ Page 1 [correct crc] | Uncompressed Contents ]]
+ [ Column "b" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [bad crc] | Uncompressed Contents ]]
+ ```
diff --git a/data/datapage_v1-corrupt-checksum.parquet b/data/datapage_v1-corrupt-checksum.parquet
new file mode 100644
index 0000000..d832edc
Binary files /dev/null and b/data/datapage_v1-corrupt-checksum.parquet differ
diff --git a/data/datapage_v1-snappy-compressed-checksum.parquet b/data/datapage_v1-snappy-compressed-checksum.parquet
new file mode 100644
index 0000000..8fe2c86
Binary files /dev/null and b/data/datapage_v1-snappy-compressed-checksum.parquet differ
diff --git a/data/datapage_v1-uncompressed-checksum.parquet b/data/datapage_v1-uncompressed-checksum.parquet
new file mode 100644
index 0000000..78044f0
Binary files /dev/null and b/data/datapage_v1-uncompressed-checksum.parquet differ
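
The page layouts documented in the README hunk above can be illustrated with a small sketch of the check a reader might perform on each page. This is not code from the commit or from any Parquet library. It assumes the page CRC is the standard CRC-32 (the gzip polynomial, available as `zlib.crc32`) computed over the page bytes as stored, and that the header value may arrive as a signed 32-bit integer. The names `page_crc_matches`, `make_page`, `bad_pages`, and the byte strings are hypothetical stand-ins for real page contents.

```python
import zlib


def page_crc_matches(page_bytes: bytes, stored_crc: int) -> bool:
    """Return True when the CRC-32 of the page bytes equals the stored value."""
    # Assumption: the stored value may be a signed 32-bit integer, so both
    # sides are normalized to unsigned before comparing.
    return (zlib.crc32(page_bytes) & 0xFFFFFFFF) == (stored_crc & 0xFFFFFFFF)


def make_page(payload: bytes, corrupt: bool = False) -> tuple[bytes, int]:
    """Build a hypothetical (payload, crc) pair; flip one bit to corrupt it."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    if corrupt:
        crc ^= 0x1  # any change to the stored value produces a mismatch
    return payload, crc


# Mirrors the documented layout of datapage_v1-corrupt-checksum.parquet:
# column "a" has a bad CRC on page 0, column "b" has a bad CRC on page 1.
corrupt_layout = {
    "a": [make_page(b"a-page-0", corrupt=True), make_page(b"a-page-1")],
    "b": [make_page(b"b-page-0"), make_page(b"b-page-1", corrupt=True)],
}


def bad_pages(layout: dict[str, list[tuple[bytes, int]]]) -> list[tuple[str, int]]:
    """List (column, page_index) for every page whose CRC does not match."""
    return [
        (column, index)
        for column, pages in layout.items()
        for index, (payload, crc) in enumerate(pages)
        if not page_crc_matches(payload, crc)
    ]


print(bad_pages(corrupt_layout))
```

A reader test against the real files would replace the hypothetical payloads with the page bytes and header CRC values decoded from each file, and would expect no mismatches for the two `correct crc` files.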