This is an automated email from the ASF dual-hosted git repository.

apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-testing.git


The following commit(s) were added to refs/heads/master by this push:
     new 3510fa8  ARROW-17904: [Parquet] Add data files with checksums on datapage v1 (#29)
3510fa8 is described below

commit 3510fa8d34ba577f36f399d4642f9e1ccdf18b30
Author: mwish <[email protected]>
AuthorDate: Mon Dec 5 23:20:57 2022 +0800

    ARROW-17904: [Parquet] Add data files with checksums on datapage v1 (#29)
---
 data/README.md                                     |  33 +++++++++++++++++++++
 data/datapage_v1-corrupt-checksum.parquet          | Bin 0 -> 41421 bytes
 .../datapage_v1-snappy-compressed-checksum.parquet | Bin 0 -> 3380 bytes
 data/datapage_v1-uncompressed-checksum.parquet     | Bin 0 -> 41421 bytes
 4 files changed, 33 insertions(+)

diff --git a/data/README.md b/data/README.md
index 34d60ec..398a88c 100644
--- a/data/README.md
+++ b/data/README.md
@@ -32,6 +32,9 @@
 | alltypes_tiny_pages.parquet             | small page sizes with dictionary encoding with page index from [impala](https://github.com/apache/impala/tree/master/testdata/data/alltypes_tiny_pages.parquet). |
 | alltypes_tiny_pages_plain.parquet       | small page sizes with plain encoding with page index [impala](https://github.com/apache/impala/tree/master/testdata/data/alltypes_tiny_pages.parquet).            |
 | rle_boolean_encoding.parquet            | option boolean columns with RLE encoding                                                                                                                         |
+| datapage_v1-uncompressed-checksum.parquet      | uncompressed INT32 columns in v1 data pages with a matching CRC        |
+| datapage_v1-snappy-compressed-checksum.parquet | compressed INT32 columns in v1 data pages with a matching CRC          |
+| datapage_v1-corrupt-checksum.parquet           | uncompressed INT32 columns in v1 data pages with a mismatching CRC     |
 
 TODO: Document what each file is in the table above.
 
@@ -65,3 +68,33 @@ A sample that reads and checks these files can be found at the following tests:
 cpp/src/parquet/encryption-read-configurations-test.cc
 cpp/src/parquet/test-encryption-util.h
 ```
+
+## Checksum Files
+
+The schema for the `datapage_v1-*-checksum.parquet` test files is:
+```
+message m {
+    required int32 a;
+    required int32 b;
+} 
+```
+
+The detailed structure for these files is as follows:
+
+* `data/datapage_v1-uncompressed-checksum.parquet`:
+  ```
+  [ Column "a" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [correct crc] | Uncompressed Contents ]]
+  [ Column "b" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [correct crc] | Uncompressed Contents ]]
+  ```
+
+* `data/datapage_v1-snappy-compressed-checksum.parquet`:
+  ```
+  [ Column "a" [ Page 0 [correct crc] | Snappy Contents ][ Page 1 [correct crc] | Snappy Contents ]]
+  [ Column "b" [ Page 0 [correct crc] | Snappy Contents ][ Page 1 [correct crc] | Snappy Contents ]]
+  ```
+
+* `data/datapage_v1-corrupt-checksum.parquet`:
+  ```
+  [ Column "a" [ Page 0 [bad crc] | Uncompressed Contents ][ Page 1 [correct crc] | Uncompressed Contents ]]
+  [ Column "b" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [bad crc] | Uncompressed Contents ]]
+  ```
diff --git a/data/datapage_v1-corrupt-checksum.parquet b/data/datapage_v1-corrupt-checksum.parquet
new file mode 100644
index 0000000..d832edc
Binary files /dev/null and b/data/datapage_v1-corrupt-checksum.parquet differ
diff --git a/data/datapage_v1-snappy-compressed-checksum.parquet b/data/datapage_v1-snappy-compressed-checksum.parquet
new file mode 100644
index 0000000..8fe2c86
Binary files /dev/null and b/data/datapage_v1-snappy-compressed-checksum.parquet differ
diff --git a/data/datapage_v1-uncompressed-checksum.parquet b/data/datapage_v1-uncompressed-checksum.parquet
new file mode 100644
index 0000000..78044f0
Binary files /dev/null and b/data/datapage_v1-uncompressed-checksum.parquet differ
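
Note on the page checksums described in the README change above: a minimal sketch of the comparison a reader might perform for each page, assuming the page CRC is a standard CRC-32 (the GZip/zlib variant) computed over the page bytes as stored, and that the stored value may surface as a signed 32-bit integer. The helper name `page_crc_matches` and the sample payload are hypothetical and are not taken from the commit or from any Parquet library.

```python
import zlib


def page_crc_matches(page_bytes: bytes, stored_crc: int) -> bool:
    """Return True if the CRC-32 of page_bytes equals stored_crc.

    stored_crc may be given as a signed 32-bit value (an assumption about
    how the header field is surfaced), so it is masked to unsigned first.
    """
    return zlib.crc32(page_bytes) == (stored_crc & 0xFFFFFFFF)


# Hypothetical page payload: two little-endian INT32 values (1 and 2).
payload = (1).to_bytes(4, "little") + (2).to_bytes(4, "little")
good_crc = zlib.crc32(payload)

# A "correct crc" page passes; flipping one bit of the CRC makes it fail,
# which mirrors the "bad crc" pages in the corrupt test file.
print(page_crc_matches(payload, good_crc))
print(page_crc_matches(payload, good_crc ^ 1))
```

For the Snappy file, the same comparison would apply to the compressed page bytes under the assumption stated above.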
