[parquet-testing] 01/01: Add test files for dictionary page crc

gangwu Mon, 06 Mar 2023 03:34:46 -0800

This is an automated email from the ASF dual-hosted git repository.

gangwu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-testing.git


commit b2e7cc755159196e3a068c8594f7acbaecfdaaac
Author: mwish <maplewish...@gmail.com>
AuthorDate: Thu Feb 23 01:06:25 2023 +0800

    Add test files for dictionary page crc
---
 data/README.md                                     |  21 +++++++++++++++++++++
 data/plain-dict-uncompressed-checksum.parquet      | Bin 0 -> 816 bytes
 data/rle-dict-snappy-checksum.parquet              | Bin 0 -> 822 bytes
 .../rle-dict-uncompressed-corrupt-checksum.parquet | Bin 0 -> 814 bytes
 4 files changed, 21 insertions(+)

diff --git a/data/README.md b/data/README.md
index dd25ade..638f0d1 100644
--- a/data/README.md
+++ b/data/README.md
@@ -41,6 +41,9 @@
 | bloom_filter.bin                               | deprecated bloom filter binary with binary header and murmur3 hashing |
 | bloom_filter.xxhash.bin                        | bloom filter binary with thrift header and xxhash hashing    |
 | nan_in_stats.parquet                           | statistics contains NaN in max, from PyArrow 0.8.0. See note below on "NaN in stats".  |
+| rle-dict-snappy-checksum.parquet                 | compressed and dictionary-encoded INT32 and STRING columns in format v2 with a matching CRC |
+| plain-dict-uncompressed-checksum.parquet         | uncompressed and dictionary-encoded INT32 and STRING columns in format v1 with a matching CRC |
+| rle-dict-uncompressed-corrupt-checksum.parquet   | uncompressed and dictionary-encoded INT32 and STRING columns in format v2 with a mismatching CRC |
 
 TODO: Document what each file is in the table above.
 
@@ -111,6 +114,24 @@ The detailed structure for these files is as follows:
  [ Column "b" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [bad crc] | Uncompressed Contents ]]
   ```
 
+The schema for the `*-dict-*-checksum.parquet` test files is:
+* `data/rle-dict-snappy-checksum.parquet`:
+  ```
+  [ Column "long_field" [ Dict Page [correct crc] | Compressed PLAIN Contents ][ Page 0 [correct crc] | Compressed RLE_DICTIONARY Contents ]]
+  [ Column "binary_field" [ Dict Page [correct crc] | Compressed PLAIN Contents ][ Page 0 [correct crc] | Compressed RLE_DICTIONARY Contents ]]
+  ```
+
+* `data/plain-dict-uncompressed-checksum.parquet`:
+  ```
+  [ Column "long_field" [ Dict Page [correct crc] | Uncompressed PLAIN_DICTIONARY(DICT) Contents ][ Page 0 [correct crc] | Uncompressed PLAIN_DICTIONARY Contents ]]
+  [ Column "binary_field" [ Dict Page [correct crc] | Uncompressed PLAIN_DICTIONARY(DICT) Contents ][ Page 0 [correct crc] | Uncompressed PLAIN_DICTIONARY Contents ]]
+  ```
+
+* `data/rle-dict-uncompressed-corrupt-checksum.parquet`:
+  ```
+  [ Column "long_field" [ Dict Page [bad crc] | Uncompressed PLAIN Contents ][ Page 0 [correct crc] | Uncompressed RLE_DICTIONARY Contents ]]
+  [ Column "binary_field" [ Dict Page [bad crc] | Uncompressed PLAIN Contents ][ Page 0 [correct crc] | Uncompressed RLE_DICTIONARY Contents ]]
+  ```
 ## Bloom Filter Files
 
 Bloom filter examples have been generated by parquet-mr.
diff --git a/data/plain-dict-uncompressed-checksum.parquet b/data/plain-dict-uncompressed-checksum.parquet
new file mode 100644
index 0000000..f49f1c4
Binary files /dev/null and b/data/plain-dict-uncompressed-checksum.parquet differ
diff --git a/data/rle-dict-snappy-checksum.parquet b/data/rle-dict-snappy-checksum.parquet
new file mode 100644
index 0000000..4c183d8
Binary files /dev/null and b/data/rle-dict-snappy-checksum.parquet differ
diff --git a/data/rle-dict-uncompressed-corrupt-checksum.parquet b/data/rle-dict-uncompressed-corrupt-checksum.parquet
new file mode 100644
index 0000000..20e23aa
Binary files /dev/null and b/data/rle-dict-uncompressed-corrupt-checksum.parquet differ
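
For readers who want to see what "correct crc" and "bad crc" mean for the pages listed in the README hunk above, here is a minimal sketch of the check. It is not part of this commit and not the code used to generate the files. It assumes the page CRC is a standard CRC-32 (as used by gzip) over the page bytes as written to disk, excluding the page header, and that the stored value is a signed 32-bit Thrift integer. The helper names `to_signed_i32` and `page_crc_matches` and the sample payload are hypothetical.

```python
import zlib


def to_signed_i32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed one (Thrift i32 style)."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def page_crc_matches(page_bytes: bytes, stored_crc: int) -> bool:
    """Compare a stored page CRC against a freshly computed CRC-32.

    page_bytes: the page body as stored on disk (assumption: header excluded).
    stored_crc: the CRC read from the page header; may be negative because
                it is stored as a signed 32-bit integer.
    """
    computed = zlib.crc32(page_bytes) & 0xFFFFFFFF
    return computed == (stored_crc & 0xFFFFFFFF)


# Hypothetical stand-in for a dictionary page body; real bytes would come
# from a Parquet reader, which is out of scope for this sketch.
payload = b"example dictionary page bytes"
good_crc = to_signed_i32(zlib.crc32(payload))
bad_crc = to_signed_i32(zlib.crc32(payload) ^ 0x1)  # flip one bit to corrupt

print(page_crc_matches(payload, good_crc))  # matching CRC
print(page_crc_matches(payload, bad_crc))   # mismatching CRC
```

Under these assumptions, a reader with checksum verification enabled would be expected to accept both pages of each column in the two `*-checksum.parquet` files with matching CRCs and to reject the dictionary pages of `rle-dict-uncompressed-corrupt-checksum.parquet`.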
