This is an automated email from the ASF dual-hosted git repository.

gangwu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-testing.git

commit b2e7cc755159196e3a068c8594f7acbaecfdaaac
Author: mwish <maplewish...@gmail.com>
AuthorDate: Thu Feb 23 01:06:25 2023 +0800

    Add test files for dictionary page crc
---
 data/README.md                                     |  21 +++++++++++++++++++++
 data/plain-dict-uncompressed-checksum.parquet      | Bin 0 -> 816 bytes
 data/rle-dict-snappy-checksum.parquet              | Bin 0 -> 822 bytes
 .../rle-dict-uncompressed-corrupt-checksum.parquet | Bin 0 -> 814 bytes
 4 files changed, 21 insertions(+)

diff --git a/data/README.md b/data/README.md
index dd25ade..638f0d1 100644
--- a/data/README.md
+++ b/data/README.md
@@ -41,6 +41,9 @@
 | bloom_filter.bin | deprecated bloom filter binary with binary header and murmur3 hashing |
 | bloom_filter.xxhash.bin | bloom filter binary with thrift header and xxhash hashing |
 | nan_in_stats.parquet | statistics contains NaN in max, from PyArrow 0.8.0. See note below on "NaN in stats". |
+| rle-dict-snappy-checksum.parquet | compressed and dictionary-encoded INT32 and STRING columns in format v2 with a matching CRC |
+| plain-dict-uncompressed-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v1 with a matching CRC |
+| rle-dict-uncompressed-corrupt-checksum.parquet | uncompressed and dictionary-encoded INT32 and STRING columns in format v2 with a mismatching CRC |
 
 TODO: Document what each file is in the table above.
 
@@ -111,6 +114,24 @@ The detailed structure for these files is as follows:
   [ Column "b" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [bad crc] | Uncompressed Contents ]]
   ```
 
+The schema for the `*-dict-*-checksum.parquet` test files is:
+* `data/rle-dict-snappy-checksum.parquet`:
+  ```
+  [ Column "long_field" [ Dict Page [correct crc] | Compressed PLAIN Contents ][ Page 0 [correct crc] | Compressed RLE_DICTIONARY Contents ]]
+  [ Column "binary_field" [ Dict Page [correct crc] | Compressed PLAIN Contents ][ Page 0 [correct crc] | Compressed RLE_DICTIONARY Contents ]]
+  ```
+
+* `data/plain-dict-uncompressed-checksum.parquet`:
+  ```
+  [ Column "long_field" [ Dict Page [correct crc] | Uncompressed PLAIN_DICTIONARY(DICT) Contents ][ Page 0 [correct crc] | Uncompressed PLAIN_DICTIONARY Contents ]]
+  [ Column "binary_field" [ Dict Page [correct crc] | Uncompressed PLAIN_DICTIONARY(DICT) Contents ][ Page 0 [correct crc] | Uncompressed PLAIN_DICTIONARY Contents ]]
+  ```
+
+* `data/rle-dict-uncompressed-corrupt-checksum.parquet`:
+  ```
+  [ Column "long_field" [ Dict Page [bad crc] | Uncompressed PLAIN Contents ][ Page 0 [correct crc] | Uncompressed RLE_DICTIONARY Contents ]]
+  [ Column "binary_field" [ Dict Page [bad crc] | Uncompressed PLAIN Contents ][ Page 0 [correct crc] | Uncompressed RLE_DICTIONARY Contents ]]
+  ```
 ## Bloom Filter Files
 
 Bloom filter examples have been generated by parquet-mr.
diff --git a/data/plain-dict-uncompressed-checksum.parquet b/data/plain-dict-uncompressed-checksum.parquet
new file mode 100644
index 0000000..f49f1c4
Binary files /dev/null and b/data/plain-dict-uncompressed-checksum.parquet differ
diff --git a/data/rle-dict-snappy-checksum.parquet b/data/rle-dict-snappy-checksum.parquet
new file mode 100644
index 0000000..4c183d8
Binary files /dev/null and b/data/rle-dict-snappy-checksum.parquet differ
diff --git a/data/rle-dict-uncompressed-corrupt-checksum.parquet b/data/rle-dict-uncompressed-corrupt-checksum.parquet
new file mode 100644
index 0000000..20e23aa
Binary files /dev/null and b/data/rle-dict-uncompressed-corrupt-checksum.parquet differ
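
For readers who want to exercise these files, the "correct crc" / "bad crc" distinction in the README text above could be checked roughly as follows. This is only a sketch, not the code used to generate or validate the files: it assumes the page CRC is a standard CRC32 (as in gzip) over the page bytes as written, excluding the page header, and that the stored value is a signed 32-bit integer. The helper names `to_signed_i32` and `page_crc_matches` and the sample payload are hypothetical; extracting real page bytes from a Parquet file is left out.

```python
import zlib


def to_signed_i32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    return value - (1 << 32) if value >= (1 << 31) else value


def page_crc_matches(page_bytes: bytes, stored_crc: int) -> bool:
    """Compare a stored signed CRC with the CRC32 of the page bytes.

    Assumption: page_bytes is the page payload exactly as written on disk
    (compressed if the column chunk is compressed), without the page header.
    """
    computed = zlib.crc32(page_bytes) & 0xFFFFFFFF
    return to_signed_i32(computed) == stored_crc


if __name__ == "__main__":
    # Hypothetical payload standing in for a dictionary page body.
    payload = b"example dictionary page bytes"
    good_crc = to_signed_i32(zlib.crc32(payload))
    print(page_crc_matches(payload, good_crc))      # matching CRC
    print(page_crc_matches(payload, good_crc ^ 1))  # corrupted CRC
```

Under those assumptions, a reader would be expected to accept the dictionary pages in the two `*-checksum.parquet` files with matching CRCs and to reject the dictionary pages in `rle-dict-uncompressed-corrupt-checksum.parquet` when checksum verification is enabled.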