This is an automated email from the ASF dual-hosted git repository.
apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-testing.git
The following commit(s) were added to refs/heads/master by this push:
new e2d244a GH-15164: Add new bloom filter example for current spec (#34)
e2d244a is described below
commit e2d244ab9a84d382e3a50f55db41f362e450428b
Author: mwish <[email protected]>
AuthorDate: Thu Jan 26 00:39:16 2023 +0800
GH-15164: Add new bloom filter example for current spec (#34)
Co-authored-by: Antoine Pitrou <[email protected]>
---
data/README.md | 16 ++++++++++++++++
data/bloom_filter.xxhash.bin | Bin 0 -> 1040 bytes
2 files changed, 16 insertions(+)
diff --git a/data/README.md b/data/README.md
index f3e3fac..b2c5128 100644
--- a/data/README.md
+++ b/data/README.md
@@ -38,6 +38,8 @@
| datapage_v1-snappy-compressed-checksum.parquet | compressed INT32 columns in v1 data pages with a matching CRC |
| datapage_v1-corrupt-checksum.parquet | uncompressed INT32 columns in v1 data pages with a mismatching CRC |
| overflow_i16_page_cnt.parquet | row group with more than INT16_MAX pages |
+| bloom_filter.bin | deprecated bloom filter binary with binary header and murmur3 hashing |
+| bloom_filter.xxhash.bin | bloom filter binary with thrift header and xxhash hashing |
TODO: Document what each file is in the table above.
@@ -101,3 +103,17 @@ The detailed structure for these files is as follows:
[ Column "a" [ Page 0 [bad crc] | Uncompressed Contents ][ Page 1 [correct crc] | Uncompressed Contents ]]
[ Column "b" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [bad crc] | Uncompressed Contents ]]
```
+
+## Bloom Filter Files
+
+Bloom filter examples have been generated by parquet-mr.
+They are not Parquet files but only contain the bloom filter header and payload.
+
+For each of `bloom_filter.bin` and `bloom_filter.xxhash.bin`, the bloom filter
+was generated by inserting the strings "hello", "parquet", "bloom", "filter".
+
+`bloom_filter.bin` uses the original Murmur3-based bloom filter format as of
+https://github.com/apache/parquet-format/commit/54839ad5e04314c944fed8aa4bc6cf15e4a58698.
+
+`bloom_filter.xxhash.bin` uses the newer xxHash-based bloom filter format as of
+https://github.com/apache/parquet-format/commit/3fb10e00c2204bf1c6cc91e094c59e84cefcee33.
diff --git a/data/bloom_filter.xxhash.bin b/data/bloom_filter.xxhash.bin
new file mode 100644
index 0000000..c98a526
Binary files /dev/null and b/data/bloom_filter.xxhash.bin differ
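
For readers unfamiliar with the payload these test files carry, the following is a minimal sketch of the split-block bloom filter bitset as described in the parquet-format bloom filter specification. It is not code from this commit or from parquet-mr. The class and function names are hypothetical, the default size of 1024 bytes is an assumption chosen for illustration, and the sketch takes already-computed 64-bit hash values (in the newer format these would come from xxHash over the value's bytes) rather than hashing strings itself. It does not parse the Thrift header.

```python
# Sketch of a split-block bloom filter bitset (names are illustrative).
# Salt constants as listed in the parquet-format bloom filter description.
SALT = (
    0x47B6137B, 0x44974D91, 0x8824AD5B, 0xA2B7289D,
    0x705495C7, 0x2DF1424B, 0x9EFC4947, 0x5C6BFB31,
)
WORDS_PER_BLOCK = 8   # each block holds eight 32-bit words
BYTES_PER_BLOCK = 32


def block_mask(key32):
    """Return eight one-bit masks, one per 32-bit word of a block."""
    # Multiply by each salt, keep the low 32 bits, use the top 5 bits as bit index.
    return [1 << (((key32 * salt) & 0xFFFFFFFF) >> 27) for salt in SALT]


class SplitBlockBloomFilter:
    def __init__(self, num_bytes=1024):
        # Assumption: the caller supplies a positive multiple of the block size.
        if num_bytes <= 0 or num_bytes % BYTES_PER_BLOCK != 0:
            raise ValueError("num_bytes must be a positive multiple of 32")
        self.num_blocks = num_bytes // BYTES_PER_BLOCK
        self.words = [0] * (self.num_blocks * WORDS_PER_BLOCK)

    def _locate(self, hash64):
        hash64 &= 0xFFFFFFFFFFFFFFFF
        # Upper 32 bits select the block; lower 32 bits select bits within it.
        block = ((hash64 >> 32) * self.num_blocks) >> 32
        return block * WORDS_PER_BLOCK, block_mask(hash64 & 0xFFFFFFFF)

    def insert(self, hash64):
        base, mask = self._locate(hash64)
        for i, bit in enumerate(mask):
            self.words[base + i] |= bit

    def might_contain(self, hash64):
        base, mask = self._locate(hash64)
        return all(self.words[base + i] & bit for i, bit in enumerate(mask))
```

A reader would compute a 64-bit hash for each inserted value, call `insert`, and later call `might_contain` with the hash of a probe value. A `False` result means the value was definitely not inserted, while `True` may be a false positive.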