This is an automated email from the ASF dual-hosted git repository.

apitrou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-testing.git


The following commit(s) were added to refs/heads/master by this push:
     new e2d244a  GH-15164: Add new bloom filter example for current spec (#34)
e2d244a is described below

commit e2d244ab9a84d382e3a50f55db41f362e450428b
Author: mwish <[email protected]>
AuthorDate: Thu Jan 26 00:39:16 2023 +0800

    GH-15164: Add new bloom filter example for current spec (#34)
    
    Co-authored-by: Antoine Pitrou <[email protected]>
---
 data/README.md               |  16 ++++++++++++++++
 data/bloom_filter.xxhash.bin | Bin 0 -> 1040 bytes
 2 files changed, 16 insertions(+)

diff --git a/data/README.md b/data/README.md
index f3e3fac..b2c5128 100644
--- a/data/README.md
+++ b/data/README.md
@@ -38,6 +38,8 @@
 | datapage_v1-snappy-compressed-checksum.parquet | compressed INT32 columns in v1 data pages with a matching CRC          |
 | datapage_v1-corrupt-checksum.parquet           | uncompressed INT32 columns in v1 data pages with a mismatching CRC     |
 | overflow_i16_page_cnt.parquet                  | row group with more than INT16_MAX pages                   |
+| bloom_filter.bin                               | deprecated bloom filter binary with binary header and murmur3 hashing |
+| bloom_filter.xxhash.bin                        | bloom filter binary with thrift header and xxhash hashing    |
 
 TODO: Document what each file is in the table above.
 
@@ -101,3 +103,17 @@ The detailed structure for these files is as follows:
   [ Column "a" [ Page 0 [bad crc] | Uncompressed Contents ][ Page 1 [correct crc] | Uncompressed Contents ]]
   [ Column "b" [ Page 0 [correct crc] | Uncompressed Contents ][ Page 1 [bad crc] | Uncompressed Contents ]]
   ```
+
+## Bloom Filter Files
+
+Bloom filter examples have been generated by parquet-mr.
+They are not Parquet files but only contain the bloom filter header and payload.
+
+For each of `bloom_filter.bin` and `bloom_filter.xxhash.bin`, the bloom filter
+was generated by inserting the strings "hello", "parquet", "bloom", "filter".
+
+`bloom_filter.bin` uses the original Murmur3-based bloom filter format as of
+https://github.com/apache/parquet-format/commit/54839ad5e04314c944fed8aa4bc6cf15e4a58698.
+
+`bloom_filter.xxhash.bin` uses the newer xxHash-based bloom filter format as of
+https://github.com/apache/parquet-format/commit/3fb10e00c2204bf1c6cc91e094c59e84cefcee33.
diff --git a/data/bloom_filter.xxhash.bin b/data/bloom_filter.xxhash.bin
new file mode 100644
index 0000000..c98a526
Binary files /dev/null and b/data/bloom_filter.xxhash.bin differ
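
For readers who want to exercise the payload described in the README hunk above, here is a minimal sketch of the split-block bloom filter insert and lookup steps as laid out in the parquet-format specification. It is not the parquet-mr code that generated the files. The function names, the 1024-byte bitset size, and the little-endian word order are assumptions made for this illustration, and the caller is expected to supply an already computed 64-bit hash (xxHash64 with seed 0 in the current spec).

```python
# Illustrative sketch only: names and sizes below are assumptions, not parquet-mr API.

# Salt constants from the parquet-format split-block bloom filter description.
SALT = (0x47B6137B, 0x44974D91, 0x8824AD5B, 0xA2B7289D,
        0x705495C7, 0x2DF1424B, 0x9EFC4947, 0x5C6BFB31)
BYTES_PER_BLOCK = 32  # 8 words of 32 bits each


def block_masks(low32):
    """Return one single-bit mask per 32-bit word of a block."""
    return [1 << (((low32 * s) & 0xFFFFFFFF) >> 27) for s in SALT]


def block_index(hash64, num_blocks):
    """Pick a block from the upper 32 bits of the hash."""
    return ((hash64 >> 32) * num_blocks) >> 32


def insert(bitset, hash64):
    """Set the 8 bits selected by hash64 in a mutable bitset (bytearray)."""
    base = block_index(hash64, len(bitset) // BYTES_PER_BLOCK) * BYTES_PER_BLOCK
    for word, mask in enumerate(block_masks(hash64 & 0xFFFFFFFF)):
        off = base + 4 * word
        # Assumption: words are stored little-endian.
        value = int.from_bytes(bitset[off:off + 4], "little") | mask
        bitset[off:off + 4] = value.to_bytes(4, "little")


def might_contain(bitset, hash64):
    """Return False if hash64 was definitely never inserted, True otherwise."""
    base = block_index(hash64, len(bitset) // BYTES_PER_BLOCK) * BYTES_PER_BLOCK
    for word, mask in enumerate(block_masks(hash64 & 0xFFFFFFFF)):
        off = base + 4 * word
        if not int.from_bytes(bitset[off:off + 4], "little") & mask:
            return False
    return True


bitset = bytearray(1024)            # arbitrary size chosen for this sketch
insert(bitset, 0x0123456789ABCDEF)  # placeholder hash, not a real xxHash value
print(might_contain(bitset, 0x0123456789ABCDEF))  # True
```

To check one of the inserted strings against `bloom_filter.xxhash.bin`, a reader would still need to decode the Thrift header to locate the payload and hash the string's bytes with an xxHash64 implementation. Both steps are outside this sketch.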
