This is an automated email from the ASF dual-hosted git repository.
blue pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git
The following commit(s) were added to refs/heads/master by this push:
new 3fb10e0 PARQUET-1630: Update Bloom filter format (#146)
3fb10e0 is described below
commit 3fb10e00c2204bf1c6cc91e094c59e84cefcee33
Author: Chen, Junjie <[email protected]>
AuthorDate: Tue Aug 27 07:27:32 2019 +0800
PARQUET-1630: Update Bloom filter format (#146)
---
BloomFilter.md | 18 ++++++++++++++----
doc/images/FileLayoutBloomFilter1.png | Bin 0 -> 44025 bytes
doc/images/FileLayoutBloomFilter2.png | Bin 0 -> 34018 bytes
3 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/BloomFilter.md b/BloomFilter.md
index b8208c8..2fa24e9 100644
--- a/BloomFilter.md
+++ b/BloomFilter.md
@@ -264,10 +264,13 @@ false positive rates:
| 41 | 0.001 % |
#### File Format
-The Bloom filter data of a column chunk, which contains the size of the filter in bytes, the
-algorithm, the hash function and the Bloom filter bitset, is stored near the footer. The Bloom
-filter data offset is stored in column chunk metadata. Here are Bloom filter definitions in
-thrift:
+
+Each multi-block Bloom filter is required to work for only one column chunk. The data of a multi-block
+bloom filter consists of the bloom filter header followed by the bloom filter bitset. The bloom filter
+header encodes the size of the bloom filter bit set in bytes that is used to read the bitset.
+
+Here are the Bloom filter definitions in thrift:
+
```
/** Block-based algorithm type annotation. **/
@@ -323,6 +326,13 @@ struct ColumnMetaData {
```
+The Bloom filters are grouped by row group and with data for each column in the same order as the file schema.
+The Bloom filter data can be stored before the page indexes after all row groups. The file layout looks like:
+ 
+
+Or it can be stored between row groups, the file layout looks like:
+ 
+
#### Encryption
In the case of columns with sensitive data, the Bloom filter exposes a subset of sensitive
information such as the presence of value. Therefore the Bloom filter of columns with sensitive
diff --git a/doc/images/FileLayoutBloomFilter1.png b/doc/images/FileLayoutBloomFilter1.png
new file mode 100644
index 0000000..3b21738
Binary files /dev/null and b/doc/images/FileLayoutBloomFilter1.png differ
diff --git a/doc/images/FileLayoutBloomFilter2.png b/doc/images/FileLayoutBloomFilter2.png
new file mode 100755
index 0000000..6bbf770
Binary files /dev/null and b/doc/images/FileLayoutBloomFilter2.png differ
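
Reviewer note: the layout described in the BloomFilter.md hunk above (a header that carries the bitset size in bytes, followed by the bitset, with filters grouped by row group and columns in schema order) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the parquet-format encoding: the real header is a Thrift struct, and here a 4-byte little-endian length stands in for it. The names `HEADER`, `write_filter`, `read_filter`, and `write_all` are hypothetical.

```python
import struct
from io import BytesIO

# ASSUMPTION: a 4-byte little-endian length stands in for the real
# Thrift-encoded Bloom filter header. It carries only the bitset size in bytes.
HEADER = struct.Struct("<i")


def write_filter(out, bitset: bytes) -> int:
    """Write header then bitset; return the offset where the filter starts."""
    offset = out.tell()
    out.write(HEADER.pack(len(bitset)))
    out.write(bitset)
    return offset


def write_all(out, row_groups):
    """Write filters grouped by row group, columns in schema order.

    `row_groups` is a list of row groups, each a list of bitsets (one per
    column). Returns the per-column offsets, which a writer would record
    in the column chunk metadata.
    """
    return [[write_filter(out, bitset) for bitset in columns]
            for columns in row_groups]


def read_filter(src, offset: int) -> bytes:
    """Seek to a filter, read the header, then read exactly that many bytes."""
    src.seek(offset)
    (num_bytes,) = HEADER.unpack(src.read(HEADER.size))
    bitset = src.read(num_bytes)
    if len(bitset) != num_bytes:
        raise ValueError("truncated Bloom filter bitset")
    return bitset


buf = BytesIO()
row_groups = [
    [b"\x01" * 32, b"\x02" * 64],  # row group 0: column 0, column 1
    [b"\x03" * 32, b"\x04" * 64],  # row group 1: column 0, column 1
]
offsets = write_all(buf, row_groups)
```

Because the header states the bitset length, a reader needs only the stored offset to locate and read one column chunk's filter, regardless of whether the filters sit after all row groups or between them.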