[ 
https://issues.apache.org/jira/browse/PARQUET-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16647243#comment-16647243
 ] 

ASF GitHub Bot commented on PARQUET-41:
---------------------------------------

majetideepak closed pull request #112: PARQUET-41: Add Bloom filter
URL: https://github.com/apache/parquet-format/pull/112
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/BloomFilter.md b/BloomFilter.md
new file mode 100644
index 00000000..be27aefb
--- /dev/null
+++ b/BloomFilter.md
@@ -0,0 +1,120 @@
+ <!--
+  - Licensed to the Apache Software Foundation (ASF) under one
+  - or more contributor license agreements.  See the NOTICE file
+  - distributed with this work for additional information
+  - regarding copyright ownership.  The ASF licenses this file
+  - to you under the Apache License, Version 2.0 (the
+  - "License"); you may not use this file except in compliance
+  - with the License.  You may obtain a copy of the License at
+  -
+  -   http://www.apache.org/licenses/LICENSE-2.0
+  -
+  - Unless required by applicable law or agreed to in writing,
+  - software distributed under the License is distributed on an
+  - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  - KIND, either express or implied.  See the License for the
+  - specific language governing permissions and limitations
+  - under the License.
+  -->
+
+Parquet Bloom Filter
+===
+### Problem statement
+In their current format, column statistics and dictionaries can be used for 
predicate
+pushdown. Statistics include minimum and maximum value, which can be used to 
filter out
+values not in the range. Dictionaries are more specific, and readers can 
filter out values
+that are between min and max but not in the dictionary. However, when there 
are too many
+distinct values, writers sometimes choose not to add dictionaries because of 
the extra
+space they occupy. This leaves columns with large cardinalities and widely 
separated min
+and max without support for predicate pushdown.
+
+A Bloom filter[1] is a compact data structure that overapproximates a set. It 
can respond
+to membership queries with either "definitely no" or "probably yes", where the 
probability
+of false positives is configured when the filter is initialized. Bloom filters 
do not have
+false negatives.
+
+Because Bloom filters are small compared to dictionaries, they can be used for 
predicate
+pushdown even in columns with high cardinality and when space is at a premium.
+
+### Goal
+* Enable predicate pushdown for high-cardinality columns while using less 
space than
+  dictionaries.
+
+* Induce no additional I/O overhead when executing queries on columns without 
Bloom
+  filters attached or when executing non-selective queries.
+
+### Technical Approach
+The initial Bloom filter algorithm in Parquet is implemented using a 
combination of two
+Bloom filter techniques.
+
+First, the block Bloom filter algorithm from Putze et al.'s "Cache-, Hash- and
+Space-Efficient Bloom filters"[2] is used. This divides a filter into many 
tiny Bloom
+filters, each one of which is called a "block". In Parquet's initial 
implementation, each
+block is 256 bits. When inserting or finding a value, part of the hash of that 
value is
+used to index into the array of blocks and pick a single one. This single 
block is then
+used for the remaining part of the operation.
+
+Second, within each block, this implementation uses the folklore split Bloom 
filter
+technique, as described in section 2.1 of "Network Applications of Bloom 
Filters: A
+Survey"[5]. This divides the 256 bits in each block up into eight contiguous 
32-bit lanes
+and sets or checks one bit in each lane.
+
+#### Algorithm
+In the initial algorithm, the most significant 32 bits from the hash value are 
used as the
+index to select a block from bitset. The lower 32 bits of the hash value, 
along with eight
+constant salt values, are used to compute the bit to set in each lane of the 
block. The
+salt and lower 32 bits are combined using the multiply-shift[3] hash function:
+
+```c
+// 8 SALT values used to compute bit pattern
+static const uint32_t SALT[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, 
0xa2b7289dU,
+  0x705495c7U, 0x2df1424bU, 0x9efc4947U, 0x5c6bfb31U};
+
+// key: the lower 32 bits of hash result
+// mask: the output bit pattern for a tiny Bloom filter
+void Mask(uint32_t key, uint32_t mask[8]) {
+  for (int i = 0; i < 8; ++i) {
+    mask[i] = key * SALT[i];
+  }
+  for (int i = 0; i < 8; ++i) {
+    mask[i] = mask[i] >> 27;
+  }
+  for (int i = 0; i < 8; ++i) {
+    mask[i] = UINT32_C(1) << mask[i];
+  }
+}
+
+```
+
+#### Hash Function
+The function used to hash values in the initial implementation is 
MurmurHash3[4], using
+the least-significant 64 bits of the 128-bit version of the function on the 
x86-64
+platform. Note that the function produces different values on different 
architectures, so
+implementors must be careful to use the version specific to x86-64. That 
function can be
+emulated on different platforms without difficulty.
+
+#### Build a Bloom filter
+The fact that exactly eight bits are checked during each lookup means that 
these filters
+are most space efficient when used with an expected false positive rate of 
about
+0.5%. This is achieved when there are about 11.54 bits for every distinct 
value inserted
+into the filter.
+
+To calculate the size the filter should be for another false positive rate 
`p`, use the
+following formula. The output is in bits per distinct element:
+
+```c
+-8 / log(1 - pow(p, 1.0 / 8));
+```
+
+#### File Format
+The Bloom filter data of a column is stored at the beginning of its column 
chunk in the
+row group. The column chunk metadata contains the Bloom filter offset. The 
Bloom filter is
+stored with a header containing the size of the filter in bytes, the 
algorithm, and the
+hash function.
+
+### Reference
+1. [Bloom filter introduction at 
Wiki](https://en.wikipedia.org/wiki/Bloom_filter)
+2. [Cache-, Hash- and Space-Efficient Bloom 
Filters](http://algo2.iti.kit.edu/documents/cacheefficientbloomfilters-jea.pdf)
+3. [A Reliable Randomized Algorithm for the Closest-Pair 
Problem](http://www.diku.dk/~jyrki/Paper/CP-11.4.1997.ps)
+4. [Murmur Hash at Wiki](https://en.wikipedia.org/wiki/MurmurHash)
+5. [Network Applications of Bloom Filters: A 
Survey](https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf)
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 6c9011b9..378aa473 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -475,6 +475,7 @@ enum PageType {
   INDEX_PAGE = 1;
   DICTIONARY_PAGE = 2;
   DATA_PAGE_V2 = 3;
+  BLOOM_FILTER_PAGE = 4;
 }
 
 /**
@@ -554,6 +555,38 @@ struct DataPageHeaderV2 {
   8: optional Statistics statistics;
 }
 
+/** Block-based algorithm type annotation. **/
+struct SplitBlockAlgorithm {}
+/** The algorithm used in Bloom filter. **/
+union BloomFilterAlgorithm {
+  /** Block-based Bloom filter. **/
+  1: SplitBlockAlgorithm BLOCK;
+}
+/** Hash strategy type annotation. It uses Murmur3Hash_x64_128 from the 
original SMHasher
+ * repo by Austin Appleby.
+ **/
+struct Murmur3 {}
+/** 
+ * The hash function used in Bloom filter. This function takes the hash of a 
column value
+ * using plain encoding.
+ **/
+union BloomFilterHash {
+  /** Murmur3 Hash Strategy. **/
+  1: Murmur3 MURMUR3;
+}
+/**
+  * Bloom filter header is stored at beginning of Bloom filter data of each 
column
+  * and followed by its bitset.
+  **/
+struct BloomFilterPageHeader {
+  /** The size of bitset in bytes **/
+  1: required i32 numBytes;
+  /** The algorithm for setting bits. **/
+  2: required BloomFilterAlgorithm algorithm;
+  /** The hash function used for Bloom filter. **/
+  3: required BloomFilterHash hash;
+}
+
 struct PageHeader {
   /** the type of the page: indicates which of the *_header fields is set **/
   1: required PageType type
@@ -574,6 +607,7 @@ struct PageHeader {
   6: optional IndexPageHeader index_page_header;
   7: optional DictionaryPageHeader dictionary_page_header;
   8: optional DataPageHeaderV2 data_page_header_v2;
+  9: optional BloomFilterPageHeader bloom_filter_page_header;
 }
 
 /**
@@ -660,6 +694,9 @@ struct ColumnMetaData {
    * This information can be used to determine if all data pages are
    * dictionary encoded for example **/
   13: optional list<PageEncodingStats> encoding_stats;
+
+  /** Byte offset from beginning of file to Bloom filter data. **/
+  14: optional i64 bloom_filter_offset;
 }
 
 struct ColumnChunk {


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Add bloom filters to parquet statistics
> ---------------------------------------
>
>                 Key: PARQUET-41
>                 URL: https://issues.apache.org/jira/browse/PARQUET-41
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-format, parquet-mr
>            Reporter: Alex Levenson
>            Assignee: Junjie Chen
>            Priority: Major
>              Labels: filter2, pull-request-available
>
> For row groups with no dictionary, we could still produce a bloom filter. 
> This could be very useful in filtering entire row groups.
> Pull request:
> https://github.com/apache/parquet-mr/pull/215



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to