Joe McDonnell has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/18807


Change subject: IMPALA-11468: Port "Block Bloom filter false positive 
correction" from Kudu
......................................................................

IMPALA-11468: Port "Block Bloom filter false positive correction" from Kudu

Block Bloom filters have a higher false positive rate than standard
Bloom filter, due to the uneven distribution of keys between
buckets. This patch changes the code to match the theory, using an
approximation from the paper that introduced block Bloom filters,
"Cache-, Hash- and Space-Efficient Bloom Filters" by Putze et al.

In scan_predicate.cc, filters are created with
BlockBloomFilter::MinLogSpace. Prior to this patch, that method will
sometimes return a value that is lower than the true answer, leading
to smaller filters and higher false positive probabilities than
expected. This patch corrects BlockBloomFilter::MinLogSpace, leading
to filters having the expected false positive rate by dint of their
larger size. The performance impact here is dependent on the extent
than a scan is bottlenecked by heap space for the filter vs. compute
time for the scan predicate application to filter false positives.

For a false positive probability of 1%, as is currently set in
scan_predicate.cc, this patch leads to an increase in filter size of
about 10% and a decrease in filter false positive probability by
50%. However, this is obscured by the coarseness of the fact that
filters are constrained to have a size in bytes that is a power of
two. Loosening that restriction is potential future work.

Porting Notes:
 - The MaxNdv() function is not present in Impala, so it is omitted.
 - This resolves a test failure for ParquetBloomFilter.FindInvalid
   when building with GCC 10 and the associated libstdc++.
 - This adds a comment noting that the test is also dependent
   on the libstdc++ implementation of unordered_set.

Change-Id: Ic992e47976274e3ef0db3633d38e5a8e886274b4
---
M be/src/util/parquet-bloom-filter-test.cc
M be/src/util/parquet-bloom-filter.cc
2 files changed, 74 insertions(+), 22 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/18807/1
--
To view, visit http://gerrit.cloudera.org:8080/18807
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ic992e47976274e3ef0db3633d38e5a8e886274b4
Gerrit-Change-Number: 18807
Gerrit-PatchSet: 1
Gerrit-Owner: Joe McDonnell <[email protected]>

Reply via email to