Joe McDonnell has uploaded this change for review. ( http://gerrit.cloudera.org:8080/18807
Change subject: IMPALA-11468: Port "Block Bloom filter false positive correction" from Kudu ...................................................................... IMPALA-11468: Port "Block Bloom filter false positive correction" from Kudu Block Bloom filters have a higher false positive rate than standard Bloom filter, due to the uneven distribution of keys between buckets. This patch changes the code to match the theory, using an approximation from the paper that introduced block Bloom filters, "Cache-, Hash- and Space-Efficient Bloom Filters" by Putze et al. In scan_predicate.cc, filters are created with BlockBloomFilter::MinLogSpace. Prior to this patch, that method will sometimes return a value that is lower than the true answer, leading to smaller filters and higher false positive probabilities than expected. This patch corrects BlockBloomFilter::MinLogSpace, leading to filters having the expected false positive rate by dint of their larger size. The performance impact here is dependent on the extent than a scan is bottlenecked by heap space for the filter vs. compute time for the scan predicate application to filter false positives. For a false positive probability of 1%, as is currently set in scan_predicate.cc, this patch leads to an increase in filter size of about 10% and a decrease in filter false positive probability by 50%. However, this is obscured by the coarseness of the fact that filters are constrained to have a size in bytes that is a power of two. Loosening that restriction is potential future work. Porting Notes: - The MaxNdv() function is not present in Impala, so it is omitted. - This resolves a test failure for ParquetBloomFilter.FindInvalid when building with GCC 10 and the associated libstdc++. - This adds a comment noting that the test is also dependent on the libstdc++ implementation of unordered_set. Change-Id: Ic992e47976274e3ef0db3633d38e5a8e886274b4 --- M be/src/util/parquet-bloom-filter-test.cc M be/src/util/parquet-bloom-filter.cc 2 files changed, 74 insertions(+), 22 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/07/18807/1 -- To view, visit http://gerrit.cloudera.org:8080/18807 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: newchange Gerrit-Change-Id: Ic992e47976274e3ef0db3633d38e5a8e886274b4 Gerrit-Change-Number: 18807 Gerrit-PatchSet: 1 Gerrit-Owner: Joe McDonnell <[email protected]>
