[ https://issues.apache.org/jira/browse/IMPALA-5633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184795#comment-17184795 ]
Tim Armstrong commented on IMPALA-5633: --------------------------------------- [~jbapple] I know you're looking at some related things in Kudu now. Do you have any idea why max_filter_error_rate might be set to 0.75 - that seems awfully high. On some queries - TPC-DS Q84 for example, it seems to be too aggressive and is choosing filters that are too small and end up being less effective than desired. E.g. a 4kb filter for household_demographics.hd_demo_sk, which has 7200 distinct values. > Bloom filters underestimate false positive probability for high NDV > ------------------------------------------------------------------- > > Key: IMPALA-5633 > URL: https://issues.apache.org/jira/browse/IMPALA-5633 > Project: IMPALA > Issue Type: Bug > Components: Perf Investigation > Reporter: Jim Apple > Assignee: Jim Apple > Priority: Minor > > {{bloom-filter.cc}} says: > {noformat} > fpp = (1 - exp(-BUCKET_WORDS * ndv/space))^BUCKET_WORDS > where space is in bits. > {noformat} > This is true only discounting the false positive rate induced by hash > collisions. Using {{w}}}-bit hashes, with {{n}} distinct values gives a false > positive rate of > {noformat} > n / exp2(w) + (1.0 - n / exp2(w)) * ((1 - exp(-BUCKET_WORDS * > ndv/space))^BUCKET_WORDS) > {noformat} > This starts to become a factor as {{n}} approaches {{1 << w}}. It also > suggests using bitmaps rather than Bloom filters for large {{n}}, since the > false positive rate for a bitmap is > {noformat} > n / exp2(w) + (1.0 - n / exp2(w)) * (1 - exp(-ndv/space)) > {noformat} > This is lower than the current BF false positive rate for high {{n}} and low > relative space (aka, high false positive probability). -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org