[ 
https://issues.apache.org/jira/browse/IMPALA-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Riza Suminto updated IMPALA-12451:
----------------------------------
    Description: 
Impala planner select desired bloom filter size by estimating the NDV of values 
and target FPP (currently default at 0.75). Starting from IMPALA-11924, the NDV 
itself is estimated by taking the min between the input cardinality going to 
the join builder vs the column's stats NDV.

If Planner underestimate the input cardinality, it can select bloom filter size 
that is too small to fit the actual row NDV from the execution, rendering the 
filter ineffective (has big actual false-positive rate). Example of this case 
can be observed at RF004 of Q53 from TPC-DS 3TB run with 
RUNTIME_FILTER_MIN_SIZE=8KB ([^53.txt]).

To be specific:
||query||filter||column||stats NDV||est cardinality||selected size||actual 
cardinality||ndv based min size||
|Q53|RF004|i_item_sk|360000|51|8KB (2^13)|18.53K|128KB (2^17)|

For RF004, the cardinality underestimation can be attributed to bad selectivity 
estimate in the build hand side of the join node producing that filters. The 
actual cardinality 18.53K is still within the limit of 8KB bloom filter size, 
but since the target FPP is 0.75, it still produce high actual false-positive 
rate, passing out more rows.

Getting better bloom filter size will require fixing this selectivity 
estimation, reducing target fpp lower than current default (0.75), or add an 
optimization to also consider stats NDV if cardinality estimate seems to be 
severely underestimated. [^53_double_filter_size.txt] shows that increasing 
RF004 size can lead to better row filtering.

  was:
Impala planner select desired bloom filter size by estimating the NDV of values 
and target FPP (currently default at 0.75). Starting from IMPALA-11924, the NDV 
itself is estimated by taking the min between the input cardinality going to 
the join builder vs the column's stats NDV.

If Planner underestimate the input cardinality, it can select bloom filter size 
that is too small to fit the actual row NDV from the execution, rendering the 
filter ineffective (has big actual false-positive rate). Example of this case 
can be observed at RF004 of Q53 and RF006 of Q79 from TPC-DS 3TB run with 
RUNTIME_FILTER_MIN_SIZE=8KB (profiles attached).

To be specific:
||query||filter||column||stats NDV||est cardinality||selected size||actual 
cardinality||best min size||
|Q53|RF004|i_item_sk|185571|51|8KB (2^13)|18.53K|8MB (2^23)|
|Q79|RF006|hd_demo_sk|7200|720|8KB (2^13)|5.04K|2MB (2^21)|

The cardinality underestimation can be attributed to bad selectivity estimate 
in the build hand side of the join node producing that filters. Correct bloom 
filter size will require fixing this selectivity estimation or add an 
optimization to also consider stats NDV if cardinality estimate seems to be 
severely underestimated.


> Cardinality underestimation can hurt bloom filter effectiveness
> ---------------------------------------------------------------
>
>                 Key: IMPALA-12451
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12451
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 4.2.0
>            Reporter: Riza Suminto
>            Priority: Major
>              Labels: bloom-filter, runtime-filters
>         Attachments: 53.txt, 53_double_filter_size.txt
>
>
> Impala planner select desired bloom filter size by estimating the NDV of 
> values and target FPP (currently default at 0.75). Starting from 
> IMPALA-11924, the NDV itself is estimated by taking the min between the input 
> cardinality going to the join builder vs the column's stats NDV.
> If Planner underestimate the input cardinality, it can select bloom filter 
> size that is too small to fit the actual row NDV from the execution, 
> rendering the filter ineffective (has big actual false-positive rate). 
> Example of this case can be observed at RF004 of Q53 from TPC-DS 3TB run with 
> RUNTIME_FILTER_MIN_SIZE=8KB ([^53.txt]).
> To be specific:
> ||query||filter||column||stats NDV||est cardinality||selected size||actual 
> cardinality||ndv based min size||
> |Q53|RF004|i_item_sk|360000|51|8KB (2^13)|18.53K|128KB (2^17)|
> For RF004, the cardinality underestimation can be attributed to bad 
> selectivity estimate in the build hand side of the join node producing that 
> filters. The actual cardinality 18.53K is still within the limit of 8KB bloom 
> filter size, but since the target FPP is 0.75, it still produce high actual 
> false-positive rate, passing out more rows.
> Getting better bloom filter size will require fixing this selectivity 
> estimation, reducing target fpp lower than current default (0.75), or add an 
> optimization to also consider stats NDV if cardinality estimate seems to be 
> severely underestimated. [^53_double_filter_size.txt] shows that increasing 
> RF004 size can lead to better row filtering.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to