[ 
https://issues.apache.org/jira/browse/IMPALA-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Riza Suminto updated IMPALA-12451:
----------------------------------
    Attachment:     (was: 79.txt)

> Cardinality underestimation can hurt bloom filter effectiveness
> ---------------------------------------------------------------
>
>                 Key: IMPALA-12451
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12451
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Frontend
>    Affects Versions: Impala 4.2.0
>            Reporter: Riza Suminto
>            Priority: Major
>              Labels: bloom-filter, runtime-filters
>         Attachments: 53.txt, 53_double_filter_size.txt
>
>
> Impala planner select desired bloom filter size by estimating the NDV of 
> values and target FPP (currently default at 0.75). Starting from 
> IMPALA-11924, the NDV itself is estimated by taking the min between the input 
> cardinality going to the join builder vs the column's stats NDV.
> If Planner underestimate the input cardinality, it can select bloom filter 
> size that is too small to fit the actual row NDV from the execution, 
> rendering the filter ineffective (has big actual false-positive rate). 
> Example of this case can be observed at RF004 of Q53 from TPC-DS 3TB run with 
> RUNTIME_FILTER_MIN_SIZE=8KB ([^53.txt]).
> To be specific:
> ||query||filter||column||stats NDV||est cardinality||selected size||actual 
> cardinality||ndv based min size||
> |Q53|RF004|i_item_sk|360000|51|8KB (2^13)|18.53K|128KB (2^17)|
> For RF004, the cardinality underestimation can be attributed to bad 
> selectivity estimate in the build hand side of the join node producing that 
> filters. The actual cardinality 18.53K is still within the limit of 8KB bloom 
> filter size, but since the target FPP is 0.75, it still produce high actual 
> false-positive rate, passing out more rows.
> Getting better bloom filter size will require fixing this selectivity 
> estimation, reducing target fpp lower than current default (0.75), or add an 
> optimization to also consider stats NDV if cardinality estimate seems to be 
> severely underestimated. [^53_double_filter_size.txt] shows that increasing 
> RF004 size can lead to better row filtering.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to