[ https://issues.apache.org/jira/browse/IMPALA-12451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17766573#comment-17766573 ]
Kurt Deschler commented on IMPALA-12451:
----------------------------------------
Perhaps we should consider increasing RUNTIME_FILTER_MIN_SIZE or making that
sizing more dynamic depending on the size of the query and overall query memory?
> Cardinality underestimation can hurt bloom filter effectiveness
> ---------------------------------------------------------------
>
> Key: IMPALA-12451
> URL: https://issues.apache.org/jira/browse/IMPALA-12451
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend
> Affects Versions: Impala 4.2.0
> Reporter: Riza Suminto
> Priority: Major
> Labels: bloom-filter, runtime-filters
> Attachments: 53.txt, 79.txt
>
>
> The Impala planner selects the desired bloom filter size by estimating the NDV
> (number of distinct values) and applying a target FPP (false-positive
> probability, currently 0.75 by default). Starting from IMPALA-11924, the NDV
> itself is estimated as the minimum of the input cardinality going into the
> join builder and the column's stats NDV.
> If the planner underestimates the input cardinality, it can select a bloom
> filter size that is too small for the actual NDV seen at execution,
> rendering the filter ineffective (it has a high actual false-positive rate).
> Examples of this case can be observed in RF004 of Q53 and RF006 of Q79 from
> a TPC-DS 3TB run with RUNTIME_FILTER_MIN_SIZE=8KB (profiles attached).
> To be specific:
> ||query||filter||column||stats NDV||est cardinality||selected size||actual cardinality||best min size||
> |Q53|RF004|i_item_sk|185571|51|8KB (2^13)|18.53K|8MB (2^23)|
> |Q79|RF006|hd_demo_sk|7200|720|8KB (2^13)|5.04K|2MB (2^21)|
> The cardinality underestimation can be attributed to a bad selectivity estimate
> on the build side of the join node producing those filters. Selecting the
> correct bloom filter size will require fixing this selectivity estimation or
> adding an optimization that also considers the stats NDV when the cardinality
> estimate appears to be severely underestimated.
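
For illustration, the sizing logic described in the issue can be sketched
roughly as follows. This is not Impala's code: it uses the textbook Bloom
filter formula with a fixed number of hash functions (k=8 is an assumption),
and the function names are hypothetical. It only shows how taking
min(estimated cardinality, stats NDV) can leave the filter at the configured
minimum size, and how the false-positive probability rises when the real NDV
is larger than the estimate.

```python
import math


def min_log2_bytes(ndv, fpp, k=8):
    """Smallest log2(size in bytes) keeping a k-hash Bloom filter at or below fpp.

    Textbook approximation; k=8 is an assumed value, not taken from the issue.
    """
    if ndv <= 0:
        return 0
    bits = -k * ndv / math.log(1.0 - fpp ** (1.0 / k))
    return max(0, math.ceil(math.log2(bits / 8)))


def false_positive_prob(ndv, log2_bytes, k=8):
    """Expected false-positive probability after inserting ndv distinct values."""
    bits = 8 * (1 << log2_bytes)
    return (1.0 - math.exp(-k * ndv / bits)) ** k


def planned_log2_size(est_cardinality, stats_ndv, fpp=0.75, min_log2=13):
    """Planner-style choice: NDV is min(cardinality, stats NDV), clamped to a minimum size.

    min_log2=13 stands in for RUNTIME_FILTER_MIN_SIZE=8KB from the issue.
    """
    ndv = min(est_cardinality, stats_ndv)
    return max(min_log2, min_log2_bytes(ndv, fpp))


# Q53 / RF004 inputs from the table: est cardinality 51, stats NDV 185571.
size = planned_log2_size(51, 185571)
print("selected log2 size:", size)
print("FPP at estimated NDV:", false_positive_prob(51, size))
print("FPP at actual cardinality:", false_positive_prob(18530, size))
```

With the Q53 inputs, the sketch stays at the 2^13 minimum, matching the
"selected size" column. The exact probabilities it prints depend on the
assumed formula and should not be read as Impala's measured behavior.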