Riza Suminto created IMPALA-12451:
-------------------------------------
Summary: Cardinality underestimation can hurt bloom filter effectiveness
Key: IMPALA-12451
URL: https://issues.apache.org/jira/browse/IMPALA-12451
Project: IMPALA
Issue Type: Improvement
Components: Frontend
Affects Versions: Impala 4.2.0
Reporter: Riza Suminto
Attachments: 53.txt, 79.txt
The Impala planner selects the desired bloom filter size by estimating the NDV
(number of distinct values) of the filtered values and applying a target FPP
(false-positive probability; the default is currently 0.75). Starting from
IMPALA-11924, the NDV itself is estimated as the minimum of the input
cardinality going to the join builder and the column's stats NDV.
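The sizing logic described above can be sketched roughly as follows. This is a minimal illustration, not Impala's code: the function names are hypothetical, and the size formula is the textbook bloom filter formula (m = -n ln p / (ln 2)^2 bits), which is assumed here for simplicity and may differ from what Impala's bloom filter implementation actually uses.

```python
import math


def estimate_ndv(build_cardinality: int, stats_ndv: int) -> int:
    """NDV estimate as described in this issue: min of the two inputs."""
    return min(build_cardinality, stats_ndv)


def filter_size_bytes(ndv: int, target_fpp: float,
                      min_size_bytes: int = 8 * 1024) -> int:
    """Pick a power-of-two filter size for `ndv` values at `target_fpp`.

    Assumption: textbook bloom filter sizing, not Impala's exact formula.
    """
    bits = -ndv * math.log(target_fpp) / (math.log(2) ** 2)
    size = max(1, math.ceil(bits / 8))
    # Round up to the next power of two.
    size = 1 << (size - 1).bit_length()
    # Never go below the configured minimum (RUNTIME_FILTER_MIN_SIZE).
    return max(size, min_size_bytes)


# With a badly underestimated cardinality the minimum size is chosen,
# even though the column's stats NDV would ask for a larger filter.
ndv = estimate_ndv(build_cardinality=51, stats_ndv=185571)
small = filter_size_bytes(ndv, target_fpp=0.75)
larger = filter_size_bytes(185571, target_fpp=0.75)
```

Under these assumptions, the estimate of 51 rows always falls back to the minimum filter size, so the result is driven entirely by the cardinality estimate and not by the column statistics.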
If the planner underestimates the input cardinality, it can select a bloom
filter size that is too small for the actual NDV seen at execution time,
rendering the filter ineffective (its actual false-positive rate is high).
Examples of this can be observed at RF004 of Q53 and RF006 of Q79 in a TPC-DS
3TB run with RUNTIME_FILTER_MIN_SIZE=8KB (profiles attached).
To be specific:
||query||filter||column||stats NDV||est cardinality||selected size||actual cardinality||best min size||
|Q53|RF004|i_item_sk|185571|51|8KB (2^13)|18.53K|8MB (2^23)|
|Q79|RF006|hd_demo_sk|7200|720|8KB (2^13)|5.04K|2MB (2^21)|
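
To illustrate why an undersized filter loses effectiveness, here is a rough sketch using the classic bloom filter false-positive approximation, (1 - e^(-kn/m))^k. It is only a model: the function name is hypothetical, the choice of k = 8 hash bits per key is an assumption, and a block-based filter such as Impala's may behave differently in practice. The inputs are the Q53 numbers from the table above.

```python
import math


def expected_fpp(distinct_values: int, filter_bytes: int,
                 hashes_per_key: int = 8) -> float:
    """Classic bloom filter FPP approximation: (1 - e^(-k*n/m))^k.

    Assumption: hashes_per_key=8 is illustrative, not taken from Impala.
    """
    bits = filter_bytes * 8
    fill = 1.0 - math.exp(-hashes_per_key * distinct_values / bits)
    return fill ** hashes_per_key


# Q53 / RF004: sized for 51 estimated rows, but about 18.53K rows arrived.
planned = expected_fpp(51, 8 * 1024)
actual = expected_fpp(18530, 8 * 1024)
print(f"planned FPP ~ {planned:.3g}, actual FPP ~ {actual:.3g}")
```

In this model the false-positive rate of the same 8KB filter rises by many orders of magnitude once the real number of distinct values replaces the planner's estimate.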
The cardinality underestimation can be attributed to a bad selectivity estimate
on the build side of the join node that produces these filters. Selecting the
correct bloom filter size will require either fixing this selectivity estimate
or adding an optimization that also considers the stats NDV when the
cardinality estimate appears to be severely underestimated.
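
One possible shape for the second option is sketched below. This is purely hypothetical and not a proposed patch: the function name, the `suspicious_ratio` parameter, and its default of 5 are all assumptions made for illustration, and a real heuristic would need tuning against actual workloads.

```python
def ndv_for_filter_sizing(est_cardinality: int, stats_ndv: int,
                          suspicious_ratio: int = 5) -> int:
    """Choose the NDV used to size a bloom filter.

    Hypothetical heuristic: if the cardinality estimate is more than
    `suspicious_ratio` times smaller than the stats NDV, treat it as a
    likely underestimate and fall back to the stats NDV. Otherwise keep
    the min() behavior described in this issue.
    """
    if est_cardinality * suspicious_ratio < stats_ndv:
        return stats_ndv
    return min(est_cardinality, stats_ndv)


# The two cases from the table above.
q53_ndv = ndv_for_filter_sizing(51, 185571)
q79_ndv = ndv_for_filter_sizing(720, 7200)
```

The trade-off in such a heuristic is that a genuinely selective build side would also get a larger filter than it needs, so the threshold would have to balance memory cost against the risk of an ineffective filter.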