Riza Suminto created IMPALA-12451:
-------------------------------------

             Summary: Cardinality underestimation can hurt bloom filter 
effectiveness
                 Key: IMPALA-12451
                 URL: https://issues.apache.org/jira/browse/IMPALA-12451
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
    Affects Versions: Impala 4.2.0
            Reporter: Riza Suminto
         Attachments: 53.txt, 79.txt

The Impala planner selects the desired bloom filter size from the estimated NDV 
of values and the target FPP (currently 0.75 by default). Since IMPALA-11924, 
the NDV itself is estimated as the minimum of the input cardinality going into 
the join builder and the column's stats NDV.

If the planner underestimates the input cardinality, it can select a bloom 
filter size that is too small for the actual row NDV seen at execution, 
rendering the filter ineffective (it has a high actual false-positive rate). 
Examples of this can be observed at RF004 of Q53 and RF006 of Q79 from a TPC-DS 
3TB run with RUNTIME_FILTER_MIN_SIZE=8KB (profiles attached).
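
As a rough illustration of why an undersized filter degrades, the sketch below applies the textbook bloom filter approximation (1 - e^(-k*n/m))^k to the Q53 numbers from this report. It is not Impala's actual filter implementation: the function name `approx_fpp` and the choice of k = 8 hash functions are assumptions made for illustration, and the actual build-side cardinality is used as a stand-in for the actual NDV.

```python
import math


def approx_fpp(num_bits: int, ndv: int, num_hashes: int = 8) -> float:
    """Textbook approximation of a bloom filter's false-positive rate.

    num_hashes=8 is an assumed value for illustration only.
    """
    if ndv <= 0:
        return 0.0
    return (1.0 - math.exp(-num_hashes * ndv / num_bits)) ** num_hashes


# 8KB filter expressed in bits (the size selected for RF004 of Q53).
FILTER_BITS = 8 * 1024 * 8

planned = approx_fpp(FILTER_BITS, 51)      # cardinality the planner estimated
actual = approx_fpp(FILTER_BITS, 18_530)   # cardinality observed at execution

print(f"planned FPP ~ {planned:.2e}, actual FPP ~ {actual:.2f}")
```

Under these assumptions, the same 8KB filter goes from a negligible false-positive rate at the planned NDV to a much higher one at the observed NDV.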

To be specific:
||query||filter||column||stats NDV||est cardinality||selected size||actual cardinality||best min size||
|Q53|RF004|i_item_sk|185571|51|8KB (2^13)|18.53K|8MB (2^23)|
|Q79|RF006|hd_demo_sk|7200|720|8KB (2^13)|5.04K|2MB (2^21)|

The cardinality underestimation can be attributed to a bad selectivity estimate 
on the build-hand side of the join node that produces these filters. Selecting 
the correct bloom filter size will require fixing this selectivity estimation, 
or adding an optimization that also considers the stats NDV when the 
cardinality estimate seems to be severely underestimated.
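
One possible shape for such an optimization is sketched below. It is a hypothetical heuristic, not planner code: the function name `estimate_filter_ndv`, the `max_ratio` parameter, and its default of 100 are all assumptions. The idea is to keep the min-based estimate from IMPALA-11924 unless the stats NDV exceeds the estimated cardinality by a large factor, in which case the stats NDV is used instead.

```python
def estimate_filter_ndv(input_cardinality: int, stats_ndv: int,
                        max_ratio: float = 100.0) -> int:
    """Hypothetical NDV estimate for bloom filter sizing.

    max_ratio is an arbitrary illustrative threshold, not an Impala option.
    """
    if input_cardinality <= 0:
        # No usable cardinality estimate: fall back to the column stats.
        return stats_ndv
    if stats_ndv / input_cardinality > max_ratio:
        # Estimate looks severely low relative to stats: trust the stats NDV.
        return stats_ndv
    # Otherwise keep the min-based estimate.
    return min(input_cardinality, stats_ndv)


# Values taken from the table in this report.
q53_ndv = estimate_filter_ndv(51, 185571)   # ratio ~3639, guard triggers
q79_ndv = estimate_filter_ndv(720, 7200)    # ratio 10, guard does not trigger
print(q53_ndv, q79_ndv)
```

With the assumed threshold of 100, the guard catches the Q53 case but not Q79, whose ratio is only 10. Any real heuristic would need a carefully chosen threshold, or a different signal altogether.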



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
