[ 
https://issues.apache.org/jira/browse/IMPALA-11924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Riza Suminto resolved IMPALA-11924.
-----------------------------------
    Fix Version/s: Impala 4.3.0
       Resolution: Fixed

> Bloom filter size is unaffected by column NDV
> ---------------------------------------------
>
>                 Key: IMPALA-11924
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11924
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Csaba Ringhofer
>            Assignee: Csaba Ringhofer
>            Priority: Major
>              Labels: bloom-filter, runtime-filters
>             Fix For: Impala 4.3.0
>
>
> For bloom filter sizing Impala simply uses the the cardinality of the build 
> side while it could be clearly capped by NDV:
> https://github.com/apache/impala/blob/feb4a76ed4cb5b688143eb21370f78ec93133c56/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java#L661
> https://github.com/apache/impala/blob/feb4a76ed4cb5b688143eb21370f78ec93133c56/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java#L698
> E.g.:
> {code}
> use tpch_parquet;
> set RUNTIME_FILTER_MIN_SIZE=8192
> RUNTIME_FILTER_MIN_SIZE
> explain select count(*) from orders join customer on o_comment = c_mktsegment
> PLAN-ROOT SINK
> |
> 06:AGGREGATE [FINALIZE]
> |  output: count:merge(*)
> |  row-size=8B cardinality=1
> |
> 05:EXCHANGE [UNPARTITIONED]
> |
> 03:AGGREGATE
> |  output: count(*)
> |  row-size=8B cardinality=1
> |
> 02:HASH JOIN [INNER JOIN, BROADCAST]
> |  hash predicates: o_comment = c_mktsegment
> |  runtime filters: RF000 <- c_mktsegment
> |  row-size=82B cardinality=162.03K
> |
> |--04:EXCHANGE [BROADCAST]
> |  |
> |  01:SCAN HDFS [tpch_parquet.customer]
> |     HDFS partitions=1/1 files=1 size=12.34MB
> |     row-size=21B cardinality=150.00K
> |
> 00:SCAN HDFS [tpch_parquet.orders]
>    HDFS partitions=1/1 files=2 size=54.21MB
>    runtime filters: RF000 -> o_comment
>    row-size=61B cardinality=1.50M
> {code}
> The query above set RF000's size to 65536, while the minimum 8192 would be 
> more than enough, as the ndv of c_mktsegment is 5
> The current logic should work well for FK/PK joins where the build size's 
> cardinality is close the PK's ndv, but can massively overestimate large 
> tables with small ndv keys.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to