[
https://issues.apache.org/jira/browse/IMPALA-11924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17748742#comment-17748742
]
ASF subversion and git services commented on IMPALA-11924:
----------------------------------------------------------
Commit 12dee267fc56bcfef757285d1c698dfc241d2a05 in impala's branch
refs/heads/master from Csaba Ringhofer
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=12dee267f ]
IMPALA-11924: Cap runtime filter NDV with build key NDV
Before this patch, the NDV used for bloom filter sizing was based only
on the cardinality of the build side. This is fine for FK/PK joins but
can greatly overestimate the NDV if the build key column's NDV is
smaller than the number of rows.
This change takes the minimum of the column NDV (unaffected by
selectivity) and the cardinality (which is reduced by selectivity).
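As a rough sketch, the cap amounts to the following; the names
(effectiveNdv, buildCardinality, buildKeyNdv) are illustrative and do
not match Impala's actual RuntimeFilterGenerator code:

```java
// Hypothetical sketch of the NDV cap described above; names are
// illustrative, not Impala's internals.
public class NdvCap {
  /**
   * Pre-patch: filter sizing used only the build-side cardinality.
   * Post-patch: cap it with the build key's NDV, since a bloom filter
   * only needs to cover distinct key values.
   */
  static long effectiveNdv(long buildCardinality, long buildKeyNdv) {
    if (buildKeyNdv < 0) return buildCardinality; // NDV unknown in stats
    return Math.min(buildCardinality, buildKeyNdv);
  }

  public static void main(String[] args) {
    // FK/PK-like join: cardinality ~ key NDV, so the cap changes nothing.
    System.out.println(effectiveNdv(150_000, 150_000)); // 150000
    // Low-NDV key: 150K rows but only 5 distinct values -> size from 5.
    System.out.println(effectiveNdv(150_000, 5));       // 5
  }
}
```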
Testing:
- Adjust test_bloom_filters and test_row_filters, raising the NDV of
the test case such that the assertion is maintained.
- Add 8KB bloom filter test case in test_bloom_filters.
Change-Id: Idaa46789663cb2e6d29f518757d89c85ff8e4d1a
Reviewed-on: http://gerrit.cloudera.org:8080/19506
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Michael Smith <[email protected]>
> Bloom filter size is unaffected by column NDV
> ---------------------------------------------
>
> Key: IMPALA-11924
> URL: https://issues.apache.org/jira/browse/IMPALA-11924
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Reporter: Csaba Ringhofer
> Assignee: Csaba Ringhofer
> Priority: Major
> Labels: bloom-filter, runtime-filters
>
> For bloom filter sizing, Impala simply uses the cardinality of the build
> side, while it could clearly be capped by the key column's NDV:
> https://github.com/apache/impala/blob/feb4a76ed4cb5b688143eb21370f78ec93133c56/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java#L661
> https://github.com/apache/impala/blob/feb4a76ed4cb5b688143eb21370f78ec93133c56/fe/src/main/java/org/apache/impala/planner/RuntimeFilterGenerator.java#L698
> E.g.:
> {code}
> use tpch_parquet;
> set RUNTIME_FILTER_MIN_SIZE=8192;
> explain select count(*) from orders join customer on o_comment = c_mktsegment
> PLAN-ROOT SINK
> |
> 06:AGGREGATE [FINALIZE]
> | output: count:merge(*)
> | row-size=8B cardinality=1
> |
> 05:EXCHANGE [UNPARTITIONED]
> |
> 03:AGGREGATE
> | output: count(*)
> | row-size=8B cardinality=1
> |
> 02:HASH JOIN [INNER JOIN, BROADCAST]
> | hash predicates: o_comment = c_mktsegment
> | runtime filters: RF000 <- c_mktsegment
> | row-size=82B cardinality=162.03K
> |
> |--04:EXCHANGE [BROADCAST]
> | |
> | 01:SCAN HDFS [tpch_parquet.customer]
> | HDFS partitions=1/1 files=1 size=12.34MB
> | row-size=21B cardinality=150.00K
> |
> 00:SCAN HDFS [tpch_parquet.orders]
> HDFS partitions=1/1 files=2 size=54.21MB
> runtime filters: RF000 -> o_comment
> row-size=61B cardinality=1.50M
> {code}
> The query above sets RF000's size to 65536, while the minimum of 8192 would
> be more than enough, as the NDV of c_mktsegment is 5.
> The current logic works well for FK/PK joins, where the build side's
> cardinality is close to the PK's NDV, but it can massively overestimate the
> filter size for large tables with low-NDV keys.
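The gap in the quoted example can be illustrated with textbook bloom
filter sizing math. This is not Impala's exact sizing code (its
parameters differ, which is why it picked 65536 above); it is only a
sketch showing why 5 distinct values fit comfortably in the 8192-byte
minimum:

```java
// Illustrative only: standard bloom filter sizing, not Impala's exact
// implementation. Shows why NDV=5 needs far less than 64KB.
public class BloomSize {
  static final long MIN_SIZE = 8 * 1024;        // RUNTIME_FILTER_MIN_SIZE
  static final long MAX_SIZE = 16 * 1024 * 1024;

  /** Bytes for ndv entries at false-positive rate fpp, rounded up to a
   *  power of two and clamped to [MIN_SIZE, MAX_SIZE]. */
  static long sizeBytes(long ndv, double fpp) {
    // Optimal bit count: m = -n * ln(fpp) / (ln 2)^2
    double bits = -ndv * Math.log(fpp) / (Math.log(2) * Math.log(2));
    long bytes = (long) Math.ceil(bits / 8);
    long pow2 = Long.highestOneBit(Math.max(1, bytes));
    if (pow2 < bytes) pow2 <<= 1;               // round up to power of two
    return Math.min(MAX_SIZE, Math.max(MIN_SIZE, pow2));
  }

  public static void main(String[] args) {
    // Sized from the raw build cardinality (150K rows): hundreds of KB.
    System.out.println(sizeBytes(150_000, 0.01));
    // Sized from the key's NDV (5 distinct values): clamps to the minimum.
    System.out.println(sizeBytes(5, 0.01));     // 8192
  }
}
```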
--
This message was sent by Atlassian Jira
(v8.20.10#820010)