Qifan Chen created IMPALA-10602:
-----------------------------------
Summary: Intersection of multiple min/max filters when applying on
common join column
Key: IMPALA-10602
URL: https://issues.apache.org/jira/browse/IMPALA-10602
Project: IMPALA
Issue Type: Improvement
Reporter: Qifan Chen
Currently, Impala generates two separate min/max filters, one from each join,
for the following test query.
{quote}select straight_join count(*)
from store_sales ss, date_dim d1, date_dim d2
where
ss.ss_sold_time_sk = d1.d_date_sk and
ss.ss_sold_time_sk = d2.d_date_sk;{quote}
{quote}| 00:SCAN HDFS [tpcds_parquet.store_sales ss, RANDOM]
| HDFS partitions=1824/1824 files=1824 size=200.94MB
| runtime filters: RF001[min_max] -> ss.ss_sold_time_sk, RF003[min_max] -> ss.ss_sold_time_sk, RF000[bloom] -> ss.ss_sold_time_sk, RF002[bloom] -> ss.ss_sold_time_sk
| stored statistics:
| table: rows=2.88M size=200.94MB
| partitions: 1824/1824 rows=2.88M
| columns: all
| extrapolated-rows=disabled max-scan-range-rows=130.09K
| file formats: [PARQUET]
| mem-estimate=16.00MB mem-reservation=512.00KB thread-reservation=1
| tuple-ids=0 row-size=4B cardinality=2.88M{quote}
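It would be more efficient to combine RF001 and RF003 into a single filter and
apply that instead. Because both filters target the same column
(ss.ss_sold_time_sk) and a row must pass both of them, the combined filter is
the intersection of the two ranges. For example, if the range of RF001 is
[10, 50] and the range of RF003 is [20, 60], the combined filter is [20, 50].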
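
The range arithmetic in the example can be sketched as follows. This is only an illustration in Python, not Impala code: the MinMaxFilter class and the intersect function are hypothetical names, and the sketch assumes closed integer ranges that a row must satisfy simultaneously.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MinMaxFilter:
    """Hypothetical closed range [lo, hi]; not Impala's actual filter class."""
    lo: int
    hi: int


def intersect(a: MinMaxFilter, b: MinMaxFilter) -> Optional[MinMaxFilter]:
    """Return the range of values that pass both filters, or None if empty."""
    lo = max(a.lo, b.lo)  # the tighter lower bound wins
    hi = min(a.hi, b.hi)  # the tighter upper bound wins
    if lo > hi:
        # The ranges do not overlap, so no value can pass both filters.
        return None
    return MinMaxFilter(lo, hi)


rf001 = MinMaxFilter(10, 50)
rf003 = MinMaxFilter(20, 60)
print(intersect(rf001, rf003))  # MinMaxFilter(lo=20, hi=50)
```

In this sketch an empty intersection is reported as None, which would mean that no row of the scanned table can satisfy both joins.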