[
https://issues.apache.org/jira/browse/IMPALA-8165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Greg Rahn updated IMPALA-8165:
------------------------------
Description:
If we take a simple query like such:
{noformat}
select avg(ss_quantity)
from store_sales
join household_demographics on (ss_hdemo_sk=hd_demo_sk)
where (ss_sales_price between 0 and 100 and hd_dep_count = 1)
or (ss_sales_price between 100 and 200 and hd_dep_count = 2);
{noformat}
and look at the plan we see that neither table scan has any predicates pushed
to it, the only filter is in the join
(from impalad version 2.12.0-cdh5.16.x RELEASE (build
3f68649c7bf8a01fb6ba0cbe35dd2492adb836dd)
{noformat}
PLAN-ROOT SINK
|
06:AGGREGATE [FINALIZE]
| output: avg:merge(ss_quantity)
|
05:EXCHANGE [UNPARTITIONED]
|
03:AGGREGATE
| output: avg(ss_quantity)
|
02:HASH JOIN [INNER JOIN, BROADCAST]
| hash predicates: ss_hdemo_sk = hd_demo_sk
| other predicates: (ss_sales_price >= 0 AND ss_sales_price <= 100 AND
hd_dep_count = 1) OR
| (ss_sales_price >= 100 AND ss_sales_price <= 200
AND hd_dep_count = 2)
| runtime filters: RF000 <- hd_demo_sk
|
|--04:EXCHANGE [BROADCAST]
| |
| 01:SCAN HDFS [tpcds_1000_parquet.household_demographics]
| partitions=1/1 files=1 size=41.08KB
|
00:SCAN HDFS [tpcds_1000_parquet.store_sales]
partitions=1824/1824 files=1824 size=189.24GB
runtime filters: RF000 -> ss_hdemo_sk
{noformat}
If we look at PostgreSQL 11.1, we'll see that not only does the join filter,
but the table scan has the appropriate filters pushed to it.
{noformat}
Finalize Aggregate (cost=67549.69..67549.70 rows=1 width=32)
-> Gather (cost=67549.47..67549.68 rows=2 width=32)
Workers Planned: 2
-> Partial Aggregate (cost=66549.47..66549.48 rows=1 width=32)
-> Hash Join (cost=113.12..66547.64 rows=734 width=4)
Hash Cond: (store_sales.ss_hdemo_sk =
household_demographics.hd_demo_sk)
Join Filter: (((store_sales.ss_sales_price >=
'0'::numeric) AND
(store_sales.ss_sales_price <=
'100'::numeric) AND
(household_demographics.hd_dep_count = 1))
OR
((store_sales.ss_sales_price >=
'100'::numeric) AND
(store_sales.ss_sales_price <=
'200'::numeric) AND
(household_demographics.hd_dep_count = 2)))
-> Parallel Seq Scan on store_sales (cost=0.00..66343.20
rows=7305 width=22)
Filter: (((ss_sales_price >= '0'::numeric) AND
(ss_sales_price <= '100'::numeric)) OR
((ss_sales_price >= '100'::numeric) AND
(ss_sales_price <= '200'::numeric)))
-> Hash (cost=112.62..112.62 rows=40 width=8)
-> Seq Scan on household_demographics
(cost=0.00..112.62 rows=40 width=8)
Filter: ((hd_dep_count = 1) OR (hd_dep_count =
2))
{noformat}
was:
If we take a simple query like such:
{noformat}
select avg(ss_quantity)
from store_sales
join household_demographics on (ss_hdemo_sk=hd_demo_sk)
where (ss_sales_price between 0 and 100 and hd_dep_count = 1)
or (ss_sales_price between 100 and 200 and hd_dep_count = 2);
{noformat}
and look at the plan we see that neither table scan has any predicates pushed
to it, the only filter is in the join
(from impalad version 2.12.0-cdh5.16.x RELEASE (build
3f68649c7bf8a01fb6ba0cbe35dd2492adb836dd)
{noformat}
PLAN-ROOT SINK
|
06:AGGREGATE [FINALIZE]
| output: avg:merge(ss_quantity)
|
05:EXCHANGE [UNPARTITIONED]
|
03:AGGREGATE
| output: avg(ss_quantity)
|
02:HASH JOIN [INNER JOIN, BROADCAST]
| hash predicates: ss_hdemo_sk = hd_demo_sk
| other predicates: (ss_sales_price >= 0 AND ss_sales_price <= 100 AND
hd_dep_count = 1) OR (ss_sales_price >= 100 AND ss_sales_price <= 200 AND
hd_dep_count = 2)
| runtime filters: RF000 <- hd_demo_sk
|
|--04:EXCHANGE [BROADCAST]
| |
| 01:SCAN HDFS [tpcds_1000_parquet.household_demographics]
| partitions=1/1 files=1 size=41.08KB
|
00:SCAN HDFS [tpcds_1000_parquet.store_sales]
partitions=1824/1824 files=1824 size=189.24GB
runtime filters: RF000 -> ss_hdemo_sk
{noformat}
If we look at PostgreSQL 11.1, we'll see that not only does the join filter,
but the table scan has the appropriate filters pushed to it.
{noformat}
Finalize Aggregate (cost=67549.69..67549.70 rows=1 width=32)
-> Gather (cost=67549.47..67549.68 rows=2 width=32)
Workers Planned: 2
-> Partial Aggregate (cost=66549.47..66549.48 rows=1 width=32)
-> Hash Join (cost=113.12..66547.64 rows=734 width=4)
Hash Cond: (store_sales.ss_hdemo_sk =
household_demographics.hd_demo_sk)
Join Filter: (((store_sales.ss_sales_price >=
'0'::numeric) AND
(store_sales.ss_sales_price <=
'100'::numeric) AND
(household_demographics.hd_dep_count = 1))
OR
((store_sales.ss_sales_price >=
'100'::numeric) AND
(store_sales.ss_sales_price <=
'200'::numeric) AND
(household_demographics.hd_dep_count = 2)))
-> Parallel Seq Scan on store_sales (cost=0.00..66343.20
rows=7305 width=22)
Filter: (((ss_sales_price >= '0'::numeric) AND
(ss_sales_price <= '100'::numeric)) OR
((ss_sales_price >= '100'::numeric) AND
(ss_sales_price <= '200'::numeric)))
-> Hash (cost=112.62..112.62 rows=40 width=8)
-> Seq Scan on household_demographics
(cost=0.00..112.62 rows=40 width=8)
Filter: ((hd_dep_count = 1) OR (hd_dep_count =
2))
{noformat}
> Planner does not push through predicates when there is a disjunction
> --------------------------------------------------------------------
>
> Key: IMPALA-8165
> URL: https://issues.apache.org/jira/browse/IMPALA-8165
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Reporter: Greg Rahn
> Priority: Major
> Labels: planner, tpc-ds
>
> If we take a simple query like such:
> {noformat}
> select avg(ss_quantity)
> from store_sales
> join household_demographics on (ss_hdemo_sk=hd_demo_sk)
> where (ss_sales_price between 0 and 100 and hd_dep_count = 1)
> or (ss_sales_price between 100 and 200 and hd_dep_count = 2);
> {noformat}
> and look at the plan we see that neither table scan has any predicates pushed
> to it, the only filter is in the join
> (from impalad version 2.12.0-cdh5.16.x RELEASE (build
> 3f68649c7bf8a01fb6ba0cbe35dd2492adb836dd)
> {noformat}
> PLAN-ROOT SINK
> |
> 06:AGGREGATE [FINALIZE]
> | output: avg:merge(ss_quantity)
> |
> 05:EXCHANGE [UNPARTITIONED]
> |
> 03:AGGREGATE
> | output: avg(ss_quantity)
> |
> 02:HASH JOIN [INNER JOIN, BROADCAST]
> | hash predicates: ss_hdemo_sk = hd_demo_sk
> | other predicates: (ss_sales_price >= 0 AND ss_sales_price <= 100 AND
> hd_dep_count = 1) OR
> | (ss_sales_price >= 100 AND ss_sales_price <=
> 200 AND hd_dep_count = 2)
> | runtime filters: RF000 <- hd_demo_sk
> |
> |--04:EXCHANGE [BROADCAST]
> | |
> | 01:SCAN HDFS [tpcds_1000_parquet.household_demographics]
> | partitions=1/1 files=1 size=41.08KB
> |
> 00:SCAN HDFS [tpcds_1000_parquet.store_sales]
> partitions=1824/1824 files=1824 size=189.24GB
> runtime filters: RF000 -> ss_hdemo_sk
> {noformat}
> If we look at PostgreSQL 11.1, we'll see that not only does the join filter,
> but the table scan has the appropriate filters pushed to it.
> {noformat}
> Finalize Aggregate (cost=67549.69..67549.70 rows=1 width=32)
> -> Gather (cost=67549.47..67549.68 rows=2 width=32)
> Workers Planned: 2
> -> Partial Aggregate (cost=66549.47..66549.48 rows=1 width=32)
> -> Hash Join (cost=113.12..66547.64 rows=734 width=4)
> Hash Cond: (store_sales.ss_hdemo_sk =
> household_demographics.hd_demo_sk)
> Join Filter: (((store_sales.ss_sales_price >=
> '0'::numeric) AND
> (store_sales.ss_sales_price <=
> '100'::numeric) AND
> (household_demographics.hd_dep_count =
> 1)) OR
> ((store_sales.ss_sales_price >=
> '100'::numeric) AND
> (store_sales.ss_sales_price <=
> '200'::numeric) AND
> (household_demographics.hd_dep_count =
> 2)))
> -> Parallel Seq Scan on store_sales
> (cost=0.00..66343.20 rows=7305 width=22)
> Filter: (((ss_sales_price >= '0'::numeric) AND
> (ss_sales_price <= '100'::numeric)) OR
> ((ss_sales_price >= '100'::numeric) AND
> (ss_sales_price <= '200'::numeric)))
> -> Hash (cost=112.62..112.62 rows=40 width=8)
> -> Seq Scan on household_demographics
> (cost=0.00..112.62 rows=40 width=8)
> Filter: ((hd_dep_count = 1) OR (hd_dep_count
> = 2))
> {noformat}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]