Paul Rogers created IMPALA-7997:
-----------------------------------
Summary: Rewrites omitted for ON clause, test case expect wrong
plans
Key: IMPALA-7997
URL: https://issues.apache.org/jira/browse/IMPALA-7997
Project: IMPALA
Issue Type: Bug
Components: Frontend
Affects Versions: Impala 3.1.0
Reporter: Paul Rogers
Assignee: Paul Rogers
Bugs in the rewrite code prevented the {{ON}} clause from being subject to the
full set of rewrite rules. In particular, it appears that the “simplify
conditions” rule was not applied. An effort to fix rewrites modified the code
to correctly apply rewrites for {{ON}}. This exposed a bug in the planner for
certain unrealistic, pathological queries.
{{PlannerTest.empy.test}} has the following test case:
{code:sql}
# Constant conjunct in the ON-clause of an outer join is
# assigned to the join.
select *
from functional.alltypessmall a
right outer join functional.alltypestiny b
on (a.id = b.id and !true)
{code}
Note the {{ON}} clause meaning, which is realized after rewrites:
{noformat}
a.id = b.id and !true --> a.id = b.id AND FALSE --> FALSE
{noformat}
That is, no rows in the join match. Currently expected (incorrect) plan:
{noformat}
PLAN-ROOT SINK
|
02:HASH JOIN [RIGHT OUTER JOIN]
| hash predicates: a.id = b.id
| other join predicates: FALSE
| runtime filters: RF000 <- b.id
|
|--01:SCAN HDFS [functional.alltypestiny b]
| partitions=4/4 files=4 size=460B
|
00:SCAN HDFS [functional.alltypessmall a]
partitions=4/4 files=4 size=6.32KB
runtime filters: RF000 -> a.id
{noformat}
Note the attempt to push the runtime filter for {{b.id}} into the left HDFS
scan, then the query throws away the rows because of the {{FALSE}} join
condition.
Once rewrites are applied, we get a new plan, which is actually worse (see
IMPALA-7996):
{noformat}
PLAN-ROOT SINK
|
02:NESTED LOOP JOIN [RIGHT OUTER JOIN]
| join predicates: FALSE
|
|--01:SCAN HDFS [functional.alltypestiny b]
| partitions=4/4 files=4 size=460B
|
00:SCAN HDFS [functional.alltypessmall a]
partitions=4/4 files=4 size=6.32KB
{noformat}
Similar case:
{code:sql}
# Constant conjunct in the ON-clause of an outer join is
# assigned to the join.
select *
from functional.alltypessmall a
left outer join functional.alltypestiny b
on (a.id = b.id and 1 + 1 > 10)
{code}
Reasoning:
{noformat}
a.id = b.id and 1 + 1 > 10 --> a.id = b.id AND 2 > 10
--> a.id = b.id AND FALSE --> FALSE
{noformat}
The current expected plan:
{noformat}
PLAN-ROOT SINK
|
02:HASH JOIN [LEFT OUTER JOIN]
| hash predicates: a.id = b.id
| other join predicates: FALSE
|
|--01:SCAN HDFS [functional.alltypestiny b]
| partitions=4/4 files=4 size=460B
|
00:SCAN HDFS [functional.alltypessmall a]
partitions=4/4 files=4 size=6.32KB
{noformat}
The revised plan, after rewrites folding:
{noformat}
PLAN-ROOT SINK
|
02:NESTED LOOP JOIN [LEFT OUTER JOIN]
| join predicates: FALSE
|
|--01:SCAN HDFS [functional.alltypestiny b]
| partitions=4/4 files=4 size=460B
|
00:SCAN HDFS [functional.alltypessmall a]
partitions=4/4 files=4 size=6.32KB
{noformat}
The lack of simplification also shows up in {{resource-requirements.test}}
Previous:
{noformat}
Analyzed query: SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice,
sum(l_quantity) FROM tpch.customer, tpch.orders, tpch.lineitem LEFT SEMI JOIN
(SELECT l_orderkey FROM tpch.lineitem GROUP BY l_orderkey HAVING $ao$1 /*
sum(l_quantity) */ > CAST(300 AS DECIMAL(5,0))) `$a$1` (`$c$1`) ON o_orderkey =
`$a$1`.`$c$1` WHERE TRUE AND c_custkey = o_custkey AND o_orderkey = l_orderkey
GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice ORDER BY
o_totalprice DESC, o_orderdate ASC LIMIT CAST(100 AS TINYINT)
{noformat}
Revised:
{noformat}
Analyzed query: SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice,
sum(l_quantity) FROM tpch.customer, tpch.orders, tpch.lineitem LEFT SEMI JOIN
(SELECT l_orderkey FROM tpch.lineitem GROUP BY l_orderkey HAVING $ao$1 /*
sum(l_quantity) */ > CAST(300 AS DECIMAL(5,0))) `$a$1` (`$c$1`) ON o_orderkey =
`$a$1`.`$c$1` WHERE c_custkey = o_custkey AND o_orderkey = l_orderkey GROUP BY
c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice ORDER BY o_totalprice
DESC, o_orderdate ASC LIMIT CAST(100 AS TINYINT)
{noformat}
Notice the elimination of {{TRUE AND}} in the revised {{WHERE}} clause.
Also in {{subquery-rewrite.test}}, previous:
{noformat}
| predicates: FALSE OR functional.alltypestiny.int_col IS NULL OR
functional.alltypestiny.int_col = 1
{noformat}
Revised:
{noformat}
| predicates: functional.alltypestiny.int_col IS NULL OR
functional.alltypestiny.int_col = 1
{noformat}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)