Paul Rogers created IMPALA-7997:
-----------------------------------

             Summary: Rewrites omitted for ON clause, test case expect wrong 
plans
                 Key: IMPALA-7997
                 URL: https://issues.apache.org/jira/browse/IMPALA-7997
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
    Affects Versions: Impala 3.1.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers


Bugs in the rewrite code prevented the {{ON}} clause from being subject to the 
full set of rewrite rules. In particular, it appears that the “simplify 
conditions” rule was not applied. An effort to fix rewrites modified the code 
to correctly apply rewrites for {{ON}}. This exposed a bug in the planner for 
certain unrealistic, pathological queries.

{{PlannerTest.empy.test}} has the following test case:

{code:sql}
# Constant conjunct in the ON-clause of an outer join is
# assigned to the join.
select *
from functional.alltypessmall a
right outer join functional.alltypestiny b
on (a.id = b.id and !true)
{code}

Note the {{ON}} clause meaning, which is realized after rewrites:

{noformat}
a.id = b.id and !true --> a.id = b.id AND FALSE --> FALSE
{noformat}

That is, no rows in the join match. Currently expected (incorrect) plan:

{noformat}
PLAN-ROOT SINK
|
02:HASH JOIN [RIGHT OUTER JOIN]
|  hash predicates: a.id = b.id
|  other join predicates: FALSE
|  runtime filters: RF000 <- b.id
|
|--01:SCAN HDFS [functional.alltypestiny b]
|     partitions=4/4 files=4 size=460B
|
00:SCAN HDFS [functional.alltypessmall a]
   partitions=4/4 files=4 size=6.32KB
   runtime filters: RF000 -> a.id
{noformat}

Note the attempt to push the runtime filter for {{b.id}} into the left HDFS 
scan, then the query throws away the rows because of the {{FALSE}} join 
condition.

Once rewrites are applied, we get a new plan, which is actually worse (see 
IMPALA-7996):

{noformat}
PLAN-ROOT SINK
|
02:NESTED LOOP JOIN [RIGHT OUTER JOIN]
|  join predicates: FALSE
|
|--01:SCAN HDFS [functional.alltypestiny b]
|     partitions=4/4 files=4 size=460B
|
00:SCAN HDFS [functional.alltypessmall a]
   partitions=4/4 files=4 size=6.32KB
{noformat}

Similar case:

{code:sql}
# Constant conjunct in the ON-clause of an outer join is
# assigned to the join.
select *
from functional.alltypessmall a
left outer join functional.alltypestiny b
on (a.id = b.id and 1 + 1 > 10)
{code}

Reasoning:

{noformat}
a.id = b.id and 1 + 1 > 10 --> a.id = b.id AND 2 > 10
  --> a.id = b.id AND FALSE --> FALSE
{noformat}

The current expected plan:

{noformat}
PLAN-ROOT SINK
|
02:HASH JOIN [LEFT OUTER JOIN]
|  hash predicates: a.id = b.id
|  other join predicates: FALSE
|
|--01:SCAN HDFS [functional.alltypestiny b]
|     partitions=4/4 files=4 size=460B
|
00:SCAN HDFS [functional.alltypessmall a]
   partitions=4/4 files=4 size=6.32KB
{noformat}

The revised plan, after rewrites folding:

{noformat}
PLAN-ROOT SINK
|
02:NESTED LOOP JOIN [LEFT OUTER JOIN]
|  join predicates: FALSE
|
|--01:SCAN HDFS [functional.alltypestiny b]
|     partitions=4/4 files=4 size=460B
|
00:SCAN HDFS [functional.alltypessmall a]
   partitions=4/4 files=4 size=6.32KB
{noformat}

The lack of simplification also shows up in {{resource-requirements.test}}

Previous:

{noformat}
Analyzed query: SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice,
sum(l_quantity) FROM tpch.customer, tpch.orders, tpch.lineitem LEFT SEMI JOIN
(SELECT l_orderkey FROM tpch.lineitem GROUP BY l_orderkey HAVING $ao$1 /*
sum(l_quantity) */ > CAST(300 AS DECIMAL(5,0))) `$a$1` (`$c$1`) ON o_orderkey =
`$a$1`.`$c$1` WHERE TRUE AND c_custkey = o_custkey AND o_orderkey = l_orderkey
GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice ORDER BY
o_totalprice DESC, o_orderdate ASC LIMIT CAST(100 AS TINYINT)
{noformat}

Revised:

{noformat}
Analyzed query: SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice,
sum(l_quantity) FROM tpch.customer, tpch.orders, tpch.lineitem LEFT SEMI JOIN
(SELECT l_orderkey FROM tpch.lineitem GROUP BY l_orderkey HAVING $ao$1 /*
sum(l_quantity) */ > CAST(300 AS DECIMAL(5,0))) `$a$1` (`$c$1`) ON o_orderkey =
`$a$1`.`$c$1` WHERE c_custkey = o_custkey AND o_orderkey = l_orderkey GROUP BY
c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice ORDER BY o_totalprice
DESC, o_orderdate ASC LIMIT CAST(100 AS TINYINT)
{noformat}

Notice the elimination of {{TRUE AND}} in the revised {{WHERE}} clause.

Also in {{subquery-rewrite.test}}, previous:

{noformat}
|     predicates: FALSE OR functional.alltypestiny.int_col IS NULL OR 
functional.alltypestiny.int_col = 1
{noformat}

Revised:

{noformat}
|     predicates: functional.alltypestiny.int_col IS NULL OR 
functional.alltypestiny.int_col = 1
{noformat}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to