[ 
https://issues.apache.org/jira/browse/IMPALA-8386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825582#comment-16825582
 ] 

ASF subversion and git services commented on IMPALA-8386:
---------------------------------------------------------

Commit f25a899924856f705ecb581b237a149003279473 in impala's branch 
refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=f25a899 ]

IMPALA-8386: Fix incorrect equivalence conjuncts not treated as identity

When generating single node plans for inline views, Impala will create
some equivalence conjuncts based on slot equivalences. However, these
conjuncts may finally be substituted to identity (e.g. a = a) which may
incorrectly reject rows with nulls. We already have some logic to remove
this kind of conjuncts but the existing checks have exceptions.

For example, consider the following tables and a query:
table A        table B            table C
+------+  +------+--------+  +------+------+
| a_id |  | b_id | amount |  | a_id | b_id |
+------+  +------+--------+  +------+------+
| 1    |  | 1    | 10     |  | 1    | 1    |
| 2    |  | 1    | 20     |  | 2    | 2    |
+------+  | 2    | NULL   |  +------+------+
          +------+--------+
    select * from (select t2.a_id, t2.amount1, t2.amount2
        from a
        left outer join (
            select c.a_id, amount as amount1, amount as amount2
            from b join c on b.b_id = c.b_id
        ) t2
        on a.a_id = t2.a_id
    ) t1;

They query has 11 slots. The valueTransferGraph (slot equivalences) has
3 strongly connected components:
 * {slot0 (b.b_id), slot1 (c.b_id)}
 * {slot2 (c.a_id), slot4 (t2.a_id), slot8 (t1.a_id)}
 * {slot3 (b.amount), slot5 (t2.amount1), slot6 (t2.amount2),
slot9 (t1.amount1), slot10 (t1.amount2)}

In SingleNodePlanner#migrateConjunctsToInlineView, when dealing with
inline view t1, a predicate "t1.amount1 = t1.amount2" will first be
created by Analyzer#createEquivConjuncts, then be substituted using the
smap_ of the inline view and become "t2.amount1 = t2.amount2". It can
still pass the IdentityPredicate check. However, the substituted one
will finally be resolved to "amount = amount" and be assigned to the
left outer join node. So nulls are incorrectly rejected.

Actually, when checking IdentityPredicates, we need to check the final
resolved version of them using base table slots (baseTblSmap_). So the
predicate "t1.amount1 = t1.amount2" will be resolved to "amount = amount"
and won't pass the IdentityPredicate check.

Tests:
 * Add plan tests in PlannerTest/inline-view.test
 * Run all tests locally in CORE exploration strategy

Change-Id: Ia87aa9db2de85f0716e4854a88727aad593773fa
Reviewed-on: http://gerrit.cloudera.org:8080/12939
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Incorrect predicate in a left outer join query
> ----------------------------------------------
>
>                 Key: IMPALA-8386
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8386
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.12.0, Impala 3.2.0
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Critical
>              Labels: correctness
>
> skyyws <[email protected]> reported a bug [in the mailing 
> list|https://lists.apache.org/thread.html/0bdbbaa6bb35b552f050ae30587b7d75b78a72ec60007a8bc0a4a8a9@%3Cdev.impala.apache.org%3E]
>  on the following data set:
> {code:java}
> table A
> +------+
> | a_id |
> +------+
> | 1    |
> | 2    |
> +------+
> table B
> +------+--------+
> | b_id | amount |
> +------+--------+
> | 1    | 10     |
> | 1    | 20     |
> | 2    | NULL   |
> +------+--------+
> table C
> +------+------+
> | a_id | b_id |
> +------+------+
> | 1    | 1    |
> | 2    | 2    |
> +------+------+{code}
> The following query returns a wrong result "1":
> {code:java}
> select count(1) from (
>     select t2.a_id,t2.amount1,t2.amount2
>     from( select a_id from a) t1
>     left outer join (
>         select c.a_id,sum(amount) as amount1,sum(amount) as amount2
>         from b join c  on b.b_id = c.b_id group by c.a_id) t2
>     on t1.a_id = t2.a_id
> ) t;
> {code}
> Removing "t2.amount2" can get the right result "2":
> {code:java}
> select count(1) from (
>     select t2.a_id,t2.amount1
>     from( select a_id from a) t1
>     left outer join (
>         select c.a_id,sum(amount) as amount1,sum(amount) as amount2
>         from b join c  on b.b_id = c.b_id group by c.a_id) t2
>     on t1.a_id = t2.a_id
> ) t;
> {code}
> The problem is that in query 1, Impala generates a wrong predicate 
> "sum(amount) = sum(amount)" which rejects nulls:
> {code:java}
> +-------------------------------------------------------------+
> | Explain String                                              |
> +-------------------------------------------------------------+
> | Max Per-Host Resource Reservation: Memory=10.84MB Threads=8 |
> | Per-Host Resource Estimates: Memory=143MB                   |
> | Codegen disabled by planner                                 |
> |                                                             |
> | PLAN-ROOT SINK                                              |
> | |                                                           |
> | 12:AGGREGATE [FINALIZE]                                     |
> | |  output: count:merge(*)                                   |
> | |  row-size=8B cardinality=1                                |
> | |                                                           |
> | 11:EXCHANGE [UNPARTITIONED]                                 |
> | |                                                           |
> | 06:AGGREGATE                                                |
> | |  output: count(*)                                         |
> | |  row-size=8B cardinality=1                                |
> | |                                                           |
> | 05:HASH JOIN [RIGHT OUTER JOIN, PARTITIONED]                |
> | |  hash predicates: c.a_id = a_id                           |
> | |  other predicates: sum(amount) = sum(amount)      <---------- Wrong 
> inferred predicate which incorrectly reject nulls
> | |  runtime filters: RF000 <- a_id                           |
> | |  row-size=16B cardinality=2                               |
> | |                                                           |
> | |--10:EXCHANGE [HASH(a_id)]                                 |
> | |  |                                                        |
> | |  00:SCAN HDFS [default.a]                                 |
> | |     partitions=1/1 files=1 size=4B                        |
> | |     row-size=4B cardinality=2                             |
> | |                                                           |
> | 09:AGGREGATE [FINALIZE]                                     |
> | |  output: sum:merge(amount)                                |
> | |  group by: c.a_id                                         |
> | |  row-size=12B cardinality=2                               |
> | |                                                           |
> | 08:EXCHANGE [HASH(c.a_id)]                                  |
> | |                                                           |
> | 04:AGGREGATE [STREAMING]                                    |
> | |  output: sum(amount)                                      |
> | |  group by: c.a_id                                         |
> | |  row-size=12B cardinality=2                               |
> | |                                                           |
> | 03:HASH JOIN [INNER JOIN, BROADCAST]                        |
> | |  hash predicates: b.b_id = c.b_id                         |
> | |  runtime filters: RF002 <- c.b_id                         |
> | |  row-size=16B cardinality=3                               |
> | |                                                           |
> | |--07:EXCHANGE [BROADCAST]                                  |
> | |  |                                                        |
> | |  02:SCAN HDFS [default.c]                                 |
> | |     partitions=1/1 files=1 size=8B                        |
> | |     runtime filters: RF000 -> default.c.a_id              |
> | |     row-size=8B cardinality=2                             |
> | |                                                           |
> | 01:SCAN HDFS [default.b]                                    |
> |    partitions=1/1 files=1 size=15B                          |
> |    runtime filters: RF002 -> b.b_id                         |
> |    row-size=8B cardinality=3                                |
> +-------------------------------------------------------------+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to