I use another query to reproduce the bug: explain select * from (select t2.a_id,t2.amount1,t2.amount2 from a left outer join ( select c.a_id, amount as amount1, amount as amount2 from b join c on b.b_id = c.b_id) t2 on a.a_id = t2.a_id) t1
The creation of the "amount = amount" predicate should be traced to SingleNodePlanner#migrateConjunctsToInlineView when dealing with inline view t1. It's an optimization that Impala will create some predicates based on slot equivalences (i.e. Analyzer#valueTransferGraph). However, these conjuncts may be identity (i.e. things like a = a) which may incorrectly reject rows with nulls. We already have some logic to remove this kind of conjuncts but the existing checks have exceptions. It's due to the wrong substitution map is used in checking identity predicates. Here's a patch to fix it: https://gerrit.cloudera.org/c/12939/ On Fri, Apr 5, 2019 at 3:40 AM Tim Armstrong <tarmstr...@cloudera.com> wrote: > This also reminds me of https://issues.apache.org/jira/browse/IMPALA-7957 > since there's an extra predicate added after the LEFT JOIN. > > On Thu, Apr 4, 2019 at 8:37 AM Quanlong Huang <huangquanl...@gmail.com> > wrote: > > > Yes. I'm interested in this and going to look deeper in it tomorrow. Just > > filed a JIRA: https://issues.apache.org/jira/browse/IMPALA-8386 > > > > On Thu, Apr 4, 2019 at 11:20 PM Todd Lipcon <t...@cloudera.com> wrote: > > > > > Sounds like we should file a high priority JIRA (any "wrong results" > bugs > > > should probably be considered critical or blocker). Quanlong, any > > interest > > > in working on this issue? > > > > > > -Todd > > > > > > On Thu, Apr 4, 2019 at 8:17 AM Quanlong Huang <huangquanl...@gmail.com > > > > > wrote: > > > > > > > +1 for Csaba's analysis. Looks like similiar to this > > > > https://issues.apache.org/jira/browse/IMPALA-3126 > > > > > > > > On Thu, Apr 4, 2019 at 11:08 PM Csaba Ringhofer < > > > csringho...@cloudera.com> > > > > wrote: > > > > > > > > > Hi! > > > > > > > > > > I have checked the queries, and I can verify that Impala > incorrectly > > > > > returns 1 row while the same query with Hive (or common sense..) > > > returns > > > > 2 > > > > > rows. > > > > > > > > > > > "but if remove the "t2.amount2" like this:" > > > > > Indeed, the issue seems to be related to returning the same > aggregate > > > > twice > > > > > + the fact that one of these values is NULL. The planner > introduces a > > > > > predicate that checks if amount1=amount2, which is false, if both > > > values > > > > > are NULL: > > > > > > > > > > explain select * from (select t2.a_id,t2.amount1,t2.amount2 > > > > > from( select a_id from a) t1 > > > > > left outer join ( > > > > > select c.a_id,sum(amount) as amount1,sum(amount) as amount2 > > > > > from b join c on b.b_id = c.b_id group by c.a_id) t2 > > > > > on t1.a_id = t2.a_id) t; > > > > > > > > > > results in: > > > > > PLAN-ROOT SINK > > > > > | > > > > > 05:HASH JOIN [RIGHT OUTER JOIN] > > > > > | hash predicates: c.a_id = a_id > > > > > | other predicates: sum(amount) = sum(amount) <----- I don't know > > why > > > > this > > > > > predicate is added. > > > > > | runtime filters: RF000 <- a_id > > > > > | row-size=16B cardinality=2 > > > > > .... > > > > > > > > > > If I EXPLAIN the query without the outer select, the sum(amount) = > > > > > sum(amount) is not added, which explains the difference. > > > > > > > > > > I do not know why the planner adds this predicate, my guess is that > > > this > > > > is > > > > > some kind of bug in Impala. > > > > > > > > > > > > > > > On Thu, Apr 4, 2019 at 2:27 PM skyyws <sky...@163.com> wrote: > > > > > > > > > > > Hi all, I met a problem of left outer join recently, and I > > reproduce > > > > this > > > > > > problem by some simple test data with three tables a, b, c: > > > > > > table A > > > > > > +------+ > > > > > > | a_id | > > > > > > +------+ > > > > > > | 1 | > > > > > > | 2 | > > > > > > +------+ > > > > > > table B > > > > > > +------+--------+ > > > > > > | b_id | amount | > > > > > > +------+--------+ > > > > > > | 1 | 10 | > > > > > > | 1 | 20 | > > > > > > | 2 | NULL | > > > > > > +------+--------+ > > > > > > table C > > > > > > +------+------+ > > > > > > | a_id | b_id | > > > > > > +------+------+ > > > > > > | 1 | 1 | > > > > > > | 2 | 2 | > > > > > > +------+------+ > > > > > > The sql below: > > > > > > select count(1) from ( > > > > > > select t2.a_id,t2.amount1,t2.amount2 > > > > > > from( select a_id from a) t1 > > > > > > left outer join ( > > > > > > select c.a_id,sum(amount) as amount1,sum(amount) as > amount2 > > > > > > from b join c on b.b_id = c.b_id group by c.a_id) t2 > > > > > > on t1.a_id = t2.a_id > > > > > > ) t; > > > > > > +----------+ > > > > > > | count(1) | > > > > > > +----------+ > > > > > > | 1 | > > > > > > +----------+ > > > > > > but if remove the "t2.amount2" like this: > > > > > > select count(1) from ( > > > > > > select t2.a_id,t2.amount1 > > > > > > from( select a_id from a) t1 > > > > > > left outer join ( > > > > > > select c.a_id,sum(amount) as amount1,sum(amount) as > amount2 > > > > > > from b join c on b.b_id = c.b_id group by c.a_id) t2 > > > > > > on t1.a_id = t2.a_id > > > > > > ) t; > > > > > > +----------+ > > > > > > | count(1) | > > > > > > +----------+ > > > > > > | 2 | > > > > > > +----------+ > > > > > > Here is the result of two subquery without count(1): > > > > > > +------+---------+---------+ > > > > > > | a_id | amount1 | amount2 | > > > > > > +------+---------+---------+ > > > > > > | 1 | 30 | 30 | > > > > > > | 2 | NULL | NULL | > > > > > > +------+---------+---------+ why the count(1) of this > > > > > > resultset is 1? > > > > > > +------+---------+ > > > > > > | a_id | amount1 | > > > > > > +------+---------+ > > > > > > | 1 | 30 | > > > > > > | 2 | NULL | > > > > > > +------+---------+ why the count(1) of > > this > > > > > > resultset is 2? > > > > > > I want to ask why the first sql return just 1, but second return > > 2,is > > > > > this > > > > > > correct or impala bug?How impala deal with count aggr.? > > > > > > If I change the sum to other aggr. function like count/max/min, > > > result > > > > is > > > > > > same. I test this on 2.12.0 and 3.1.0 version. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > Todd Lipcon > > > Software Engineer, Cloudera > > > > > >