+1 for Csaba's analysis. Looks like similiar to this https://issues.apache.org/jira/browse/IMPALA-3126
On Thu, Apr 4, 2019 at 11:08 PM Csaba Ringhofer <csringho...@cloudera.com> wrote: > Hi! > > I have checked the queries, and I can verify that Impala incorrectly > returns 1 row while the same query with Hive (or common sense..) returns 2 > rows. > > > "but if remove the "t2.amount2" like this:" > Indeed, the issue seems to be related to returning the same aggregate twice > + the fact that one of these values is NULL. The planner introduces a > predicate that checks if amount1=amount2, which is false, if both values > are NULL: > > explain select * from (select t2.a_id,t2.amount1,t2.amount2 > from( select a_id from a) t1 > left outer join ( > select c.a_id,sum(amount) as amount1,sum(amount) as amount2 > from b join c on b.b_id = c.b_id group by c.a_id) t2 > on t1.a_id = t2.a_id) t; > > results in: > PLAN-ROOT SINK > | > 05:HASH JOIN [RIGHT OUTER JOIN] > | hash predicates: c.a_id = a_id > | other predicates: sum(amount) = sum(amount) <----- I don't know why this > predicate is added. > | runtime filters: RF000 <- a_id > | row-size=16B cardinality=2 > .... > > If I EXPLAIN the query without the outer select, the sum(amount) = > sum(amount) is not added, which explains the difference. > > I do not know why the planner adds this predicate, my guess is that this is > some kind of bug in Impala. > > > On Thu, Apr 4, 2019 at 2:27 PM skyyws <sky...@163.com> wrote: > > > Hi all, I met a problem of left outer join recently, and I reproduce this > > problem by some simple test data with three tables a, b, c: > > table A > > +------+ > > | a_id | > > +------+ > > | 1 | > > | 2 | > > +------+ > > table B > > +------+--------+ > > | b_id | amount | > > +------+--------+ > > | 1 | 10 | > > | 1 | 20 | > > | 2 | NULL | > > +------+--------+ > > table C > > +------+------+ > > | a_id | b_id | > > +------+------+ > > | 1 | 1 | > > | 2 | 2 | > > +------+------+ > > The sql below: > > select count(1) from ( > > select t2.a_id,t2.amount1,t2.amount2 > > from( select a_id from a) t1 > > left outer join ( > > select c.a_id,sum(amount) as amount1,sum(amount) as amount2 > > from b join c on b.b_id = c.b_id group by c.a_id) t2 > > on t1.a_id = t2.a_id > > ) t; > > +----------+ > > | count(1) | > > +----------+ > > | 1 | > > +----------+ > > but if remove the "t2.amount2" like this: > > select count(1) from ( > > select t2.a_id,t2.amount1 > > from( select a_id from a) t1 > > left outer join ( > > select c.a_id,sum(amount) as amount1,sum(amount) as amount2 > > from b join c on b.b_id = c.b_id group by c.a_id) t2 > > on t1.a_id = t2.a_id > > ) t; > > +----------+ > > | count(1) | > > +----------+ > > | 2 | > > +----------+ > > Here is the result of two subquery without count(1): > > +------+---------+---------+ > > | a_id | amount1 | amount2 | > > +------+---------+---------+ > > | 1 | 30 | 30 | > > | 2 | NULL | NULL | > > +------+---------+---------+ why the count(1) of this > > resultset is 1? > > +------+---------+ > > | a_id | amount1 | > > +------+---------+ > > | 1 | 30 | > > | 2 | NULL | > > +------+---------+ why the count(1) of this > > resultset is 2? > > I want to ask why the first sql return just 1, but second return 2,is > this > > correct or impala bug?How impala deal with count aggr.? > > If I change the sum to other aggr. function like count/max/min, result is > > same. I test this on 2.12.0 and 3.1.0 version. > > > > >