Hi! I have checked the queries, and I can verify that Impala incorrectly returns 1 row while the same query with Hive (or common sense..) returns 2 rows.
> "but if remove the "t2.amount2" like this:" Indeed, the issue seems to be related to returning the same aggregate twice + the fact that one of these values is NULL. The planner introduces a predicate that checks if amount1=amount2, which is false, if both values are NULL: explain select * from (select t2.a_id,t2.amount1,t2.amount2 from( select a_id from a) t1 left outer join ( select c.a_id,sum(amount) as amount1,sum(amount) as amount2 from b join c on b.b_id = c.b_id group by c.a_id) t2 on t1.a_id = t2.a_id) t; results in: PLAN-ROOT SINK | 05:HASH JOIN [RIGHT OUTER JOIN] | hash predicates: c.a_id = a_id | other predicates: sum(amount) = sum(amount) <----- I don't know why this predicate is added. | runtime filters: RF000 <- a_id | row-size=16B cardinality=2 .... If I EXPLAIN the query without the outer select, the sum(amount) = sum(amount) is not added, which explains the difference. I do not know why the planner adds this predicate, my guess is that this is some kind of bug in Impala. On Thu, Apr 4, 2019 at 2:27 PM skyyws <sky...@163.com> wrote: > Hi all, I met a problem of left outer join recently, and I reproduce this > problem by some simple test data with three tables a, b, c: > table A > +------+ > | a_id | > +------+ > | 1 | > | 2 | > +------+ > table B > +------+--------+ > | b_id | amount | > +------+--------+ > | 1 | 10 | > | 1 | 20 | > | 2 | NULL | > +------+--------+ > table C > +------+------+ > | a_id | b_id | > +------+------+ > | 1 | 1 | > | 2 | 2 | > +------+------+ > The sql below: > select count(1) from ( > select t2.a_id,t2.amount1,t2.amount2 > from( select a_id from a) t1 > left outer join ( > select c.a_id,sum(amount) as amount1,sum(amount) as amount2 > from b join c on b.b_id = c.b_id group by c.a_id) t2 > on t1.a_id = t2.a_id > ) t; > +----------+ > | count(1) | > +----------+ > | 1 | > +----------+ > but if remove the "t2.amount2" like this: > select count(1) from ( > select t2.a_id,t2.amount1 > from( select a_id from a) t1 > left outer join ( > select c.a_id,sum(amount) as amount1,sum(amount) as amount2 > from b join c on b.b_id = c.b_id group by c.a_id) t2 > on t1.a_id = t2.a_id > ) t; > +----------+ > | count(1) | > +----------+ > | 2 | > +----------+ > Here is the result of two subquery without count(1): > +------+---------+---------+ > | a_id | amount1 | amount2 | > +------+---------+---------+ > | 1 | 30 | 30 | > | 2 | NULL | NULL | > +------+---------+---------+ why the count(1) of this > resultset is 1? > +------+---------+ > | a_id | amount1 | > +------+---------+ > | 1 | 30 | > | 2 | NULL | > +------+---------+ why the count(1) of this > resultset is 2? > I want to ask why the first sql return just 1, but second return 2,is this > correct or impala bug?How impala deal with count aggr.? > If I change the sum to other aggr. function like count/max/min, result is > same. I test this on 2.12.0 and 3.1.0 version. > >