Re: A problem of left outer join when statement contains two aggregation for the same column

Tim Armstrong Thu, 04 Apr 2019 12:40:55 -0700

This also reminds me of https://issues.apache.org/jira/browse/IMPALA-7957
since there's an extra predicate added after the LEFT JOIN.


On Thu, Apr 4, 2019 at 8:37 AM Quanlong Huang <huangquanl...@gmail.com>
wrote:

> Yes. I'm interested in this and going to look deeper in it tomorrow. Just
> filed a JIRA: https://issues.apache.org/jira/browse/IMPALA-8386
>
> On Thu, Apr 4, 2019 at 11:20 PM Todd Lipcon <t...@cloudera.com> wrote:
>
> > Sounds like we should file a high priority JIRA (any "wrong results" bugs
> > should probably be considered critical or blocker). Quanlong, any
> interest
> > in working on this issue?
> >
> > -Todd
> >
> > On Thu, Apr 4, 2019 at 8:17 AM Quanlong Huang <huangquanl...@gmail.com>
> > wrote:
> >
> > > +1 for Csaba's analysis. Looks like similiar to this
> > > https://issues.apache.org/jira/browse/IMPALA-3126
> > >
> > > On Thu, Apr 4, 2019 at 11:08 PM Csaba Ringhofer <
> > csringho...@cloudera.com>
> > > wrote:
> > >
> > > > Hi!
> > > >
> > > > I have checked the queries, and I can verify that Impala incorrectly
> > > > returns 1 row while the same query with Hive (or common sense..)
> > returns
> > > 2
> > > > rows.
> > > >
> > > > > "but if remove the "t2.amount2"  like this："
> > > > Indeed, the issue seems to be related to returning the same aggregate
> > > twice
> > > > + the fact that one of these values is NULL. The planner introduces a
> > > > predicate that checks if amount1=amount2, which is false, if both
> > values
> > > > are NULL:
> > > >
> > > > explain select * from (select t2.a_id,t2.amount1,t2.amount2
> > > >     from( select a_id from a) t1
> > > >     left outer join (
> > > >         select c.a_id,sum(amount) as amount1,sum(amount) as amount2
> > > >         from b join c  on b.b_id = c.b_id group by c.a_id) t2
> > > >     on t1.a_id = t2.a_id) t;
> > > >
> > > > results in:
> > > > PLAN-ROOT SINK
> > > > |
> > > > 05:HASH JOIN [RIGHT OUTER JOIN]
> > > > |  hash predicates: c.a_id = a_id
> > > > |  other predicates: sum(amount) = sum(amount) <----- I don't know
> why
> > > this
> > > > predicate is added.
> > > > |  runtime filters: RF000 <- a_id
> > > > |  row-size=16B cardinality=2
> > > > ....
> > > >
> > > > If I EXPLAIN the query without the outer select, the sum(amount) =
> > > > sum(amount) is not added, which explains the difference.
> > > >
> > > > I do not know why the planner adds this predicate, my guess is that
> > this
> > > is
> > > > some kind of bug in Impala.
> > > >
> > > >
> > > > On Thu, Apr 4, 2019 at 2:27 PM skyyws <sky...@163.com> wrote:
> > > >
> > > > > Hi all, I met a problem of left outer join recently, and I
> reproduce
> > > this
> > > > > problem by some simple test data with three tables a, b, c：
> > > > > table A
> > > > > +------+
> > > > > | a_id |
> > > > > +------+
> > > > > | 1    |
> > > > > | 2    |
> > > > > +------+
> > > > > table B
> > > > > +------+--------+
> > > > > | b_id | amount |
> > > > > +------+--------+
> > > > > | 1    | 10     |
> > > > > | 1    | 20     |
> > > > > | 2    | NULL   |
> > > > > +------+--------+
> > > > > table C
> > > > > +------+------+
> > > > > | a_id | b_id |
> > > > > +------+------+
> > > > > | 1    | 1    |
> > > > > | 2    | 2    |
> > > > > +------+------+
> > > > > The sql below：
> > > > > select count(1) from (
> > > > >     select t2.a_id,t2.amount1,t2.amount2
> > > > >     from( select a_id from a) t1
> > > > >     left outer join (
> > > > >         select c.a_id,sum(amount) as amount1,sum(amount) as amount2
> > > > >         from b join c  on b.b_id = c.b_id group by c.a_id) t2
> > > > >     on t1.a_id = t2.a_id
> > > > > ) t;
> > > > > +----------+
> > > > > | count(1) |
> > > > > +----------+
> > > > > | 1        |
> > > > > +----------+
> > > > > but if remove the "t2.amount2"  like this：
> > > > > select count(1) from (
> > > > >     select t2.a_id,t2.amount1
> > > > >     from( select a_id from a) t1
> > > > >     left outer join (
> > > > >         select c.a_id,sum(amount) as amount1,sum(amount) as amount2
> > > > >         from b join c  on b.b_id = c.b_id group by c.a_id) t2
> > > > >     on t1.a_id = t2.a_id
> > > > > ) t;
> > > > > +----------+
> > > > > | count(1) |
> > > > > +----------+
> > > > > | 2        |
> > > > > +----------+
> > > > > Here is the result of two subquery without count(1)：
> > > > > +------+---------+---------+
> > > > > | a_id | amount1 | amount2 |
> > > > > +------+---------+---------+
> > > > > | 1    | 30      | 30      |
> > > > > | 2    | NULL    | NULL    |
> > > > > +------+---------+---------+             why the count(1) of this
> > > > > resultset is 1?
> > > > > +------+---------+
> > > > > | a_id | amount1 |
> > > > > +------+---------+
> > > > > | 1    | 30      |
> > > > > | 2    | NULL    |
> > > > > +------+---------+                           why the count(1) of
> this
> > > > > resultset is 2?
> > > > > I want to ask why the first sql return just 1， but second return
> 2，is
> > > > this
> > > > > correct or impala bug？How impala deal with count aggr.?
> > > > > If I change the sum to other aggr. function like count/max/min,
> > result
> > > is
> > > > > same. I test this on 2.12.0 and 3.1.0 version.
> > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>

Re: A problem of left outer join when statement contains two aggregation for the same column

Reply via email to