I use another query to reproduce the bug:
explain select * from (select t2.a_id,t2.amount1,t2.amount2
    from a
    left outer join (
        select c.a_id, amount as amount1, amount as amount2
        from b join c  on b.b_id = c.b_id) t2
    on a.a_id = t2.a_id) t1

The creation of the "amount = amount" predicate should be traced to
SingleNodePlanner#migrateConjunctsToInlineView when dealing with inline
view t1. It's an optimization that Impala will create some predicates based
on slot equivalences (i.e. Analyzer#valueTransferGraph). However, these
conjuncts may be identity (i.e. things like a = a) which may incorrectly
reject rows with nulls. We already have some logic to remove this kind of
conjuncts but the existing checks have exceptions. It's due to the wrong
substitution map is used in checking identity predicates.

Here's a patch to fix it: https://gerrit.cloudera.org/c/12939/

On Fri, Apr 5, 2019 at 3:40 AM Tim Armstrong <tarmstr...@cloudera.com>
wrote:

> This also reminds me of https://issues.apache.org/jira/browse/IMPALA-7957
> since there's an extra predicate added after the LEFT JOIN.
>
> On Thu, Apr 4, 2019 at 8:37 AM Quanlong Huang <huangquanl...@gmail.com>
> wrote:
>
> > Yes. I'm interested in this and going to look deeper in it tomorrow. Just
> > filed a JIRA: https://issues.apache.org/jira/browse/IMPALA-8386
> >
> > On Thu, Apr 4, 2019 at 11:20 PM Todd Lipcon <t...@cloudera.com> wrote:
> >
> > > Sounds like we should file a high priority JIRA (any "wrong results"
> bugs
> > > should probably be considered critical or blocker). Quanlong, any
> > interest
> > > in working on this issue?
> > >
> > > -Todd
> > >
> > > On Thu, Apr 4, 2019 at 8:17 AM Quanlong Huang <huangquanl...@gmail.com
> >
> > > wrote:
> > >
> > > > +1 for Csaba's analysis. Looks like similiar to this
> > > > https://issues.apache.org/jira/browse/IMPALA-3126
> > > >
> > > > On Thu, Apr 4, 2019 at 11:08 PM Csaba Ringhofer <
> > > csringho...@cloudera.com>
> > > > wrote:
> > > >
> > > > > Hi!
> > > > >
> > > > > I have checked the queries, and I can verify that Impala
> incorrectly
> > > > > returns 1 row while the same query with Hive (or common sense..)
> > > returns
> > > > 2
> > > > > rows.
> > > > >
> > > > > > "but if remove the "t2.amount2"  like this:"
> > > > > Indeed, the issue seems to be related to returning the same
> aggregate
> > > > twice
> > > > > + the fact that one of these values is NULL. The planner
> introduces a
> > > > > predicate that checks if amount1=amount2, which is false, if both
> > > values
> > > > > are NULL:
> > > > >
> > > > > explain select * from (select t2.a_id,t2.amount1,t2.amount2
> > > > >     from( select a_id from a) t1
> > > > >     left outer join (
> > > > >         select c.a_id,sum(amount) as amount1,sum(amount) as amount2
> > > > >         from b join c  on b.b_id = c.b_id group by c.a_id) t2
> > > > >     on t1.a_id = t2.a_id) t;
> > > > >
> > > > > results in:
> > > > > PLAN-ROOT SINK
> > > > > |
> > > > > 05:HASH JOIN [RIGHT OUTER JOIN]
> > > > > |  hash predicates: c.a_id = a_id
> > > > > |  other predicates: sum(amount) = sum(amount) <----- I don't know
> > why
> > > > this
> > > > > predicate is added.
> > > > > |  runtime filters: RF000 <- a_id
> > > > > |  row-size=16B cardinality=2
> > > > > ....
> > > > >
> > > > > If I EXPLAIN the query without the outer select, the sum(amount) =
> > > > > sum(amount) is not added, which explains the difference.
> > > > >
> > > > > I do not know why the planner adds this predicate, my guess is that
> > > this
> > > > is
> > > > > some kind of bug in Impala.
> > > > >
> > > > >
> > > > > On Thu, Apr 4, 2019 at 2:27 PM skyyws <sky...@163.com> wrote:
> > > > >
> > > > > > Hi all, I met a problem of left outer join recently, and I
> > reproduce
> > > > this
> > > > > > problem by some simple test data with three tables a, b, c:
> > > > > > table A
> > > > > > +------+
> > > > > > | a_id |
> > > > > > +------+
> > > > > > | 1    |
> > > > > > | 2    |
> > > > > > +------+
> > > > > > table B
> > > > > > +------+--------+
> > > > > > | b_id | amount |
> > > > > > +------+--------+
> > > > > > | 1    | 10     |
> > > > > > | 1    | 20     |
> > > > > > | 2    | NULL   |
> > > > > > +------+--------+
> > > > > > table C
> > > > > > +------+------+
> > > > > > | a_id | b_id |
> > > > > > +------+------+
> > > > > > | 1    | 1    |
> > > > > > | 2    | 2    |
> > > > > > +------+------+
> > > > > > The sql below:
> > > > > > select count(1) from (
> > > > > >     select t2.a_id,t2.amount1,t2.amount2
> > > > > >     from( select a_id from a) t1
> > > > > >     left outer join (
> > > > > >         select c.a_id,sum(amount) as amount1,sum(amount) as
> amount2
> > > > > >         from b join c  on b.b_id = c.b_id group by c.a_id) t2
> > > > > >     on t1.a_id = t2.a_id
> > > > > > ) t;
> > > > > > +----------+
> > > > > > | count(1) |
> > > > > > +----------+
> > > > > > | 1        |
> > > > > > +----------+
> > > > > > but if remove the "t2.amount2"  like this:
> > > > > > select count(1) from (
> > > > > >     select t2.a_id,t2.amount1
> > > > > >     from( select a_id from a) t1
> > > > > >     left outer join (
> > > > > >         select c.a_id,sum(amount) as amount1,sum(amount) as
> amount2
> > > > > >         from b join c  on b.b_id = c.b_id group by c.a_id) t2
> > > > > >     on t1.a_id = t2.a_id
> > > > > > ) t;
> > > > > > +----------+
> > > > > > | count(1) |
> > > > > > +----------+
> > > > > > | 2        |
> > > > > > +----------+
> > > > > > Here is the result of two subquery without count(1):
> > > > > > +------+---------+---------+
> > > > > > | a_id | amount1 | amount2 |
> > > > > > +------+---------+---------+
> > > > > > | 1    | 30      | 30      |
> > > > > > | 2    | NULL    | NULL    |
> > > > > > +------+---------+---------+             why the count(1) of this
> > > > > > resultset is 1?
> > > > > > +------+---------+
> > > > > > | a_id | amount1 |
> > > > > > +------+---------+
> > > > > > | 1    | 30      |
> > > > > > | 2    | NULL    |
> > > > > > +------+---------+                           why the count(1) of
> > this
> > > > > > resultset is 2?
> > > > > > I want to ask why the first sql return just 1, but second return
> > 2,is
> > > > > this
> > > > > > correct or impala bug?How impala deal with count aggr.?
> > > > > > If I change the sum to other aggr. function like count/max/min,
> > > result
> > > > is
> > > > > > same. I test this on 2.12.0 and 3.1.0 version.
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Todd Lipcon
> > > Software Engineer, Cloudera
> > >
> >
>

Reply via email to