Re: A problem of left outer join when statement contains two aggregation for the same column

Tim Armstrong Wed, 17 Apr 2019 12:48:50 -0700

Thank you for diving into this and the excellent descriptions of the
problem. I left comments on the first review. I think I could be convinced
that it's correct with a couple more tests.


I hope to get to the second review soon. I think it requires me to
understand the conjunct assignment in a bit more detail :)

On Wed, Apr 17, 2019 at 9:45 AM Quanlong Huang <[email protected]>
wrote:

> After examining more codes, I have some updates for these two bugs:
>
> --------------------- IMPALA-8386 ---------------------
>
> >Predicates inferred by only "alias predicates" should not be created. They
> will definitely introduce identity predicates. Although we filter them out
> in SingleNodePlanner#migrateConjunctsToInlineView, the filter logics have
> exceptions which cause IMPALA-8386. My patch
> https://gerrit.cloudera.org/c/12939 can just fix the exceptions. It'd be
> better if we could prevent suck kind of predicates being generated.
>
> Predicates created from Analyzer#createEquivConjuncts won't be identities
> directly. But they can be substituted into identities. The substitution
> only happens in SingleNodePlanner#migrateConjunctsToInlineView where we
> migrate the created predicates plus those unassigned predicates into the
> inline view. So we just need to check potential identity predicates in
> SingleNodePlanner#migrateConjunctsToInlineView after calling
> Analyzer#createEquivConjuncts. For the checking, we need to resolve those
> predicates into base table slots predicates. So I think patch
> https://gerrit.cloudera.org/c/12939 is sufficient to fix IMPALA-8386.
>
> --------------------- IMPALA-7957 ---------------------
>
> >Predicates inferred by only inner scope predicates should not be assigned
> to the outer scope (the opposite direction is ok and act as predicate
> pushdown). This causes IMPALA-7957.
>
> My statement in the last email is wrong. The problem is that slot
> equivalences are enforced several times. Take this simpler query to
> reproduce the bug:
> SELECT a.id
> FROM functional.alltypestiny a
> LEFT JOIN
>   (SELECT id, int_col
>    FROM functional.alltypestiny b
>    WHERE int_col = id) t
> ON (a.id = t.id)
> UNION ALL
> VALUES (NULL)
>
> Without the "UNION ALL VALUES (NULL)" clause, the plan is correct. With the
> UNION clause, the plan has a SELECT node executing the predicate "b.id =
> b.int_col" on top of the LEFT JOIN node, which wrongly rejects rows
> containing nulls:
>
> +------------------------------------------------------------+
>
> | Explain String                                             |
>
> +------------------------------------------------------------+
>
> | Max Per-Host Resource Reservation: Memory=1.95MB Threads=5 |
>
> | Per-Host Resource Estimates: Memory=162MB                  |
>
> | Codegen disabled by planner                                |
>
> |                                                            |
>
> | PLAN-ROOT SINK                                             |
>
> | |                                                          |
>
> | 06:EXCHANGE [UNPARTITIONED]                                |
>
> | |                                                          |
>
> | 00:UNION                                                   |
>
> | |  constant-operands=1                                     |
>
> | |  row-size=4B cardinality=2                               |
>
> | |                                                          |
>
> | 04:SELECT                                                  |
>
> | |  predicates: b.id = b.int_col      <-------------------------- Wrong
> here!
>
> | |  row-size=12B cardinality=1                              |
>
> | |                                                          |
>
> | 03:HASH JOIN [LEFT OUTER JOIN, BROADCAST]                  |
>
> | |  hash predicates: a.id = id                              |
>
> | |  row-size=12B cardinality=8                              |
>
> | |                                                          |
>
> | |--05:EXCHANGE [BROADCAST]                                 |
>
> | |  |                                                       |
>
> | |  02:SCAN HDFS [functional.alltypestiny b]                |
>
> | |     partitions=4/4 files=4 size=460B                     |
>
> | |     predicates: int_col = id                             |
>
> | |     row-size=8B cardinality=1                            |
>
> | |                                                          |
>
> | 01:SCAN HDFS [functional.alltypestiny a]                   |
>
> |    partitions=4/4 files=4 size=460B                        |
>
> |    row-size=4B cardinality=8                               |
>
> +------------------------------------------------------------+
>
> Here are how this happen:
> * After semantic analysis, there're 4 predicates. Two of them are from the
> query: "b.int_col = b.id", "a.id = t.id". The other two are auxiliary
> predicates generated for slot equavilance: "t.id = b.id", "t.int_col =
> b.int_col".
> * When migrating conjuncts into inline view t (happens in
> SingleNodePlanner#migrateConjunctsToInlineView), a predicate "t.id =
> t.int_col" is inferred. Then it's substituted (migrated) to "b.id =
> b.int_col" and assigned to the scan node for b. When creating this scan
> node, the redundant inferred predicate "b.id = b.int_col" is removed.
> (It's
> redundant to the origin predicate "b.int_col = b.id").
> * At the last step of creating the UNION operand for the JOIN side, if
> there are still unassigned predicates, or we can still inferred some
> predicates, the planner will create a SELECT node with those predicates on
> top of the UNION operand. (See SingleNodePlanner#addUnassignedConjuncts).
> Unfortunately, a predicate "b.id = b.int_col" is inferred again here thus
> creating the incorrect SELECT node.
>
> I just uploaded a patch for review: https://gerrit.cloudera.org/c/13051 .
> The trace logs it adds can help to reveal this.
>
> Can anyone help to review these two patches?
> * https://gerrit.cloudera.org/c/12939
> * https://gerrit.cloudera.org/c/13051
>
> Thanks,
> Quanlong
>
> On Mon, Apr 8, 2019 at 12:02 PM Quanlong Huang <[email protected]>
> wrote:
>
> > Hi Paul,
> >
> > I think you were tracking
> > https://issues.apache.org/jira/browse/IMPALA-7957 where the predicate
> "c1
> > = c2" in the WHERE clause of inline view t2 is wrongly duplicated to an
> > upper SELECT node. Its cause differs from IMPALA-8386 in where a wrong
> > identity predicate "amount = amount" is generated and wrongly rejects
> > nulls. However, while digging deeper, I agree with you that the root
> cause
> > of these two cases may be the same: we should distinguish different kinds
> > of predicates and use them carefully.
> >
> > Take this query as an example:
> >
> > select 1
> > from functional.alltypes a
> > inner join
> >   (select id+id as x, tinyint_col,
> >           int_col*int_col as y, bigint_col
> >    from functional.alltypessmall) b
> > on a.id = b.x and a.id = b.tinyint_col and
> >    a.int_col = b.y and a.int_col = b.bigint_col
> >
> > In the sematic analysis phase (happens in Analyzer), some predicates are
> > generated from the original query (e.g. "a.id = b.x", "a.id =
> > b.tinyint_col"). Some are generated for slot equivalence between inline
> > view and the outer scope (e.g. "b.x = id + id", "b.tinyint_col =
> > tinyint_col". Let's call them "alias predicates"). Then all these
> > predicates will be used to generate a slot equivalence graph (i.e.
> > Analyzer.valueTransferGraph. Vertices are slots. Edges mean equivilance).
> >
> > Then in the single node plan generation phase (happens in
> > SingleNodePlanner), Impala will create some predicates base on slot
> > equivalence (e.g. "id + id = tinyint_col" in the inline view b). This
> acts
> > as optimizations like pushing down predicates of the outer scope into the
> > inline view. If the generated predicates from the outer scope can't be
> > pushed down to the inline view (due to LIMIT/OFFSET clauses or analytic
> > functions existence in the inline view like the following query), Impala
> > will create a SELECT node with these generated predicates on top of the
> > inline view plan (happens in SingleNodePlanner#addUnassignedConjuncts,
> > calling Analyzer#createEquivConjuncts).
> >
> > select 1
> > from functional.alltypes a
> > inner join
> >   (select id+id as x, tinyint_col,
> >           int_col*int_col as y, bigint_col
> >    from functional.alltypessmall
> >    order by tinyint_col *limit 20*) b
> > on a.id = b.x and a.id = b.tinyint_col and
> >    a.int_col = b.y and a.int_col = b.bigint_col
> >
> > The test cases in these commits better illustrate the behaviors:
> >
> >
> https://github.com/apache/impala/commit/0752b9fa624d3588f3b15dbb5d1bc60d517412f8
> >
> >
> https://github.com/apache/impala/commit/c8e928119d2787999542e34ca32ccec8d3a82a77
> >
> > Back to our problems, I think the root cause is some information is lost
> > in the valueTranferGraph so we can't assure these:
> >  * Predicates inferred by only "alias predicates" should not be created.
> > They will definitely introduce identity predicates. Although we filter
> them
> > out in SingleNodePlanner#migrateConjunctsToInlineView, the filter logics
> > have exceptions which cause IMPALA-8386. My patch (
> > https://gerrit.cloudera.org/c/12939/) can just fix the exceptions. It'd
> > be better if we could prevent suck kind of predicates being generated.
> >  * Predicates inferred by only inner scope predicates should not be
> > assigned to the outer scope (the opposite direction is ok and act as
> > predicate pushdown). This causes IMPALA-7957.
> >
> > I'm still looking for a thorough solution. Just updates my progress and
> > welcome more discussions!
> >
> > Thanks,
> > Quanlong
> >
> > On Sat, Apr 6, 2019 at 4:48 AM Paul Rogers <[email protected]> wrote:
> >
> >> Hi Quanlong,
> >>
> >> This is a tricky issue. Your query is an outer join. In this case,
> Impala
> >> has logic to “repeat” scan predicates at the outer join level:
> >>
> >> SELECT a, b
> >>   FROM t1 outer t2 on (t1.id <http://t1.id/> = t2.id <http://t2.id/>)
> >>   WHERE t2.c = 10
> >>
> >> This will cause the predicate “t2.c = 10” to appear both in the scan for
> >> t2 and in the JOIN node for the outer join. Why? Because we should
> produce
> >> the same result whether we filter before or after the join, an outer
> join
> >> can introduce a null value for t2.c, and so we need to apply the
> predicate
> >> a second time after the join to enforce the filter rule.
> >>
> >> In your case, two parts of Impala collide. The first is the dubious use
> >> of predicates to track aliases. That is, in your query, Impala will
> create
> >> a predicate “c.a_id = t1.a_id” to express the equivalence of the “two”
> >> columns. This is dubious because aliases are just names; they don’t
> >> actually introduce a new column.
> >>
> >> Said another way, an alias is not the same a join-equivalence, though
> >> Impala seems to attempt to treat them as such.
> >>
> >> In any event, the code that handles “unassigned” outer join predicates
> >> also seems to get triggered for the “phantom” alias predicates, as in
> your
> >> case.
> >>
> >> I looked into this issue at one point, found a workaround, and moved
> onto
> >> other things before I could track the issue down to its root cause in
> the
> >> code. Perhaps your fix sheds light on how to make this bit of Impala’s
> >> implementation work in the general case.
> >>
> >> Thanks,
> >>
> >> - Paul
> >>
> >> > On Apr 5, 2019, at 10:22 AM, Quanlong Huang <[email protected]>
> >> wrote:
> >> >
> >> > I use another query to reproduce the bug:
> >> > explain select * from (select t2.a_id,t2.amount1,t2.amount2
> >> >    from a
> >> >    left outer join (
> >> >        select c.a_id, amount as amount1, amount as amount2
> >> >        from b join c  on b.b_id = c.b_id) t2
> >> >    on a.a_id = t2.a_id) t1
> >> >
> >> > The creation of the "amount = amount" predicate should be traced to
> >> > SingleNodePlanner#migrateConjunctsToInlineView when dealing with
> inline
> >> > view t1. It's an optimization that Impala will create some predicates
> >> based
> >> > on slot equivalences (i.e. Analyzer#valueTransferGraph). However,
> these
> >> > conjuncts may be identity (i.e. things like a = a) which may
> incorrectly
> >> > reject rows with nulls. We already have some logic to remove this kind
> >> of
> >> > conjuncts but the existing checks have exceptions. It's due to the
> wrong
> >> > substitution map is used in checking identity predicates.
> >> >
> >> > Here's a patch to fix it: https://gerrit.cloudera.org/c/12939/
> >> >
> >> > On Fri, Apr 5, 2019 at 3:40 AM Tim Armstrong <[email protected]
> >
> >> > wrote:
> >> >
> >> >> This also reminds me of
> >> https://issues.apache.org/jira/browse/IMPALA-7957
> >> >> since there's an extra predicate added after the LEFT JOIN.
> >> >>
> >> >> On Thu, Apr 4, 2019 at 8:37 AM Quanlong Huang <
> [email protected]
> >> >
> >> >> wrote:
> >> >>
> >> >>> Yes. I'm interested in this and going to look deeper in it tomorrow.
> >> Just
> >> >>> filed a JIRA: https://issues.apache.org/jira/browse/IMPALA-8386
> >> >>>
> >> >>> On Thu, Apr 4, 2019 at 11:20 PM Todd Lipcon <[email protected]>
> >> wrote:
> >> >>>
> >> >>>> Sounds like we should file a high priority JIRA (any "wrong
> results"
> >> >> bugs
> >> >>>> should probably be considered critical or blocker). Quanlong, any
> >> >>> interest
> >> >>>> in working on this issue?
> >> >>>>
> >> >>>> -Todd
> >> >>>>
> >> >>>> On Thu, Apr 4, 2019 at 8:17 AM Quanlong Huang <
> >> [email protected]
> >> >>>
> >> >>>> wrote:
> >> >>>>
> >> >>>>> +1 for Csaba's analysis. Looks like similiar to this
> >> >>>>> https://issues.apache.org/jira/browse/IMPALA-3126
> >> >>>>>
> >> >>>>> On Thu, Apr 4, 2019 at 11:08 PM Csaba Ringhofer <
> >> >>>> [email protected]>
> >> >>>>> wrote:
> >> >>>>>
> >> >>>>>> Hi!
> >> >>>>>>
> >> >>>>>> I have checked the queries, and I can verify that Impala
> >> >> incorrectly
> >> >>>>>> returns 1 row while the same query with Hive (or common sense..)
> >> >>>> returns
> >> >>>>> 2
> >> >>>>>> rows.
> >> >>>>>>
> >> >>>>>>> "but if remove the "t2.amount2"  like this："
> >> >>>>>> Indeed, the issue seems to be related to returning the same
> >> >> aggregate
> >> >>>>> twice
> >> >>>>>> + the fact that one of these values is NULL. The planner
> >> >> introduces a
> >> >>>>>> predicate that checks if amount1=amount2, which is false, if both
> >> >>>> values
> >> >>>>>> are NULL:
> >> >>>>>>
> >> >>>>>> explain select * from (select t2.a_id,t2.amount1,t2.amount2
> >> >>>>>>    from( select a_id from a) t1
> >> >>>>>>    left outer join (
> >> >>>>>>        select c.a_id,sum(amount) as amount1,sum(amount) as
> amount2
> >> >>>>>>        from b join c  on b.b_id = c.b_id group by c.a_id) t2
> >> >>>>>>    on t1.a_id = t2.a_id) t;
> >> >>>>>>
> >> >>>>>> results in:
> >> >>>>>> PLAN-ROOT SINK
> >> >>>>>> |
> >> >>>>>> 05:HASH JOIN [RIGHT OUTER JOIN]
> >> >>>>>> |  hash predicates: c.a_id = a_id
> >> >>>>>> |  other predicates: sum(amount) = sum(amount) <----- I don't
> know
> >> >>> why
> >> >>>>> this
> >> >>>>>> predicate is added.
> >> >>>>>> |  runtime filters: RF000 <- a_id
> >> >>>>>> |  row-size=16B cardinality=2
> >> >>>>>> ....
> >> >>>>>>
> >> >>>>>> If I EXPLAIN the query without the outer select, the sum(amount)
> =
> >> >>>>>> sum(amount) is not added, which explains the difference.
> >> >>>>>>
> >> >>>>>> I do not know why the planner adds this predicate, my guess is
> that
> >> >>>> this
> >> >>>>> is
> >> >>>>>> some kind of bug in Impala.
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On Thu, Apr 4, 2019 at 2:27 PM skyyws <[email protected]> wrote:
> >> >>>>>>
> >> >>>>>>> Hi all, I met a problem of left outer join recently, and I
> >> >>> reproduce
> >> >>>>> this
> >> >>>>>>> problem by some simple test data with three tables a, b, c：
> >> >>>>>>> table A
> >> >>>>>>> +------+
> >> >>>>>>> | a_id |
> >> >>>>>>> +------+
> >> >>>>>>> | 1    |
> >> >>>>>>> | 2    |
> >> >>>>>>> +------+
> >> >>>>>>> table B
> >> >>>>>>> +------+--------+
> >> >>>>>>> | b_id | amount |
> >> >>>>>>> +------+--------+
> >> >>>>>>> | 1    | 10     |
> >> >>>>>>> | 1    | 20     |
> >> >>>>>>> | 2    | NULL   |
> >> >>>>>>> +------+--------+
> >> >>>>>>> table C
> >> >>>>>>> +------+------+
> >> >>>>>>> | a_id | b_id |
> >> >>>>>>> +------+------+
> >> >>>>>>> | 1    | 1    |
> >> >>>>>>> | 2    | 2    |
> >> >>>>>>> +------+------+
> >> >>>>>>> The sql below：
> >> >>>>>>> select count(1) from (
> >> >>>>>>>    select t2.a_id,t2.amount1,t2.amount2
> >> >>>>>>>    from( select a_id from a) t1
> >> >>>>>>>    left outer join (
> >> >>>>>>>        select c.a_id,sum(amount) as amount1,sum(amount) as
> >> >> amount2
> >> >>>>>>>        from b join c  on b.b_id = c.b_id group by c.a_id) t2
> >> >>>>>>>    on t1.a_id = t2.a_id
> >> >>>>>>> ) t;
> >> >>>>>>> +----------+
> >> >>>>>>> | count(1) |
> >> >>>>>>> +----------+
> >> >>>>>>> | 1        |
> >> >>>>>>> +----------+
> >> >>>>>>> but if remove the "t2.amount2"  like this：
> >> >>>>>>> select count(1) from (
> >> >>>>>>>    select t2.a_id,t2.amount1
> >> >>>>>>>    from( select a_id from a) t1
> >> >>>>>>>    left outer join (
> >> >>>>>>>        select c.a_id,sum(amount) as amount1,sum(amount) as
> >> >> amount2
> >> >>>>>>>        from b join c  on b.b_id = c.b_id group by c.a_id) t2
> >> >>>>>>>    on t1.a_id = t2.a_id
> >> >>>>>>> ) t;
> >> >>>>>>> +----------+
> >> >>>>>>> | count(1) |
> >> >>>>>>> +----------+
> >> >>>>>>> | 2        |
> >> >>>>>>> +----------+
> >> >>>>>>> Here is the result of two subquery without count(1)：
> >> >>>>>>> +------+---------+---------+
> >> >>>>>>> | a_id | amount1 | amount2 |
> >> >>>>>>> +------+---------+---------+
> >> >>>>>>> | 1    | 30      | 30      |
> >> >>>>>>> | 2    | NULL    | NULL    |
> >> >>>>>>> +------+---------+---------+             why the count(1) of
> this
> >> >>>>>>> resultset is 1?
> >> >>>>>>> +------+---------+
> >> >>>>>>> | a_id | amount1 |
> >> >>>>>>> +------+---------+
> >> >>>>>>> | 1    | 30      |
> >> >>>>>>> | 2    | NULL    |
> >> >>>>>>> +------+---------+                           why the count(1) of
> >> >>> this
> >> >>>>>>> resultset is 2?
> >> >>>>>>> I want to ask why the first sql return just 1， but second return
> >> >>> 2，is
> >> >>>>>> this
> >> >>>>>>> correct or impala bug？How impala deal with count aggr.?
> >> >>>>>>> If I change the sum to other aggr. function like count/max/min,
> >> >>>> result
> >> >>>>> is
> >> >>>>>>> same. I test this on 2.12.0 and 3.1.0 version.
> >> >>>>>>>
> >> >>>>>>>
> >> >>>>>>
> >> >>>>>
> >> >>>>
> >> >>>>
> >> >>>> --
> >> >>>> Todd Lipcon
> >> >>>> Software Engineer, Cloudera
> >> >>>>
> >> >>>
> >> >>
> >>
> >>
>

Re: A problem of left outer join when statement contains two aggregation for the same column

Reply via email to