Re: A problem of left outer join when statement contains two aggregation for the same column

Quanlong Huang Wed, 17 Apr 2019 09:45:47 -0700

After examining more codes, I have some updates for these two bugs:

--------------------- IMPALA-8386 ---------------------


>Predicates inferred by only "alias predicates" should not be created. They
will definitely introduce identity predicates. Although we filter them out
in SingleNodePlanner#migrateConjunctsToInlineView, the filter logics have
exceptions which cause IMPALA-8386. My patch
https://gerrit.cloudera.org/c/12939 can just fix the exceptions. It'd be
better if we could prevent suck kind of predicates being generated.

Predicates created from Analyzer#createEquivConjuncts won't be identities
directly. But they can be substituted into identities. The substitution
only happens in SingleNodePlanner#migrateConjunctsToInlineView where we
migrate the created predicates plus those unassigned predicates into the
inline view. So we just need to check potential identity predicates in
SingleNodePlanner#migrateConjunctsToInlineView after calling
Analyzer#createEquivConjuncts. For the checking, we need to resolve those
predicates into base table slots predicates. So I think patch
https://gerrit.cloudera.org/c/12939 is sufficient to fix IMPALA-8386.

--------------------- IMPALA-7957 ---------------------

>Predicates inferred by only inner scope predicates should not be assigned
to the outer scope (the opposite direction is ok and act as predicate
pushdown). This causes IMPALA-7957.

My statement in the last email is wrong. The problem is that slot
equivalences are enforced several times. Take this simpler query to
reproduce the bug:
SELECT a.id
FROM functional.alltypestiny a
LEFT JOIN
  (SELECT id, int_col
   FROM functional.alltypestiny b
   WHERE int_col = id) t
ON (a.id = t.id)
UNION ALL
VALUES (NULL)

Without the "UNION ALL VALUES (NULL)" clause, the plan is correct. With the
UNION clause, the plan has a SELECT node executing the predicate "b.id =
b.int_col" on top of the LEFT JOIN node, which wrongly rejects rows
containing nulls:

+------------------------------------------------------------+

| Explain String                                             |

+------------------------------------------------------------+

| Max Per-Host Resource Reservation: Memory=1.95MB Threads=5 |

| Per-Host Resource Estimates: Memory=162MB                  |

| Codegen disabled by planner                                |

|                                                            |

| PLAN-ROOT SINK                                             |

| |                                                          |

| 06:EXCHANGE [UNPARTITIONED]                                |

| |                                                          |

| 00:UNION                                                   |

| |  constant-operands=1                                     |

| |  row-size=4B cardinality=2                               |

| |                                                          |

| 04:SELECT                                                  |

| |  predicates: b.id = b.int_col      <-------------------------- Wrong
here!

| |  row-size=12B cardinality=1                              |

| |                                                          |

| 03:HASH JOIN [LEFT OUTER JOIN, BROADCAST]                  |

| |  hash predicates: a.id = id                              |

| |  row-size=12B cardinality=8                              |

| |                                                          |

| |--05:EXCHANGE [BROADCAST]                                 |

| |  |                                                       |

| |  02:SCAN HDFS [functional.alltypestiny b]                |

| |     partitions=4/4 files=4 size=460B                     |

| |     predicates: int_col = id                             |

| |     row-size=8B cardinality=1                            |

| |                                                          |

| 01:SCAN HDFS [functional.alltypestiny a]                   |

|    partitions=4/4 files=4 size=460B                        |

|    row-size=4B cardinality=8                               |

+------------------------------------------------------------+

Here are how this happen:
* After semantic analysis, there're 4 predicates. Two of them are from the
query: "b.int_col = b.id", "a.id = t.id". The other two are auxiliary
predicates generated for slot equavilance: "t.id = b.id", "t.int_col =
b.int_col".
* When migrating conjuncts into inline view t (happens in
SingleNodePlanner#migrateConjunctsToInlineView), a predicate "t.id =
t.int_col" is inferred. Then it's substituted (migrated) to "b.id =
b.int_col" and assigned to the scan node for b. When creating this scan
node, the redundant inferred predicate "b.id = b.int_col" is removed. (It's
redundant to the origin predicate "b.int_col = b.id").
* At the last step of creating the UNION operand for the JOIN side, if
there are still unassigned predicates, or we can still inferred some
predicates, the planner will create a SELECT node with those predicates on
top of the UNION operand. (See SingleNodePlanner#addUnassignedConjuncts).
Unfortunately, a predicate "b.id = b.int_col" is inferred again here thus
creating the incorrect SELECT node.

I just uploaded a patch for review: https://gerrit.cloudera.org/c/13051 .
The trace logs it adds can help to reveal this.

Can anyone help to review these two patches?
* https://gerrit.cloudera.org/c/12939
* https://gerrit.cloudera.org/c/13051

Thanks,
Quanlong

On Mon, Apr 8, 2019 at 12:02 PM Quanlong Huang <huangquanl...@gmail.com>
wrote:

> Hi Paul,
>
> I think you were tracking
> https://issues.apache.org/jira/browse/IMPALA-7957 where the predicate "c1
> = c2" in the WHERE clause of inline view t2 is wrongly duplicated to an
> upper SELECT node. Its cause differs from IMPALA-8386 in where a wrong
> identity predicate "amount = amount" is generated and wrongly rejects
> nulls. However, while digging deeper, I agree with you that the root cause
> of these two cases may be the same: we should distinguish different kinds
> of predicates and use them carefully.
>
> Take this query as an example:
>
> select 1
> from functional.alltypes a
> inner join
>   (select id+id as x, tinyint_col,
>           int_col*int_col as y, bigint_col
>    from functional.alltypessmall) b
> on a.id = b.x and a.id = b.tinyint_col and
>    a.int_col = b.y and a.int_col = b.bigint_col
>
> In the sematic analysis phase (happens in Analyzer), some predicates are
> generated from the original query (e.g. "a.id = b.x", "a.id =
> b.tinyint_col"). Some are generated for slot equivalence between inline
> view and the outer scope (e.g. "b.x = id + id", "b.tinyint_col =
> tinyint_col". Let's call them "alias predicates"). Then all these
> predicates will be used to generate a slot equivalence graph (i.e.
> Analyzer.valueTransferGraph. Vertices are slots. Edges mean equivilance).
>
> Then in the single node plan generation phase (happens in
> SingleNodePlanner), Impala will create some predicates base on slot
> equivalence (e.g. "id + id = tinyint_col" in the inline view b). This acts
> as optimizations like pushing down predicates of the outer scope into the
> inline view. If the generated predicates from the outer scope can't be
> pushed down to the inline view (due to LIMIT/OFFSET clauses or analytic
> functions existence in the inline view like the following query), Impala
> will create a SELECT node with these generated predicates on top of the
> inline view plan (happens in SingleNodePlanner#addUnassignedConjuncts,
> calling Analyzer#createEquivConjuncts).
>
> select 1
> from functional.alltypes a
> inner join
>   (select id+id as x, tinyint_col,
>           int_col*int_col as y, bigint_col
>    from functional.alltypessmall
>    order by tinyint_col *limit 20*) b
> on a.id = b.x and a.id = b.tinyint_col and
>    a.int_col = b.y and a.int_col = b.bigint_col
>
> The test cases in these commits better illustrate the behaviors:
>
> https://github.com/apache/impala/commit/0752b9fa624d3588f3b15dbb5d1bc60d517412f8
>
> https://github.com/apache/impala/commit/c8e928119d2787999542e34ca32ccec8d3a82a77
>
> Back to our problems, I think the root cause is some information is lost
> in the valueTranferGraph so we can't assure these:
>  * Predicates inferred by only "alias predicates" should not be created.
> They will definitely introduce identity predicates. Although we filter them
> out in SingleNodePlanner#migrateConjunctsToInlineView, the filter logics
> have exceptions which cause IMPALA-8386. My patch (
> https://gerrit.cloudera.org/c/12939/) can just fix the exceptions. It'd
> be better if we could prevent suck kind of predicates being generated.
>  * Predicates inferred by only inner scope predicates should not be
> assigned to the outer scope (the opposite direction is ok and act as
> predicate pushdown). This causes IMPALA-7957.
>
> I'm still looking for a thorough solution. Just updates my progress and
> welcome more discussions!
>
> Thanks,
> Quanlong
>
> On Sat, Apr 6, 2019 at 4:48 AM Paul Rogers <prog...@cloudera.com> wrote:
>
>> Hi Quanlong,
>>
>> This is a tricky issue. Your query is an outer join. In this case, Impala
>> has logic to “repeat” scan predicates at the outer join level:
>>
>> SELECT a, b
>>   FROM t1 outer t2 on (t1.id <http://t1.id/> = t2.id <http://t2.id/>)
>>   WHERE t2.c = 10
>>
>> This will cause the predicate “t2.c = 10” to appear both in the scan for
>> t2 and in the JOIN node for the outer join. Why? Because we should produce
>> the same result whether we filter before or after the join, an outer join
>> can introduce a null value for t2.c, and so we need to apply the predicate
>> a second time after the join to enforce the filter rule.
>>
>> In your case, two parts of Impala collide. The first is the dubious use
>> of predicates to track aliases. That is, in your query, Impala will create
>> a predicate “c.a_id = t1.a_id” to express the equivalence of the “two”
>> columns. This is dubious because aliases are just names; they don’t
>> actually introduce a new column.
>>
>> Said another way, an alias is not the same a join-equivalence, though
>> Impala seems to attempt to treat them as such.
>>
>> In any event, the code that handles “unassigned” outer join predicates
>> also seems to get triggered for the “phantom” alias predicates, as in your
>> case.
>>
>> I looked into this issue at one point, found a workaround, and moved onto
>> other things before I could track the issue down to its root cause in the
>> code. Perhaps your fix sheds light on how to make this bit of Impala’s
>> implementation work in the general case.
>>
>> Thanks,
>>
>> - Paul
>>
>> > On Apr 5, 2019, at 10:22 AM, Quanlong Huang <huangquanl...@gmail.com>
>> wrote:
>> >
>> > I use another query to reproduce the bug:
>> > explain select * from (select t2.a_id,t2.amount1,t2.amount2
>> >    from a
>> >    left outer join (
>> >        select c.a_id, amount as amount1, amount as amount2
>> >        from b join c  on b.b_id = c.b_id) t2
>> >    on a.a_id = t2.a_id) t1
>> >
>> > The creation of the "amount = amount" predicate should be traced to
>> > SingleNodePlanner#migrateConjunctsToInlineView when dealing with inline
>> > view t1. It's an optimization that Impala will create some predicates
>> based
>> > on slot equivalences (i.e. Analyzer#valueTransferGraph). However, these
>> > conjuncts may be identity (i.e. things like a = a) which may incorrectly
>> > reject rows with nulls. We already have some logic to remove this kind
>> of
>> > conjuncts but the existing checks have exceptions. It's due to the wrong
>> > substitution map is used in checking identity predicates.
>> >
>> > Here's a patch to fix it: https://gerrit.cloudera.org/c/12939/
>> >
>> > On Fri, Apr 5, 2019 at 3:40 AM Tim Armstrong <tarmstr...@cloudera.com>
>> > wrote:
>> >
>> >> This also reminds me of
>> https://issues.apache.org/jira/browse/IMPALA-7957
>> >> since there's an extra predicate added after the LEFT JOIN.
>> >>
>> >> On Thu, Apr 4, 2019 at 8:37 AM Quanlong Huang <huangquanl...@gmail.com
>> >
>> >> wrote:
>> >>
>> >>> Yes. I'm interested in this and going to look deeper in it tomorrow.
>> Just
>> >>> filed a JIRA: https://issues.apache.org/jira/browse/IMPALA-8386
>> >>>
>> >>> On Thu, Apr 4, 2019 at 11:20 PM Todd Lipcon <t...@cloudera.com>
>> wrote:
>> >>>
>> >>>> Sounds like we should file a high priority JIRA (any "wrong results"
>> >> bugs
>> >>>> should probably be considered critical or blocker). Quanlong, any
>> >>> interest
>> >>>> in working on this issue?
>> >>>>
>> >>>> -Todd
>> >>>>
>> >>>> On Thu, Apr 4, 2019 at 8:17 AM Quanlong Huang <
>> huangquanl...@gmail.com
>> >>>
>> >>>> wrote:
>> >>>>
>> >>>>> +1 for Csaba's analysis. Looks like similiar to this
>> >>>>> https://issues.apache.org/jira/browse/IMPALA-3126
>> >>>>>
>> >>>>> On Thu, Apr 4, 2019 at 11:08 PM Csaba Ringhofer <
>> >>>> csringho...@cloudera.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> Hi!
>> >>>>>>
>> >>>>>> I have checked the queries, and I can verify that Impala
>> >> incorrectly
>> >>>>>> returns 1 row while the same query with Hive (or common sense..)
>> >>>> returns
>> >>>>> 2
>> >>>>>> rows.
>> >>>>>>
>> >>>>>>> "but if remove the "t2.amount2"  like this："
>> >>>>>> Indeed, the issue seems to be related to returning the same
>> >> aggregate
>> >>>>> twice
>> >>>>>> + the fact that one of these values is NULL. The planner
>> >> introduces a
>> >>>>>> predicate that checks if amount1=amount2, which is false, if both
>> >>>> values
>> >>>>>> are NULL:
>> >>>>>>
>> >>>>>> explain select * from (select t2.a_id,t2.amount1,t2.amount2
>> >>>>>>    from( select a_id from a) t1
>> >>>>>>    left outer join (
>> >>>>>>        select c.a_id,sum(amount) as amount1,sum(amount) as amount2
>> >>>>>>        from b join c  on b.b_id = c.b_id group by c.a_id) t2
>> >>>>>>    on t1.a_id = t2.a_id) t;
>> >>>>>>
>> >>>>>> results in:
>> >>>>>> PLAN-ROOT SINK
>> >>>>>> |
>> >>>>>> 05:HASH JOIN [RIGHT OUTER JOIN]
>> >>>>>> |  hash predicates: c.a_id = a_id
>> >>>>>> |  other predicates: sum(amount) = sum(amount) <----- I don't know
>> >>> why
>> >>>>> this
>> >>>>>> predicate is added.
>> >>>>>> |  runtime filters: RF000 <- a_id
>> >>>>>> |  row-size=16B cardinality=2
>> >>>>>> ....
>> >>>>>>
>> >>>>>> If I EXPLAIN the query without the outer select, the sum(amount) =
>> >>>>>> sum(amount) is not added, which explains the difference.
>> >>>>>>
>> >>>>>> I do not know why the planner adds this predicate, my guess is that
>> >>>> this
>> >>>>> is
>> >>>>>> some kind of bug in Impala.
>> >>>>>>
>> >>>>>>
>> >>>>>> On Thu, Apr 4, 2019 at 2:27 PM skyyws <sky...@163.com> wrote:
>> >>>>>>
>> >>>>>>> Hi all, I met a problem of left outer join recently, and I
>> >>> reproduce
>> >>>>> this
>> >>>>>>> problem by some simple test data with three tables a, b, c：
>> >>>>>>> table A
>> >>>>>>> +------+
>> >>>>>>> | a_id |
>> >>>>>>> +------+
>> >>>>>>> | 1    |
>> >>>>>>> | 2    |
>> >>>>>>> +------+
>> >>>>>>> table B
>> >>>>>>> +------+--------+
>> >>>>>>> | b_id | amount |
>> >>>>>>> +------+--------+
>> >>>>>>> | 1    | 10     |
>> >>>>>>> | 1    | 20     |
>> >>>>>>> | 2    | NULL   |
>> >>>>>>> +------+--------+
>> >>>>>>> table C
>> >>>>>>> +------+------+
>> >>>>>>> | a_id | b_id |
>> >>>>>>> +------+------+
>> >>>>>>> | 1    | 1    |
>> >>>>>>> | 2    | 2    |
>> >>>>>>> +------+------+
>> >>>>>>> The sql below：
>> >>>>>>> select count(1) from (
>> >>>>>>>    select t2.a_id,t2.amount1,t2.amount2
>> >>>>>>>    from( select a_id from a) t1
>> >>>>>>>    left outer join (
>> >>>>>>>        select c.a_id,sum(amount) as amount1,sum(amount) as
>> >> amount2
>> >>>>>>>        from b join c  on b.b_id = c.b_id group by c.a_id) t2
>> >>>>>>>    on t1.a_id = t2.a_id
>> >>>>>>> ) t;
>> >>>>>>> +----------+
>> >>>>>>> | count(1) |
>> >>>>>>> +----------+
>> >>>>>>> | 1        |
>> >>>>>>> +----------+
>> >>>>>>> but if remove the "t2.amount2"  like this：
>> >>>>>>> select count(1) from (
>> >>>>>>>    select t2.a_id,t2.amount1
>> >>>>>>>    from( select a_id from a) t1
>> >>>>>>>    left outer join (
>> >>>>>>>        select c.a_id,sum(amount) as amount1,sum(amount) as
>> >> amount2
>> >>>>>>>        from b join c  on b.b_id = c.b_id group by c.a_id) t2
>> >>>>>>>    on t1.a_id = t2.a_id
>> >>>>>>> ) t;
>> >>>>>>> +----------+
>> >>>>>>> | count(1) |
>> >>>>>>> +----------+
>> >>>>>>> | 2        |
>> >>>>>>> +----------+
>> >>>>>>> Here is the result of two subquery without count(1)：
>> >>>>>>> +------+---------+---------+
>> >>>>>>> | a_id | amount1 | amount2 |
>> >>>>>>> +------+---------+---------+
>> >>>>>>> | 1    | 30      | 30      |
>> >>>>>>> | 2    | NULL    | NULL    |
>> >>>>>>> +------+---------+---------+             why the count(1) of this
>> >>>>>>> resultset is 1?
>> >>>>>>> +------+---------+
>> >>>>>>> | a_id | amount1 |
>> >>>>>>> +------+---------+
>> >>>>>>> | 1    | 30      |
>> >>>>>>> | 2    | NULL    |
>> >>>>>>> +------+---------+                           why the count(1) of
>> >>> this
>> >>>>>>> resultset is 2?
>> >>>>>>> I want to ask why the first sql return just 1， but second return
>> >>> 2，is
>> >>>>>> this
>> >>>>>>> correct or impala bug？How impala deal with count aggr.?
>> >>>>>>> If I change the sum to other aggr. function like count/max/min,
>> >>>> result
>> >>>>> is
>> >>>>>>> same. I test this on 2.12.0 and 3.1.0 version.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>
>> >>>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Todd Lipcon
>> >>>> Software Engineer, Cloudera
>> >>>>
>> >>>
>> >>
>>
>>

Re: A problem of left outer join when statement contains two aggregation for the same column

Reply via email to