Re: A problem of left outer join when statement contains two aggregation for the same column

Quanlong Huang Sun, 07 Apr 2019 21:04:42 -0700

Hi Paul,

I think you were tracking https://issues.apache.org/jira/browse/IMPALA-7957
where the predicate "c1 = c2" in the WHERE clause of inline view t2 is
wrongly duplicated to an upper SELECT node. Its cause differs from
IMPALA-8386 in where a wrong identity predicate "amount = amount" is
generated and wrongly rejects nulls. However, while digging deeper, I agree
with you that the root cause of these two cases may be the same: we should
distinguish different kinds of predicates and use them carefully.


Take this query as an example:

select 1
from functional.alltypes a
inner join
  (select id+id as x, tinyint_col,
          int_col*int_col as y, bigint_col
   from functional.alltypessmall) b
on a.id = b.x and a.id = b.tinyint_col and
   a.int_col = b.y and a.int_col = b.bigint_col

In the sematic analysis phase (happens in Analyzer), some predicates are
generated from the original query (e.g. "a.id = b.x", "a.id =
b.tinyint_col"). Some are generated for slot equivalence between inline
view and the outer scope (e.g. "b.x = id + id", "b.tinyint_col =
tinyint_col". Let's call them "alias predicates"). Then all these
predicates will be used to generate a slot equivalence graph (i.e.
Analyzer.valueTransferGraph. Vertices are slots. Edges mean equivilance).

Then in the single node plan generation phase (happens in
SingleNodePlanner), Impala will create some predicates base on slot
equivalence (e.g. "id + id = tinyint_col" in the inline view b). This acts
as optimizations like pushing down predicates of the outer scope into the
inline view. If the generated predicates from the outer scope can't be
pushed down to the inline view (due to LIMIT/OFFSET clauses or analytic
functions existence in the inline view like the following query), Impala
will create a SELECT node with these generated predicates on top of the
inline view plan (happens in SingleNodePlanner#addUnassignedConjuncts,
calling Analyzer#createEquivConjuncts).

select 1
from functional.alltypes a
inner join
  (select id+id as x, tinyint_col,
          int_col*int_col as y, bigint_col
   from functional.alltypessmall
   order by tinyint_col *limit 20*) b
on a.id = b.x and a.id = b.tinyint_col and
   a.int_col = b.y and a.int_col = b.bigint_col

The test cases in these commits better illustrate the behaviors:
https://github.com/apache/impala/commit/0752b9fa624d3588f3b15dbb5d1bc60d517412f8
https://github.com/apache/impala/commit/c8e928119d2787999542e34ca32ccec8d3a82a77

Back to our problems, I think the root cause is some information is lost in
the valueTranferGraph so we can't assure these:
 * Predicates inferred by only "alias predicates" should not be created.
They will definitely introduce identity predicates. Although we filter them
out in SingleNodePlanner#migrateConjunctsToInlineView, the filter logics
have exceptions which cause IMPALA-8386. My patch (
https://gerrit.cloudera.org/c/12939/) can just fix the exceptions. It'd be
better if we could prevent suck kind of predicates being generated.
 * Predicates inferred by only inner scope predicates should not be
assigned to the outer scope (the opposite direction is ok and act as
predicate pushdown). This causes IMPALA-7957.

I'm still looking for a thorough solution. Just updates my progress and
welcome more discussions!

Thanks,
Quanlong

On Sat, Apr 6, 2019 at 4:48 AM Paul Rogers <[email protected]> wrote:

> Hi Quanlong,
>
> This is a tricky issue. Your query is an outer join. In this case, Impala
> has logic to “repeat” scan predicates at the outer join level:
>
> SELECT a, b
>   FROM t1 outer t2 on (t1.id <http://t1.id/> = t2.id <http://t2.id/>)
>   WHERE t2.c = 10
>
> This will cause the predicate “t2.c = 10” to appear both in the scan for
> t2 and in the JOIN node for the outer join. Why? Because we should produce
> the same result whether we filter before or after the join, an outer join
> can introduce a null value for t2.c, and so we need to apply the predicate
> a second time after the join to enforce the filter rule.
>
> In your case, two parts of Impala collide. The first is the dubious use of
> predicates to track aliases. That is, in your query, Impala will create a
> predicate “c.a_id = t1.a_id” to express the equivalence of the “two”
> columns. This is dubious because aliases are just names; they don’t
> actually introduce a new column.
>
> Said another way, an alias is not the same a join-equivalence, though
> Impala seems to attempt to treat them as such.
>
> In any event, the code that handles “unassigned” outer join predicates
> also seems to get triggered for the “phantom” alias predicates, as in your
> case.
>
> I looked into this issue at one point, found a workaround, and moved onto
> other things before I could track the issue down to its root cause in the
> code. Perhaps your fix sheds light on how to make this bit of Impala’s
> implementation work in the general case.
>
> Thanks,
>
> - Paul
>
> > On Apr 5, 2019, at 10:22 AM, Quanlong Huang <[email protected]>
> wrote:
> >
> > I use another query to reproduce the bug:
> > explain select * from (select t2.a_id,t2.amount1,t2.amount2
> >    from a
> >    left outer join (
> >        select c.a_id, amount as amount1, amount as amount2
> >        from b join c  on b.b_id = c.b_id) t2
> >    on a.a_id = t2.a_id) t1
> >
> > The creation of the "amount = amount" predicate should be traced to
> > SingleNodePlanner#migrateConjunctsToInlineView when dealing with inline
> > view t1. It's an optimization that Impala will create some predicates
> based
> > on slot equivalences (i.e. Analyzer#valueTransferGraph). However, these
> > conjuncts may be identity (i.e. things like a = a) which may incorrectly
> > reject rows with nulls. We already have some logic to remove this kind of
> > conjuncts but the existing checks have exceptions. It's due to the wrong
> > substitution map is used in checking identity predicates.
> >
> > Here's a patch to fix it: https://gerrit.cloudera.org/c/12939/
> >
> > On Fri, Apr 5, 2019 at 3:40 AM Tim Armstrong <[email protected]>
> > wrote:
> >
> >> This also reminds me of
> https://issues.apache.org/jira/browse/IMPALA-7957
> >> since there's an extra predicate added after the LEFT JOIN.
> >>
> >> On Thu, Apr 4, 2019 at 8:37 AM Quanlong Huang <[email protected]>
> >> wrote:
> >>
> >>> Yes. I'm interested in this and going to look deeper in it tomorrow.
> Just
> >>> filed a JIRA: https://issues.apache.org/jira/browse/IMPALA-8386
> >>>
> >>> On Thu, Apr 4, 2019 at 11:20 PM Todd Lipcon <[email protected]> wrote:
> >>>
> >>>> Sounds like we should file a high priority JIRA (any "wrong results"
> >> bugs
> >>>> should probably be considered critical or blocker). Quanlong, any
> >>> interest
> >>>> in working on this issue?
> >>>>
> >>>> -Todd
> >>>>
> >>>> On Thu, Apr 4, 2019 at 8:17 AM Quanlong Huang <
> [email protected]
> >>>
> >>>> wrote:
> >>>>
> >>>>> +1 for Csaba's analysis. Looks like similiar to this
> >>>>> https://issues.apache.org/jira/browse/IMPALA-3126
> >>>>>
> >>>>> On Thu, Apr 4, 2019 at 11:08 PM Csaba Ringhofer <
> >>>> [email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi!
> >>>>>>
> >>>>>> I have checked the queries, and I can verify that Impala
> >> incorrectly
> >>>>>> returns 1 row while the same query with Hive (or common sense..)
> >>>> returns
> >>>>> 2
> >>>>>> rows.
> >>>>>>
> >>>>>>> "but if remove the "t2.amount2"  like this："
> >>>>>> Indeed, the issue seems to be related to returning the same
> >> aggregate
> >>>>> twice
> >>>>>> + the fact that one of these values is NULL. The planner
> >> introduces a
> >>>>>> predicate that checks if amount1=amount2, which is false, if both
> >>>> values
> >>>>>> are NULL:
> >>>>>>
> >>>>>> explain select * from (select t2.a_id,t2.amount1,t2.amount2
> >>>>>>    from( select a_id from a) t1
> >>>>>>    left outer join (
> >>>>>>        select c.a_id,sum(amount) as amount1,sum(amount) as amount2
> >>>>>>        from b join c  on b.b_id = c.b_id group by c.a_id) t2
> >>>>>>    on t1.a_id = t2.a_id) t;
> >>>>>>
> >>>>>> results in:
> >>>>>> PLAN-ROOT SINK
> >>>>>> |
> >>>>>> 05:HASH JOIN [RIGHT OUTER JOIN]
> >>>>>> |  hash predicates: c.a_id = a_id
> >>>>>> |  other predicates: sum(amount) = sum(amount) <----- I don't know
> >>> why
> >>>>> this
> >>>>>> predicate is added.
> >>>>>> |  runtime filters: RF000 <- a_id
> >>>>>> |  row-size=16B cardinality=2
> >>>>>> ....
> >>>>>>
> >>>>>> If I EXPLAIN the query without the outer select, the sum(amount) =
> >>>>>> sum(amount) is not added, which explains the difference.
> >>>>>>
> >>>>>> I do not know why the planner adds this predicate, my guess is that
> >>>> this
> >>>>> is
> >>>>>> some kind of bug in Impala.
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Apr 4, 2019 at 2:27 PM skyyws <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hi all, I met a problem of left outer join recently, and I
> >>> reproduce
> >>>>> this
> >>>>>>> problem by some simple test data with three tables a, b, c：
> >>>>>>> table A
> >>>>>>> +------+
> >>>>>>> | a_id |
> >>>>>>> +------+
> >>>>>>> | 1    |
> >>>>>>> | 2    |
> >>>>>>> +------+
> >>>>>>> table B
> >>>>>>> +------+--------+
> >>>>>>> | b_id | amount |
> >>>>>>> +------+--------+
> >>>>>>> | 1    | 10     |
> >>>>>>> | 1    | 20     |
> >>>>>>> | 2    | NULL   |
> >>>>>>> +------+--------+
> >>>>>>> table C
> >>>>>>> +------+------+
> >>>>>>> | a_id | b_id |
> >>>>>>> +------+------+
> >>>>>>> | 1    | 1    |
> >>>>>>> | 2    | 2    |
> >>>>>>> +------+------+
> >>>>>>> The sql below：
> >>>>>>> select count(1) from (
> >>>>>>>    select t2.a_id,t2.amount1,t2.amount2
> >>>>>>>    from( select a_id from a) t1
> >>>>>>>    left outer join (
> >>>>>>>        select c.a_id,sum(amount) as amount1,sum(amount) as
> >> amount2
> >>>>>>>        from b join c  on b.b_id = c.b_id group by c.a_id) t2
> >>>>>>>    on t1.a_id = t2.a_id
> >>>>>>> ) t;
> >>>>>>> +----------+
> >>>>>>> | count(1) |
> >>>>>>> +----------+
> >>>>>>> | 1        |
> >>>>>>> +----------+
> >>>>>>> but if remove the "t2.amount2"  like this：
> >>>>>>> select count(1) from (
> >>>>>>>    select t2.a_id,t2.amount1
> >>>>>>>    from( select a_id from a) t1
> >>>>>>>    left outer join (
> >>>>>>>        select c.a_id,sum(amount) as amount1,sum(amount) as
> >> amount2
> >>>>>>>        from b join c  on b.b_id = c.b_id group by c.a_id) t2
> >>>>>>>    on t1.a_id = t2.a_id
> >>>>>>> ) t;
> >>>>>>> +----------+
> >>>>>>> | count(1) |
> >>>>>>> +----------+
> >>>>>>> | 2        |
> >>>>>>> +----------+
> >>>>>>> Here is the result of two subquery without count(1)：
> >>>>>>> +------+---------+---------+
> >>>>>>> | a_id | amount1 | amount2 |
> >>>>>>> +------+---------+---------+
> >>>>>>> | 1    | 30      | 30      |
> >>>>>>> | 2    | NULL    | NULL    |
> >>>>>>> +------+---------+---------+             why the count(1) of this
> >>>>>>> resultset is 1?
> >>>>>>> +------+---------+
> >>>>>>> | a_id | amount1 |
> >>>>>>> +------+---------+
> >>>>>>> | 1    | 30      |
> >>>>>>> | 2    | NULL    |
> >>>>>>> +------+---------+                           why the count(1) of
> >>> this
> >>>>>>> resultset is 2?
> >>>>>>> I want to ask why the first sql return just 1， but second return
> >>> 2，is
> >>>>>> this
> >>>>>>> correct or impala bug？How impala deal with count aggr.?
> >>>>>>> If I change the sum to other aggr. function like count/max/min,
> >>>> result
> >>>>> is
> >>>>>>> same. I test this on 2.12.0 and 3.1.0 version.
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Todd Lipcon
> >>>> Software Engineer, Cloudera
> >>>>
> >>>
> >>
>
>

Re: A problem of left outer join when statement contains two aggregation for the same column

Reply via email to