Re: A problem of left outer join when statement contains two aggregation for the same column

Paul Rogers Fri, 05 Apr 2019 13:48:45 -0700

Hi Quanlong,

This is a tricky issue. Your query is an outer join. In this case, Impala has 
logic to “repeat” scan predicates at the outer join level:


SELECT a, b
  FROM t1 outer t2 on (t1.id <http://t1.id/> = t2.id <http://t2.id/>)
  WHERE t2.c = 10

This will cause the predicate “t2.c = 10” to appear both in the scan for t2 and 
in the JOIN node for the outer join. Why? Because we should produce the same 
result whether we filter before or after the join, an outer join can introduce 
a null value for t2.c, and so we need to apply the predicate a second time 
after the join to enforce the filter rule.

In your case, two parts of Impala collide. The first is the dubious use of 
predicates to track aliases. That is, in your query, Impala will create a 
predicate “c.a_id = t1.a_id” to express the equivalence of the “two” columns. 
This is dubious because aliases are just names; they don’t actually introduce a 
new column.

Said another way, an alias is not the same a join-equivalence, though Impala 
seems to attempt to treat them as such.

In any event, the code that handles “unassigned” outer join predicates also 
seems to get triggered for the “phantom” alias predicates, as in your case.

I looked into this issue at one point, found a workaround, and moved onto other 
things before I could track the issue down to its root cause in the code. 
Perhaps your fix sheds light on how to make this bit of Impala’s implementation 
work in the general case.

Thanks,

- Paul

> On Apr 5, 2019, at 10:22 AM, Quanlong Huang <[email protected]> wrote:
> 
> I use another query to reproduce the bug:
> explain select * from (select t2.a_id,t2.amount1,t2.amount2
>    from a
>    left outer join (
>        select c.a_id, amount as amount1, amount as amount2
>        from b join c  on b.b_id = c.b_id) t2
>    on a.a_id = t2.a_id) t1
> 
> The creation of the "amount = amount" predicate should be traced to
> SingleNodePlanner#migrateConjunctsToInlineView when dealing with inline
> view t1. It's an optimization that Impala will create some predicates based
> on slot equivalences (i.e. Analyzer#valueTransferGraph). However, these
> conjuncts may be identity (i.e. things like a = a) which may incorrectly
> reject rows with nulls. We already have some logic to remove this kind of
> conjuncts but the existing checks have exceptions. It's due to the wrong
> substitution map is used in checking identity predicates.
> 
> Here's a patch to fix it: https://gerrit.cloudera.org/c/12939/
> 
> On Fri, Apr 5, 2019 at 3:40 AM Tim Armstrong <[email protected]>
> wrote:
> 
>> This also reminds me of https://issues.apache.org/jira/browse/IMPALA-7957
>> since there's an extra predicate added after the LEFT JOIN.
>> 
>> On Thu, Apr 4, 2019 at 8:37 AM Quanlong Huang <[email protected]>
>> wrote:
>> 
>>> Yes. I'm interested in this and going to look deeper in it tomorrow. Just
>>> filed a JIRA: https://issues.apache.org/jira/browse/IMPALA-8386
>>> 
>>> On Thu, Apr 4, 2019 at 11:20 PM Todd Lipcon <[email protected]> wrote:
>>> 
>>>> Sounds like we should file a high priority JIRA (any "wrong results"
>> bugs
>>>> should probably be considered critical or blocker). Quanlong, any
>>> interest
>>>> in working on this issue?
>>>> 
>>>> -Todd
>>>> 
>>>> On Thu, Apr 4, 2019 at 8:17 AM Quanlong Huang <[email protected]
>>> 
>>>> wrote:
>>>> 
>>>>> +1 for Csaba's analysis. Looks like similiar to this
>>>>> https://issues.apache.org/jira/browse/IMPALA-3126
>>>>> 
>>>>> On Thu, Apr 4, 2019 at 11:08 PM Csaba Ringhofer <
>>>> [email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Hi!
>>>>>> 
>>>>>> I have checked the queries, and I can verify that Impala
>> incorrectly
>>>>>> returns 1 row while the same query with Hive (or common sense..)
>>>> returns
>>>>> 2
>>>>>> rows.
>>>>>> 
>>>>>>> "but if remove the "t2.amount2"  like this："
>>>>>> Indeed, the issue seems to be related to returning the same
>> aggregate
>>>>> twice
>>>>>> + the fact that one of these values is NULL. The planner
>> introduces a
>>>>>> predicate that checks if amount1=amount2, which is false, if both
>>>> values
>>>>>> are NULL:
>>>>>> 
>>>>>> explain select * from (select t2.a_id,t2.amount1,t2.amount2
>>>>>>    from( select a_id from a) t1
>>>>>>    left outer join (
>>>>>>        select c.a_id,sum(amount) as amount1,sum(amount) as amount2
>>>>>>        from b join c  on b.b_id = c.b_id group by c.a_id) t2
>>>>>>    on t1.a_id = t2.a_id) t;
>>>>>> 
>>>>>> results in:
>>>>>> PLAN-ROOT SINK
>>>>>> |
>>>>>> 05:HASH JOIN [RIGHT OUTER JOIN]
>>>>>> |  hash predicates: c.a_id = a_id
>>>>>> |  other predicates: sum(amount) = sum(amount) <----- I don't know
>>> why
>>>>> this
>>>>>> predicate is added.
>>>>>> |  runtime filters: RF000 <- a_id
>>>>>> |  row-size=16B cardinality=2
>>>>>> ....
>>>>>> 
>>>>>> If I EXPLAIN the query without the outer select, the sum(amount) =
>>>>>> sum(amount) is not added, which explains the difference.
>>>>>> 
>>>>>> I do not know why the planner adds this predicate, my guess is that
>>>> this
>>>>> is
>>>>>> some kind of bug in Impala.
>>>>>> 
>>>>>> 
>>>>>> On Thu, Apr 4, 2019 at 2:27 PM skyyws <[email protected]> wrote:
>>>>>> 
>>>>>>> Hi all, I met a problem of left outer join recently, and I
>>> reproduce
>>>>> this
>>>>>>> problem by some simple test data with three tables a, b, c：
>>>>>>> table A
>>>>>>> +------+
>>>>>>> | a_id |
>>>>>>> +------+
>>>>>>> | 1    |
>>>>>>> | 2    |
>>>>>>> +------+
>>>>>>> table B
>>>>>>> +------+--------+
>>>>>>> | b_id | amount |
>>>>>>> +------+--------+
>>>>>>> | 1    | 10     |
>>>>>>> | 1    | 20     |
>>>>>>> | 2    | NULL   |
>>>>>>> +------+--------+
>>>>>>> table C
>>>>>>> +------+------+
>>>>>>> | a_id | b_id |
>>>>>>> +------+------+
>>>>>>> | 1    | 1    |
>>>>>>> | 2    | 2    |
>>>>>>> +------+------+
>>>>>>> The sql below：
>>>>>>> select count(1) from (
>>>>>>>    select t2.a_id,t2.amount1,t2.amount2
>>>>>>>    from( select a_id from a) t1
>>>>>>>    left outer join (
>>>>>>>        select c.a_id,sum(amount) as amount1,sum(amount) as
>> amount2
>>>>>>>        from b join c  on b.b_id = c.b_id group by c.a_id) t2
>>>>>>>    on t1.a_id = t2.a_id
>>>>>>> ) t;
>>>>>>> +----------+
>>>>>>> | count(1) |
>>>>>>> +----------+
>>>>>>> | 1        |
>>>>>>> +----------+
>>>>>>> but if remove the "t2.amount2"  like this：
>>>>>>> select count(1) from (
>>>>>>>    select t2.a_id,t2.amount1
>>>>>>>    from( select a_id from a) t1
>>>>>>>    left outer join (
>>>>>>>        select c.a_id,sum(amount) as amount1,sum(amount) as
>> amount2
>>>>>>>        from b join c  on b.b_id = c.b_id group by c.a_id) t2
>>>>>>>    on t1.a_id = t2.a_id
>>>>>>> ) t;
>>>>>>> +----------+
>>>>>>> | count(1) |
>>>>>>> +----------+
>>>>>>> | 2        |
>>>>>>> +----------+
>>>>>>> Here is the result of two subquery without count(1)：
>>>>>>> +------+---------+---------+
>>>>>>> | a_id | amount1 | amount2 |
>>>>>>> +------+---------+---------+
>>>>>>> | 1    | 30      | 30      |
>>>>>>> | 2    | NULL    | NULL    |
>>>>>>> +------+---------+---------+             why the count(1) of this
>>>>>>> resultset is 1?
>>>>>>> +------+---------+
>>>>>>> | a_id | amount1 |
>>>>>>> +------+---------+
>>>>>>> | 1    | 30      |
>>>>>>> | 2    | NULL    |
>>>>>>> +------+---------+                           why the count(1) of
>>> this
>>>>>>> resultset is 2?
>>>>>>> I want to ask why the first sql return just 1， but second return
>>> 2，is
>>>>>> this
>>>>>>> correct or impala bug？How impala deal with count aggr.?
>>>>>>> If I change the sum to other aggr. function like count/max/min,
>>>> result
>>>>> is
>>>>>>> same. I test this on 2.12.0 and 3.1.0 version.
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>>> 
>>> 
>>

Re: A problem of left outer join when statement contains two aggregation for the same column

Reply via email to