Hi Paul, I think you were tracking https://issues.apache.org/jira/browse/IMPALA-7957 where the predicate "c1 = c2" in the WHERE clause of inline view t2 is wrongly duplicated to an upper SELECT node. Its cause differs from IMPALA-8386 in where a wrong identity predicate "amount = amount" is generated and wrongly rejects nulls. However, while digging deeper, I agree with you that the root cause of these two cases may be the same: we should distinguish different kinds of predicates and use them carefully.
Take this query as an example: select 1 from functional.alltypes a inner join (select id+id as x, tinyint_col, int_col*int_col as y, bigint_col from functional.alltypessmall) b on a.id = b.x and a.id = b.tinyint_col and a.int_col = b.y and a.int_col = b.bigint_col In the sematic analysis phase (happens in Analyzer), some predicates are generated from the original query (e.g. "a.id = b.x", "a.id = b.tinyint_col"). Some are generated for slot equivalence between inline view and the outer scope (e.g. "b.x = id + id", "b.tinyint_col = tinyint_col". Let's call them "alias predicates"). Then all these predicates will be used to generate a slot equivalence graph (i.e. Analyzer.valueTransferGraph. Vertices are slots. Edges mean equivilance). Then in the single node plan generation phase (happens in SingleNodePlanner), Impala will create some predicates base on slot equivalence (e.g. "id + id = tinyint_col" in the inline view b). This acts as optimizations like pushing down predicates of the outer scope into the inline view. If the generated predicates from the outer scope can't be pushed down to the inline view (due to LIMIT/OFFSET clauses or analytic functions existence in the inline view like the following query), Impala will create a SELECT node with these generated predicates on top of the inline view plan (happens in SingleNodePlanner#addUnassignedConjuncts, calling Analyzer#createEquivConjuncts). select 1 from functional.alltypes a inner join (select id+id as x, tinyint_col, int_col*int_col as y, bigint_col from functional.alltypessmall order by tinyint_col *limit 20*) b on a.id = b.x and a.id = b.tinyint_col and a.int_col = b.y and a.int_col = b.bigint_col The test cases in these commits better illustrate the behaviors: https://github.com/apache/impala/commit/0752b9fa624d3588f3b15dbb5d1bc60d517412f8 https://github.com/apache/impala/commit/c8e928119d2787999542e34ca32ccec8d3a82a77 Back to our problems, I think the root cause is some information is lost in the valueTranferGraph so we can't assure these: * Predicates inferred by only "alias predicates" should not be created. They will definitely introduce identity predicates. Although we filter them out in SingleNodePlanner#migrateConjunctsToInlineView, the filter logics have exceptions which cause IMPALA-8386. My patch ( https://gerrit.cloudera.org/c/12939/) can just fix the exceptions. It'd be better if we could prevent suck kind of predicates being generated. * Predicates inferred by only inner scope predicates should not be assigned to the outer scope (the opposite direction is ok and act as predicate pushdown). This causes IMPALA-7957. I'm still looking for a thorough solution. Just updates my progress and welcome more discussions! Thanks, Quanlong On Sat, Apr 6, 2019 at 4:48 AM Paul Rogers <prog...@cloudera.com> wrote: > Hi Quanlong, > > This is a tricky issue. Your query is an outer join. In this case, Impala > has logic to “repeat” scan predicates at the outer join level: > > SELECT a, b > FROM t1 outer t2 on (t1.id <http://t1.id/> = t2.id <http://t2.id/>) > WHERE t2.c = 10 > > This will cause the predicate “t2.c = 10” to appear both in the scan for > t2 and in the JOIN node for the outer join. Why? Because we should produce > the same result whether we filter before or after the join, an outer join > can introduce a null value for t2.c, and so we need to apply the predicate > a second time after the join to enforce the filter rule. > > In your case, two parts of Impala collide. The first is the dubious use of > predicates to track aliases. That is, in your query, Impala will create a > predicate “c.a_id = t1.a_id” to express the equivalence of the “two” > columns. This is dubious because aliases are just names; they don’t > actually introduce a new column. > > Said another way, an alias is not the same a join-equivalence, though > Impala seems to attempt to treat them as such. > > In any event, the code that handles “unassigned” outer join predicates > also seems to get triggered for the “phantom” alias predicates, as in your > case. > > I looked into this issue at one point, found a workaround, and moved onto > other things before I could track the issue down to its root cause in the > code. Perhaps your fix sheds light on how to make this bit of Impala’s > implementation work in the general case. > > Thanks, > > - Paul > > > On Apr 5, 2019, at 10:22 AM, Quanlong Huang <huangquanl...@gmail.com> > wrote: > > > > I use another query to reproduce the bug: > > explain select * from (select t2.a_id,t2.amount1,t2.amount2 > > from a > > left outer join ( > > select c.a_id, amount as amount1, amount as amount2 > > from b join c on b.b_id = c.b_id) t2 > > on a.a_id = t2.a_id) t1 > > > > The creation of the "amount = amount" predicate should be traced to > > SingleNodePlanner#migrateConjunctsToInlineView when dealing with inline > > view t1. It's an optimization that Impala will create some predicates > based > > on slot equivalences (i.e. Analyzer#valueTransferGraph). However, these > > conjuncts may be identity (i.e. things like a = a) which may incorrectly > > reject rows with nulls. We already have some logic to remove this kind of > > conjuncts but the existing checks have exceptions. It's due to the wrong > > substitution map is used in checking identity predicates. > > > > Here's a patch to fix it: https://gerrit.cloudera.org/c/12939/ > > > > On Fri, Apr 5, 2019 at 3:40 AM Tim Armstrong <tarmstr...@cloudera.com> > > wrote: > > > >> This also reminds me of > https://issues.apache.org/jira/browse/IMPALA-7957 > >> since there's an extra predicate added after the LEFT JOIN. > >> > >> On Thu, Apr 4, 2019 at 8:37 AM Quanlong Huang <huangquanl...@gmail.com> > >> wrote: > >> > >>> Yes. I'm interested in this and going to look deeper in it tomorrow. > Just > >>> filed a JIRA: https://issues.apache.org/jira/browse/IMPALA-8386 > >>> > >>> On Thu, Apr 4, 2019 at 11:20 PM Todd Lipcon <t...@cloudera.com> wrote: > >>> > >>>> Sounds like we should file a high priority JIRA (any "wrong results" > >> bugs > >>>> should probably be considered critical or blocker). Quanlong, any > >>> interest > >>>> in working on this issue? > >>>> > >>>> -Todd > >>>> > >>>> On Thu, Apr 4, 2019 at 8:17 AM Quanlong Huang < > huangquanl...@gmail.com > >>> > >>>> wrote: > >>>> > >>>>> +1 for Csaba's analysis. Looks like similiar to this > >>>>> https://issues.apache.org/jira/browse/IMPALA-3126 > >>>>> > >>>>> On Thu, Apr 4, 2019 at 11:08 PM Csaba Ringhofer < > >>>> csringho...@cloudera.com> > >>>>> wrote: > >>>>> > >>>>>> Hi! > >>>>>> > >>>>>> I have checked the queries, and I can verify that Impala > >> incorrectly > >>>>>> returns 1 row while the same query with Hive (or common sense..) > >>>> returns > >>>>> 2 > >>>>>> rows. > >>>>>> > >>>>>>> "but if remove the "t2.amount2" like this:" > >>>>>> Indeed, the issue seems to be related to returning the same > >> aggregate > >>>>> twice > >>>>>> + the fact that one of these values is NULL. The planner > >> introduces a > >>>>>> predicate that checks if amount1=amount2, which is false, if both > >>>> values > >>>>>> are NULL: > >>>>>> > >>>>>> explain select * from (select t2.a_id,t2.amount1,t2.amount2 > >>>>>> from( select a_id from a) t1 > >>>>>> left outer join ( > >>>>>> select c.a_id,sum(amount) as amount1,sum(amount) as amount2 > >>>>>> from b join c on b.b_id = c.b_id group by c.a_id) t2 > >>>>>> on t1.a_id = t2.a_id) t; > >>>>>> > >>>>>> results in: > >>>>>> PLAN-ROOT SINK > >>>>>> | > >>>>>> 05:HASH JOIN [RIGHT OUTER JOIN] > >>>>>> | hash predicates: c.a_id = a_id > >>>>>> | other predicates: sum(amount) = sum(amount) <----- I don't know > >>> why > >>>>> this > >>>>>> predicate is added. > >>>>>> | runtime filters: RF000 <- a_id > >>>>>> | row-size=16B cardinality=2 > >>>>>> .... > >>>>>> > >>>>>> If I EXPLAIN the query without the outer select, the sum(amount) = > >>>>>> sum(amount) is not added, which explains the difference. > >>>>>> > >>>>>> I do not know why the planner adds this predicate, my guess is that > >>>> this > >>>>> is > >>>>>> some kind of bug in Impala. > >>>>>> > >>>>>> > >>>>>> On Thu, Apr 4, 2019 at 2:27 PM skyyws <sky...@163.com> wrote: > >>>>>> > >>>>>>> Hi all, I met a problem of left outer join recently, and I > >>> reproduce > >>>>> this > >>>>>>> problem by some simple test data with three tables a, b, c: > >>>>>>> table A > >>>>>>> +------+ > >>>>>>> | a_id | > >>>>>>> +------+ > >>>>>>> | 1 | > >>>>>>> | 2 | > >>>>>>> +------+ > >>>>>>> table B > >>>>>>> +------+--------+ > >>>>>>> | b_id | amount | > >>>>>>> +------+--------+ > >>>>>>> | 1 | 10 | > >>>>>>> | 1 | 20 | > >>>>>>> | 2 | NULL | > >>>>>>> +------+--------+ > >>>>>>> table C > >>>>>>> +------+------+ > >>>>>>> | a_id | b_id | > >>>>>>> +------+------+ > >>>>>>> | 1 | 1 | > >>>>>>> | 2 | 2 | > >>>>>>> +------+------+ > >>>>>>> The sql below: > >>>>>>> select count(1) from ( > >>>>>>> select t2.a_id,t2.amount1,t2.amount2 > >>>>>>> from( select a_id from a) t1 > >>>>>>> left outer join ( > >>>>>>> select c.a_id,sum(amount) as amount1,sum(amount) as > >> amount2 > >>>>>>> from b join c on b.b_id = c.b_id group by c.a_id) t2 > >>>>>>> on t1.a_id = t2.a_id > >>>>>>> ) t; > >>>>>>> +----------+ > >>>>>>> | count(1) | > >>>>>>> +----------+ > >>>>>>> | 1 | > >>>>>>> +----------+ > >>>>>>> but if remove the "t2.amount2" like this: > >>>>>>> select count(1) from ( > >>>>>>> select t2.a_id,t2.amount1 > >>>>>>> from( select a_id from a) t1 > >>>>>>> left outer join ( > >>>>>>> select c.a_id,sum(amount) as amount1,sum(amount) as > >> amount2 > >>>>>>> from b join c on b.b_id = c.b_id group by c.a_id) t2 > >>>>>>> on t1.a_id = t2.a_id > >>>>>>> ) t; > >>>>>>> +----------+ > >>>>>>> | count(1) | > >>>>>>> +----------+ > >>>>>>> | 2 | > >>>>>>> +----------+ > >>>>>>> Here is the result of two subquery without count(1): > >>>>>>> +------+---------+---------+ > >>>>>>> | a_id | amount1 | amount2 | > >>>>>>> +------+---------+---------+ > >>>>>>> | 1 | 30 | 30 | > >>>>>>> | 2 | NULL | NULL | > >>>>>>> +------+---------+---------+ why the count(1) of this > >>>>>>> resultset is 1? > >>>>>>> +------+---------+ > >>>>>>> | a_id | amount1 | > >>>>>>> +------+---------+ > >>>>>>> | 1 | 30 | > >>>>>>> | 2 | NULL | > >>>>>>> +------+---------+ why the count(1) of > >>> this > >>>>>>> resultset is 2? > >>>>>>> I want to ask why the first sql return just 1, but second return > >>> 2,is > >>>>>> this > >>>>>>> correct or impala bug?How impala deal with count aggr.? > >>>>>>> If I change the sum to other aggr. function like count/max/min, > >>>> result > >>>>> is > >>>>>>> same. I test this on 2.12.0 and 3.1.0 version. > >>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>>> > >>>> -- > >>>> Todd Lipcon > >>>> Software Engineer, Cloudera > >>>> > >>> > >> > >