After examining more codes, I have some updates for these two bugs: --------------------- IMPALA-8386 ---------------------
>Predicates inferred by only "alias predicates" should not be created. They will definitely introduce identity predicates. Although we filter them out in SingleNodePlanner#migrateConjunctsToInlineView, the filter logics have exceptions which cause IMPALA-8386. My patch https://gerrit.cloudera.org/c/12939 can just fix the exceptions. It'd be better if we could prevent suck kind of predicates being generated. Predicates created from Analyzer#createEquivConjuncts won't be identities directly. But they can be substituted into identities. The substitution only happens in SingleNodePlanner#migrateConjunctsToInlineView where we migrate the created predicates plus those unassigned predicates into the inline view. So we just need to check potential identity predicates in SingleNodePlanner#migrateConjunctsToInlineView after calling Analyzer#createEquivConjuncts. For the checking, we need to resolve those predicates into base table slots predicates. So I think patch https://gerrit.cloudera.org/c/12939 is sufficient to fix IMPALA-8386. --------------------- IMPALA-7957 --------------------- >Predicates inferred by only inner scope predicates should not be assigned to the outer scope (the opposite direction is ok and act as predicate pushdown). This causes IMPALA-7957. My statement in the last email is wrong. The problem is that slot equivalences are enforced several times. Take this simpler query to reproduce the bug: SELECT a.id FROM functional.alltypestiny a LEFT JOIN (SELECT id, int_col FROM functional.alltypestiny b WHERE int_col = id) t ON (a.id = t.id) UNION ALL VALUES (NULL) Without the "UNION ALL VALUES (NULL)" clause, the plan is correct. With the UNION clause, the plan has a SELECT node executing the predicate "b.id = b.int_col" on top of the LEFT JOIN node, which wrongly rejects rows containing nulls: +------------------------------------------------------------+ | Explain String | +------------------------------------------------------------+ | Max Per-Host Resource Reservation: Memory=1.95MB Threads=5 | | Per-Host Resource Estimates: Memory=162MB | | Codegen disabled by planner | | | | PLAN-ROOT SINK | | | | | 06:EXCHANGE [UNPARTITIONED] | | | | | 00:UNION | | | constant-operands=1 | | | row-size=4B cardinality=2 | | | | | 04:SELECT | | | predicates: b.id = b.int_col <-------------------------- Wrong here! | | row-size=12B cardinality=1 | | | | | 03:HASH JOIN [LEFT OUTER JOIN, BROADCAST] | | | hash predicates: a.id = id | | | row-size=12B cardinality=8 | | | | | |--05:EXCHANGE [BROADCAST] | | | | | | | 02:SCAN HDFS [functional.alltypestiny b] | | | partitions=4/4 files=4 size=460B | | | predicates: int_col = id | | | row-size=8B cardinality=1 | | | | | 01:SCAN HDFS [functional.alltypestiny a] | | partitions=4/4 files=4 size=460B | | row-size=4B cardinality=8 | +------------------------------------------------------------+ Here are how this happen: * After semantic analysis, there're 4 predicates. Two of them are from the query: "b.int_col = b.id", "a.id = t.id". The other two are auxiliary predicates generated for slot equavilance: "t.id = b.id", "t.int_col = b.int_col". * When migrating conjuncts into inline view t (happens in SingleNodePlanner#migrateConjunctsToInlineView), a predicate "t.id = t.int_col" is inferred. Then it's substituted (migrated) to "b.id = b.int_col" and assigned to the scan node for b. When creating this scan node, the redundant inferred predicate "b.id = b.int_col" is removed. (It's redundant to the origin predicate "b.int_col = b.id"). * At the last step of creating the UNION operand for the JOIN side, if there are still unassigned predicates, or we can still inferred some predicates, the planner will create a SELECT node with those predicates on top of the UNION operand. (See SingleNodePlanner#addUnassignedConjuncts). Unfortunately, a predicate "b.id = b.int_col" is inferred again here thus creating the incorrect SELECT node. I just uploaded a patch for review: https://gerrit.cloudera.org/c/13051 . The trace logs it adds can help to reveal this. Can anyone help to review these two patches? * https://gerrit.cloudera.org/c/12939 * https://gerrit.cloudera.org/c/13051 Thanks, Quanlong On Mon, Apr 8, 2019 at 12:02 PM Quanlong Huang <huangquanl...@gmail.com> wrote: > Hi Paul, > > I think you were tracking > https://issues.apache.org/jira/browse/IMPALA-7957 where the predicate "c1 > = c2" in the WHERE clause of inline view t2 is wrongly duplicated to an > upper SELECT node. Its cause differs from IMPALA-8386 in where a wrong > identity predicate "amount = amount" is generated and wrongly rejects > nulls. However, while digging deeper, I agree with you that the root cause > of these two cases may be the same: we should distinguish different kinds > of predicates and use them carefully. > > Take this query as an example: > > select 1 > from functional.alltypes a > inner join > (select id+id as x, tinyint_col, > int_col*int_col as y, bigint_col > from functional.alltypessmall) b > on a.id = b.x and a.id = b.tinyint_col and > a.int_col = b.y and a.int_col = b.bigint_col > > In the sematic analysis phase (happens in Analyzer), some predicates are > generated from the original query (e.g. "a.id = b.x", "a.id = > b.tinyint_col"). Some are generated for slot equivalence between inline > view and the outer scope (e.g. "b.x = id + id", "b.tinyint_col = > tinyint_col". Let's call them "alias predicates"). Then all these > predicates will be used to generate a slot equivalence graph (i.e. > Analyzer.valueTransferGraph. Vertices are slots. Edges mean equivilance). > > Then in the single node plan generation phase (happens in > SingleNodePlanner), Impala will create some predicates base on slot > equivalence (e.g. "id + id = tinyint_col" in the inline view b). This acts > as optimizations like pushing down predicates of the outer scope into the > inline view. If the generated predicates from the outer scope can't be > pushed down to the inline view (due to LIMIT/OFFSET clauses or analytic > functions existence in the inline view like the following query), Impala > will create a SELECT node with these generated predicates on top of the > inline view plan (happens in SingleNodePlanner#addUnassignedConjuncts, > calling Analyzer#createEquivConjuncts). > > select 1 > from functional.alltypes a > inner join > (select id+id as x, tinyint_col, > int_col*int_col as y, bigint_col > from functional.alltypessmall > order by tinyint_col *limit 20*) b > on a.id = b.x and a.id = b.tinyint_col and > a.int_col = b.y and a.int_col = b.bigint_col > > The test cases in these commits better illustrate the behaviors: > > https://github.com/apache/impala/commit/0752b9fa624d3588f3b15dbb5d1bc60d517412f8 > > https://github.com/apache/impala/commit/c8e928119d2787999542e34ca32ccec8d3a82a77 > > Back to our problems, I think the root cause is some information is lost > in the valueTranferGraph so we can't assure these: > * Predicates inferred by only "alias predicates" should not be created. > They will definitely introduce identity predicates. Although we filter them > out in SingleNodePlanner#migrateConjunctsToInlineView, the filter logics > have exceptions which cause IMPALA-8386. My patch ( > https://gerrit.cloudera.org/c/12939/) can just fix the exceptions. It'd > be better if we could prevent suck kind of predicates being generated. > * Predicates inferred by only inner scope predicates should not be > assigned to the outer scope (the opposite direction is ok and act as > predicate pushdown). This causes IMPALA-7957. > > I'm still looking for a thorough solution. Just updates my progress and > welcome more discussions! > > Thanks, > Quanlong > > On Sat, Apr 6, 2019 at 4:48 AM Paul Rogers <prog...@cloudera.com> wrote: > >> Hi Quanlong, >> >> This is a tricky issue. Your query is an outer join. In this case, Impala >> has logic to “repeat” scan predicates at the outer join level: >> >> SELECT a, b >> FROM t1 outer t2 on (t1.id <http://t1.id/> = t2.id <http://t2.id/>) >> WHERE t2.c = 10 >> >> This will cause the predicate “t2.c = 10” to appear both in the scan for >> t2 and in the JOIN node for the outer join. Why? Because we should produce >> the same result whether we filter before or after the join, an outer join >> can introduce a null value for t2.c, and so we need to apply the predicate >> a second time after the join to enforce the filter rule. >> >> In your case, two parts of Impala collide. The first is the dubious use >> of predicates to track aliases. That is, in your query, Impala will create >> a predicate “c.a_id = t1.a_id” to express the equivalence of the “two” >> columns. This is dubious because aliases are just names; they don’t >> actually introduce a new column. >> >> Said another way, an alias is not the same a join-equivalence, though >> Impala seems to attempt to treat them as such. >> >> In any event, the code that handles “unassigned” outer join predicates >> also seems to get triggered for the “phantom” alias predicates, as in your >> case. >> >> I looked into this issue at one point, found a workaround, and moved onto >> other things before I could track the issue down to its root cause in the >> code. Perhaps your fix sheds light on how to make this bit of Impala’s >> implementation work in the general case. >> >> Thanks, >> >> - Paul >> >> > On Apr 5, 2019, at 10:22 AM, Quanlong Huang <huangquanl...@gmail.com> >> wrote: >> > >> > I use another query to reproduce the bug: >> > explain select * from (select t2.a_id,t2.amount1,t2.amount2 >> > from a >> > left outer join ( >> > select c.a_id, amount as amount1, amount as amount2 >> > from b join c on b.b_id = c.b_id) t2 >> > on a.a_id = t2.a_id) t1 >> > >> > The creation of the "amount = amount" predicate should be traced to >> > SingleNodePlanner#migrateConjunctsToInlineView when dealing with inline >> > view t1. It's an optimization that Impala will create some predicates >> based >> > on slot equivalences (i.e. Analyzer#valueTransferGraph). However, these >> > conjuncts may be identity (i.e. things like a = a) which may incorrectly >> > reject rows with nulls. We already have some logic to remove this kind >> of >> > conjuncts but the existing checks have exceptions. It's due to the wrong >> > substitution map is used in checking identity predicates. >> > >> > Here's a patch to fix it: https://gerrit.cloudera.org/c/12939/ >> > >> > On Fri, Apr 5, 2019 at 3:40 AM Tim Armstrong <tarmstr...@cloudera.com> >> > wrote: >> > >> >> This also reminds me of >> https://issues.apache.org/jira/browse/IMPALA-7957 >> >> since there's an extra predicate added after the LEFT JOIN. >> >> >> >> On Thu, Apr 4, 2019 at 8:37 AM Quanlong Huang <huangquanl...@gmail.com >> > >> >> wrote: >> >> >> >>> Yes. I'm interested in this and going to look deeper in it tomorrow. >> Just >> >>> filed a JIRA: https://issues.apache.org/jira/browse/IMPALA-8386 >> >>> >> >>> On Thu, Apr 4, 2019 at 11:20 PM Todd Lipcon <t...@cloudera.com> >> wrote: >> >>> >> >>>> Sounds like we should file a high priority JIRA (any "wrong results" >> >> bugs >> >>>> should probably be considered critical or blocker). Quanlong, any >> >>> interest >> >>>> in working on this issue? >> >>>> >> >>>> -Todd >> >>>> >> >>>> On Thu, Apr 4, 2019 at 8:17 AM Quanlong Huang < >> huangquanl...@gmail.com >> >>> >> >>>> wrote: >> >>>> >> >>>>> +1 for Csaba's analysis. Looks like similiar to this >> >>>>> https://issues.apache.org/jira/browse/IMPALA-3126 >> >>>>> >> >>>>> On Thu, Apr 4, 2019 at 11:08 PM Csaba Ringhofer < >> >>>> csringho...@cloudera.com> >> >>>>> wrote: >> >>>>> >> >>>>>> Hi! >> >>>>>> >> >>>>>> I have checked the queries, and I can verify that Impala >> >> incorrectly >> >>>>>> returns 1 row while the same query with Hive (or common sense..) >> >>>> returns >> >>>>> 2 >> >>>>>> rows. >> >>>>>> >> >>>>>>> "but if remove the "t2.amount2" like this:" >> >>>>>> Indeed, the issue seems to be related to returning the same >> >> aggregate >> >>>>> twice >> >>>>>> + the fact that one of these values is NULL. The planner >> >> introduces a >> >>>>>> predicate that checks if amount1=amount2, which is false, if both >> >>>> values >> >>>>>> are NULL: >> >>>>>> >> >>>>>> explain select * from (select t2.a_id,t2.amount1,t2.amount2 >> >>>>>> from( select a_id from a) t1 >> >>>>>> left outer join ( >> >>>>>> select c.a_id,sum(amount) as amount1,sum(amount) as amount2 >> >>>>>> from b join c on b.b_id = c.b_id group by c.a_id) t2 >> >>>>>> on t1.a_id = t2.a_id) t; >> >>>>>> >> >>>>>> results in: >> >>>>>> PLAN-ROOT SINK >> >>>>>> | >> >>>>>> 05:HASH JOIN [RIGHT OUTER JOIN] >> >>>>>> | hash predicates: c.a_id = a_id >> >>>>>> | other predicates: sum(amount) = sum(amount) <----- I don't know >> >>> why >> >>>>> this >> >>>>>> predicate is added. >> >>>>>> | runtime filters: RF000 <- a_id >> >>>>>> | row-size=16B cardinality=2 >> >>>>>> .... >> >>>>>> >> >>>>>> If I EXPLAIN the query without the outer select, the sum(amount) = >> >>>>>> sum(amount) is not added, which explains the difference. >> >>>>>> >> >>>>>> I do not know why the planner adds this predicate, my guess is that >> >>>> this >> >>>>> is >> >>>>>> some kind of bug in Impala. >> >>>>>> >> >>>>>> >> >>>>>> On Thu, Apr 4, 2019 at 2:27 PM skyyws <sky...@163.com> wrote: >> >>>>>> >> >>>>>>> Hi all, I met a problem of left outer join recently, and I >> >>> reproduce >> >>>>> this >> >>>>>>> problem by some simple test data with three tables a, b, c: >> >>>>>>> table A >> >>>>>>> +------+ >> >>>>>>> | a_id | >> >>>>>>> +------+ >> >>>>>>> | 1 | >> >>>>>>> | 2 | >> >>>>>>> +------+ >> >>>>>>> table B >> >>>>>>> +------+--------+ >> >>>>>>> | b_id | amount | >> >>>>>>> +------+--------+ >> >>>>>>> | 1 | 10 | >> >>>>>>> | 1 | 20 | >> >>>>>>> | 2 | NULL | >> >>>>>>> +------+--------+ >> >>>>>>> table C >> >>>>>>> +------+------+ >> >>>>>>> | a_id | b_id | >> >>>>>>> +------+------+ >> >>>>>>> | 1 | 1 | >> >>>>>>> | 2 | 2 | >> >>>>>>> +------+------+ >> >>>>>>> The sql below: >> >>>>>>> select count(1) from ( >> >>>>>>> select t2.a_id,t2.amount1,t2.amount2 >> >>>>>>> from( select a_id from a) t1 >> >>>>>>> left outer join ( >> >>>>>>> select c.a_id,sum(amount) as amount1,sum(amount) as >> >> amount2 >> >>>>>>> from b join c on b.b_id = c.b_id group by c.a_id) t2 >> >>>>>>> on t1.a_id = t2.a_id >> >>>>>>> ) t; >> >>>>>>> +----------+ >> >>>>>>> | count(1) | >> >>>>>>> +----------+ >> >>>>>>> | 1 | >> >>>>>>> +----------+ >> >>>>>>> but if remove the "t2.amount2" like this: >> >>>>>>> select count(1) from ( >> >>>>>>> select t2.a_id,t2.amount1 >> >>>>>>> from( select a_id from a) t1 >> >>>>>>> left outer join ( >> >>>>>>> select c.a_id,sum(amount) as amount1,sum(amount) as >> >> amount2 >> >>>>>>> from b join c on b.b_id = c.b_id group by c.a_id) t2 >> >>>>>>> on t1.a_id = t2.a_id >> >>>>>>> ) t; >> >>>>>>> +----------+ >> >>>>>>> | count(1) | >> >>>>>>> +----------+ >> >>>>>>> | 2 | >> >>>>>>> +----------+ >> >>>>>>> Here is the result of two subquery without count(1): >> >>>>>>> +------+---------+---------+ >> >>>>>>> | a_id | amount1 | amount2 | >> >>>>>>> +------+---------+---------+ >> >>>>>>> | 1 | 30 | 30 | >> >>>>>>> | 2 | NULL | NULL | >> >>>>>>> +------+---------+---------+ why the count(1) of this >> >>>>>>> resultset is 1? >> >>>>>>> +------+---------+ >> >>>>>>> | a_id | amount1 | >> >>>>>>> +------+---------+ >> >>>>>>> | 1 | 30 | >> >>>>>>> | 2 | NULL | >> >>>>>>> +------+---------+ why the count(1) of >> >>> this >> >>>>>>> resultset is 2? >> >>>>>>> I want to ask why the first sql return just 1, but second return >> >>> 2,is >> >>>>>> this >> >>>>>>> correct or impala bug?How impala deal with count aggr.? >> >>>>>>> If I change the sum to other aggr. function like count/max/min, >> >>>> result >> >>>>> is >> >>>>>>> same. I test this on 2.12.0 and 3.1.0 version. >> >>>>>>> >> >>>>>>> >> >>>>>> >> >>>>> >> >>>> >> >>>> >> >>>> -- >> >>>> Todd Lipcon >> >>>> Software Engineer, Cloudera >> >>>> >> >>> >> >> >> >>