[ https://issues.apache.org/jira/browse/IMPALA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965505#comment-16965505 ]
Antoni Ivanov edited comment on IMPALA-1369 at 11/2/19 9:43 PM: ---------------------------------------------------------------- Seems to still be an issue with the Impala version we use (2.9 at the moment). Performance of null-aware anti-join are comparable to those of the Nested loop join (way too slow). Are there any plans to address this ? Could this have been addressed with a fix in Impala 3? was (Author: tozka): Seems to still be an issue with the latest version of Impala. Performance of null-aware anti-join are comparable to those of the Nested loop join (way too slow). > Slow query with NAAJs > --------------------- > > Key: IMPALA-1369 > URL: https://issues.apache.org/jira/browse/IMPALA-1369 > Project: IMPALA > Issue Type: Improvement > Components: Perf Investigation > Affects Versions: Impala 2.0 > Reporter: Ippokratis Pandis > Priority: Minor > Labels: performance > > With IMPALA-1351 the query below runs correctly, but it takes too much time, > a few hours in my dev box. I think because essentially the NAAJs become > something like nested loop joins. Postgres runs it in some low tenths of > minutes. > Below it is the query, and the explains of both Impala and Postgres. > {code} > SELECT COUNT(t2.month) AS int_col_1 > FROM alltypesagg t1 CROSS > JOIN alltypesagg t2 FULL > OUTER JOIN alltypestiny t3 ON t3.year = t2.int_col > AND t3.int_col = t2.year > WHERE t1.day NOT IN > (SELECT tt1.year AS int_col_1 > FROM alltypes tt1 > RIGHT JOIN alltypes tt2 ON tt2.month = tt1.year > WHERE t3.tinyint_col = tt2.tinyint_col > AND t3.bigint_col = tt2.bigint_col > AND t3.int_col = tt1.id) > AND t1.day NOT IN > (SELECT ttt1.tinyint_col AS tinyint_col_1 > FROM alltypestiny ttt1 > WHERE t1.tinyint_col = ttt1.month > AND ttt1.month IS NULL > AND ttt1.tinyint_col NOT IN > (SELECT tttt2.int_col AS int_col_1 > FROM alltypesagg tttt1 > RIGHT JOIN alltypes tttt2 ON tttt2.id = tttt1.year) > AND ttt1.month NOT IN > (SELECT MIN(tttt2.year + tttt1.smallint_col * tttt1.int_col - > tttt2.bigint_col) + MAX(tttt2.tinyint_col) AS bigint_col_1 > FROM alltypestiny tttt1 > INNER JOIN alltypestiny tttt2 ON tttt2.month = tttt1.smallint_col) > AND ttt1.month IS NOT NULL > AND t2.smallint_col = ttt1.bigint_col) > {code} > The Impala explain" > {code} > +--------------------------------------------------------------------------------------------------------------------------------------+ > | Explain String > | > +--------------------------------------------------------------------------------------------------------------------------------------+ > | Estimated Per-Host Requirements: Memory=1.66GB VCores=5 > | > | > | > | 36:AGGREGATE [FINALIZE] > | > | | output: count:merge(t2.month) > | > | | > | > | 35:EXCHANGE [UNPARTITIONED] > | > | | > | > | 20:AGGREGATE > | > | | output: count(t2.month) > | > | | > | > | 19:HASH JOIN [NULL AWARE LEFT ANTI JOIN, BROADCAST] > | > | | hash predicates: t1.day = ttt1.tinyint_col > | > | | other join predicates: t2.smallint_col = ttt1.bigint_col, t1.tinyint_col > = ttt1.month | > | | > | > | |--34:EXCHANGE [BROADCAST] > | > | | | > | > | | 15:HASH JOIN [NULL AWARE LEFT ANTI JOIN, BROADCAST] > | > | | | hash predicates: ttt1.month = min(tttt2.year + tttt1.smallint_col * > tttt1.int_col - tttt2.bigint_col) + max(tttt2.tinyint_col) | > | | | > | > | | |--33:EXCHANGE [BROADCAST] > | > | | | | > | > | | | 32:AGGREGATE [FINALIZE] > | > | | | | output: min:merge(tttt2.year + tttt1.smallint_col * tttt1.int_col > - tttt2.bigint_col), max:merge(tttt2.tinyint_col) | > | | | | > | > | | | 31:EXCHANGE [UNPARTITIONED] > | > | | | | > | > | | | 13:AGGREGATE > | > | | | | output: min(tttt2.year + tttt1.smallint_col * tttt1.int_col - > tttt2.bigint_col), max(tttt2.tinyint_col) | > | | | | > | > | | | 12:HASH JOIN [INNER JOIN, BROADCAST] > | > | | | | hash predicates: tttt2.month = tttt1.smallint_col > | > | | | | > | > | | | |--30:EXCHANGE [BROADCAST] > | > | | | | | > | > | | | | 10:SCAN HDFS [functional.alltypestiny tttt1] > | > | | | | partitions=4/4 size=460B > | > | | | | > | > | | | 11:SCAN HDFS [functional.alltypestiny tttt2] > | > | | | partitions=4/4 size=460B > | > | | | > | > | | 14:HASH JOIN [NULL AWARE LEFT ANTI JOIN, BROADCAST] > | > | | | hash predicates: ttt1.tinyint_col = tttt2.int_col > | > | | | > | > | | |--29:EXCHANGE [BROADCAST] > | > | | | | > | > | | | 09:HASH JOIN [LEFT OUTER JOIN, PARTITIONED] > | > | | | | hash predicates: tttt2.id = tttt1.year > | > | | | | > | > | | | |--28:EXCHANGE [HASH(tttt1.year)] > | > | | | | | > | > | | | | 07:SCAN HDFS [functional.alltypesagg tttt1] > | > | | | | partitions=11/11 size=814.73KB > | > | | | | > | > | | | 27:EXCHANGE [HASH(tttt2.id)] > | > | | | | > | > | | | 08:SCAN HDFS [functional.alltypes tttt2] > | > | | | partitions=24/24 size=478.45KB > | > | | | > | > | | 06:SCAN HDFS [functional.alltypestiny ttt1] > | > | | partitions=0/4 size=0B > | > | | > | > | 18:HASH JOIN [NULL AWARE LEFT ANTI JOIN, BROADCAST] > | > | | hash predicates: t1.day = tt1.year > | > | | other join predicates: t3.int_col = tt1.id, t3.bigint_col = > tt2.bigint_col, t3.tinyint_col = tt2.tinyint_col | > | | > | > | |--26:EXCHANGE [BROADCAST] > | > | | | > | > | | 05:HASH JOIN [LEFT OUTER JOIN, PARTITIONED] > | > | | | hash predicates: tt2.month = tt1.year > | > | | | > | > | | |--25:EXCHANGE [HASH(tt1.year)] > | > | | | | > | > | | | 03:SCAN HDFS [functional.alltypes tt1] > | > | | | partitions=24/24 size=478.45KB > | > | | | > | > | | 24:EXCHANGE [HASH(tt2.month)] > | > | | | > | > | | 04:SCAN HDFS [functional.alltypes tt2] > | > | | partitions=24/24 size=478.45KB > | > | | > | > | 17:HASH JOIN [FULL OUTER JOIN, PARTITIONED] > | > | | hash predicates: t2.int_col = t3.year, t2.year = t3.int_col > | > | | > | > | |--23:EXCHANGE [HASH(t3.year,t3.int_col)] > | > | | | > | > | | 02:SCAN HDFS [functional.alltypestiny t3] > | > | | partitions=4/4 size=460B > | > | | > | > | 22:EXCHANGE [HASH(t2.int_col,t2.year)] > | > | | > | > | 16:CROSS JOIN [BROADCAST] > | > | | > | > | |--21:EXCHANGE [BROADCAST] > | > | | | > | > | | 00:SCAN HDFS [functional.alltypesagg t1] > | > | | partitions=11/11 size=814.73KB > | > | | > | > | 01:SCAN HDFS [functional.alltypesagg t2] > | > | partitions=11/11 size=814.73KB > | > +--------------------------------------------------------------------------------------------------------------------------------------+ > {code} > And Postgres's simpler plan, which uses a NLIJ: > {code} > ------------ > Aggregate (cost=104609484.78..104609484.79 rows=1 width=4) > -> Hash Full Join (cost=1.20..104533859.78 rows=30250000 width=4) > Hash Cond: ((t2.int_col = t3.year) AND (t2.year = t3.int_col)) > Filter: ((NOT (SubPlan 1)) AND (NOT (SubPlan 4))) > -> Nested Loop (cost=0.00..1513041.50 rows=121000000 width=23) > -> Seq Scan on alltypesagg t1 (cost=0.00..257.00 rows=11000 > width=9) > -> Materialize (cost=0.00..312.00 rows=11000 width=14) > -> Seq Scan on alltypesagg t2 (cost=0.00..257.00 > rows=11000 width=14) > -> Hash (cost=1.08..1.08 rows=8 width=28) > -> Seq Scan on alltypestiny t3 (cost=0.00..1.08 rows=8 > width=28) > SubPlan 1 > -> Nested Loop (cost=13.13..122.84 rows=1 width=4) > -> Index Scan using functional_alltypes_id on alltypes tt1 > (cost=0.28..8.30 rows=1 width=4) > Index Cond: (t3.int_col = id) > -> Bitmap Heap Scan on alltypes tt2 (cost=12.84..114.48 > rows=6 width=4) > Recheck Cond: (month = tt1.year) > Filter: ((t3.tinyint_col = tinyint_col) AND > (t3.bigint_col = bigint_col)) > -> Bitmap Index Scan on functional_alltypes_month > (cost=0.00..12.84 rows=608 width=0) > Index Cond: (month = tt1.year) > SubPlan 4 > -> Seq Scan on alltypestiny ttt1 (cost=775.99..777.21 rows=1 > width=12) > Filter: ((month IS NULL) AND (month IS NOT NULL) AND > (t2.smallint_col = bigint_col) AND (NOT (hashed SubPlan 2)) AND (NOT (hashed > SubPlan 3)) AND (t1.tinyint_col = (month) > ::numeric)) > SubPlan 2 > -> Hash Right Join (cost=255.25..746.00 rows=11000 > width=4) > Hash Cond: (tttt1.year = tttt2.id) > -> Seq Scan on alltypesagg tttt1 > (cost=0.00..257.00 rows=11000 width=4) > -> Hash (cost=164.00..164.00 rows=7300 width=8) > -> Seq Scan on alltypes tttt2 > (cost=0.00..164.00 rows=7300 width=8) > SubPlan 3 > -> Aggregate (cost=2.47..2.48 rows=1 width=30) > -> Hash Join (cost=1.18..2.37 rows=8 width=30) > Hash Cond: (tttt1_1.smallint_col = > tttt2_1.month) > -> Seq Scan on alltypestiny tttt1_1 > (cost=0.00..1.08 rows=8 width=6) > -> Hash (cost=1.08..1.08 rows=8 width=28) > -> Seq Scan on alltypestiny tttt2_1 > (cost=0.00..1.08 rows=8 width=28) > (35 rows) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org