[
https://issues.apache.org/jira/browse/IMPALA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16965505#comment-16965505
]
Antoni Ivanov edited comment on IMPALA-1369 at 11/2/19 9:43 PM:
----------------------------------------------------------------
Seems to still be an issue with the Impala version we use (2.9 at the moment).
Performance of null-aware anti-join are comparable to those of the Nested loop
join (way too slow).
Are there any plans to address this ? Could this have been addressed with a fix
in Impala 3?
was (Author: tozka):
Seems to still be an issue with the latest version of Impala. Performance of
null-aware anti-join are comparable to those of the Nested loop join (way too
slow).
> Slow query with NAAJs
> ---------------------
>
> Key: IMPALA-1369
> URL: https://issues.apache.org/jira/browse/IMPALA-1369
> Project: IMPALA
> Issue Type: Improvement
> Components: Perf Investigation
> Affects Versions: Impala 2.0
> Reporter: Ippokratis Pandis
> Priority: Minor
> Labels: performance
>
> With IMPALA-1351 the query below runs correctly, but it takes too much time,
> a few hours in my dev box. I think because essentially the NAAJs become
> something like nested loop joins. Postgres runs it in some low tenths of
> minutes.
> Below it is the query, and the explains of both Impala and Postgres.
> {code}
> SELECT COUNT(t2.month) AS int_col_1
> FROM alltypesagg t1 CROSS
> JOIN alltypesagg t2 FULL
> OUTER JOIN alltypestiny t3 ON t3.year = t2.int_col
> AND t3.int_col = t2.year
> WHERE t1.day NOT IN
> (SELECT tt1.year AS int_col_1
> FROM alltypes tt1
> RIGHT JOIN alltypes tt2 ON tt2.month = tt1.year
> WHERE t3.tinyint_col = tt2.tinyint_col
> AND t3.bigint_col = tt2.bigint_col
> AND t3.int_col = tt1.id)
> AND t1.day NOT IN
> (SELECT ttt1.tinyint_col AS tinyint_col_1
> FROM alltypestiny ttt1
> WHERE t1.tinyint_col = ttt1.month
> AND ttt1.month IS NULL
> AND ttt1.tinyint_col NOT IN
> (SELECT tttt2.int_col AS int_col_1
> FROM alltypesagg tttt1
> RIGHT JOIN alltypes tttt2 ON tttt2.id = tttt1.year)
> AND ttt1.month NOT IN
> (SELECT MIN(tttt2.year + tttt1.smallint_col * tttt1.int_col -
> tttt2.bigint_col) + MAX(tttt2.tinyint_col) AS bigint_col_1
> FROM alltypestiny tttt1
> INNER JOIN alltypestiny tttt2 ON tttt2.month = tttt1.smallint_col)
> AND ttt1.month IS NOT NULL
> AND t2.smallint_col = ttt1.bigint_col)
> {code}
> The Impala explain"
> {code}
> +--------------------------------------------------------------------------------------------------------------------------------------+
> | Explain String
> |
> +--------------------------------------------------------------------------------------------------------------------------------------+
> | Estimated Per-Host Requirements: Memory=1.66GB VCores=5
> |
> |
> |
> | 36:AGGREGATE [FINALIZE]
> |
> | | output: count:merge(t2.month)
> |
> | |
> |
> | 35:EXCHANGE [UNPARTITIONED]
> |
> | |
> |
> | 20:AGGREGATE
> |
> | | output: count(t2.month)
> |
> | |
> |
> | 19:HASH JOIN [NULL AWARE LEFT ANTI JOIN, BROADCAST]
> |
> | | hash predicates: t1.day = ttt1.tinyint_col
> |
> | | other join predicates: t2.smallint_col = ttt1.bigint_col, t1.tinyint_col
> = ttt1.month |
> | |
> |
> | |--34:EXCHANGE [BROADCAST]
> |
> | | |
> |
> | | 15:HASH JOIN [NULL AWARE LEFT ANTI JOIN, BROADCAST]
> |
> | | | hash predicates: ttt1.month = min(tttt2.year + tttt1.smallint_col *
> tttt1.int_col - tttt2.bigint_col) + max(tttt2.tinyint_col) |
> | | |
> |
> | | |--33:EXCHANGE [BROADCAST]
> |
> | | | |
> |
> | | | 32:AGGREGATE [FINALIZE]
> |
> | | | | output: min:merge(tttt2.year + tttt1.smallint_col * tttt1.int_col
> - tttt2.bigint_col), max:merge(tttt2.tinyint_col) |
> | | | |
> |
> | | | 31:EXCHANGE [UNPARTITIONED]
> |
> | | | |
> |
> | | | 13:AGGREGATE
> |
> | | | | output: min(tttt2.year + tttt1.smallint_col * tttt1.int_col -
> tttt2.bigint_col), max(tttt2.tinyint_col) |
> | | | |
> |
> | | | 12:HASH JOIN [INNER JOIN, BROADCAST]
> |
> | | | | hash predicates: tttt2.month = tttt1.smallint_col
> |
> | | | |
> |
> | | | |--30:EXCHANGE [BROADCAST]
> |
> | | | | |
> |
> | | | | 10:SCAN HDFS [functional.alltypestiny tttt1]
> |
> | | | | partitions=4/4 size=460B
> |
> | | | |
> |
> | | | 11:SCAN HDFS [functional.alltypestiny tttt2]
> |
> | | | partitions=4/4 size=460B
> |
> | | |
> |
> | | 14:HASH JOIN [NULL AWARE LEFT ANTI JOIN, BROADCAST]
> |
> | | | hash predicates: ttt1.tinyint_col = tttt2.int_col
> |
> | | |
> |
> | | |--29:EXCHANGE [BROADCAST]
> |
> | | | |
> |
> | | | 09:HASH JOIN [LEFT OUTER JOIN, PARTITIONED]
> |
> | | | | hash predicates: tttt2.id = tttt1.year
> |
> | | | |
> |
> | | | |--28:EXCHANGE [HASH(tttt1.year)]
> |
> | | | | |
> |
> | | | | 07:SCAN HDFS [functional.alltypesagg tttt1]
> |
> | | | | partitions=11/11 size=814.73KB
> |
> | | | |
> |
> | | | 27:EXCHANGE [HASH(tttt2.id)]
> |
> | | | |
> |
> | | | 08:SCAN HDFS [functional.alltypes tttt2]
> |
> | | | partitions=24/24 size=478.45KB
> |
> | | |
> |
> | | 06:SCAN HDFS [functional.alltypestiny ttt1]
> |
> | | partitions=0/4 size=0B
> |
> | |
> |
> | 18:HASH JOIN [NULL AWARE LEFT ANTI JOIN, BROADCAST]
> |
> | | hash predicates: t1.day = tt1.year
> |
> | | other join predicates: t3.int_col = tt1.id, t3.bigint_col =
> tt2.bigint_col, t3.tinyint_col = tt2.tinyint_col |
> | |
> |
> | |--26:EXCHANGE [BROADCAST]
> |
> | | |
> |
> | | 05:HASH JOIN [LEFT OUTER JOIN, PARTITIONED]
> |
> | | | hash predicates: tt2.month = tt1.year
> |
> | | |
> |
> | | |--25:EXCHANGE [HASH(tt1.year)]
> |
> | | | |
> |
> | | | 03:SCAN HDFS [functional.alltypes tt1]
> |
> | | | partitions=24/24 size=478.45KB
> |
> | | |
> |
> | | 24:EXCHANGE [HASH(tt2.month)]
> |
> | | |
> |
> | | 04:SCAN HDFS [functional.alltypes tt2]
> |
> | | partitions=24/24 size=478.45KB
> |
> | |
> |
> | 17:HASH JOIN [FULL OUTER JOIN, PARTITIONED]
> |
> | | hash predicates: t2.int_col = t3.year, t2.year = t3.int_col
> |
> | |
> |
> | |--23:EXCHANGE [HASH(t3.year,t3.int_col)]
> |
> | | |
> |
> | | 02:SCAN HDFS [functional.alltypestiny t3]
> |
> | | partitions=4/4 size=460B
> |
> | |
> |
> | 22:EXCHANGE [HASH(t2.int_col,t2.year)]
> |
> | |
> |
> | 16:CROSS JOIN [BROADCAST]
> |
> | |
> |
> | |--21:EXCHANGE [BROADCAST]
> |
> | | |
> |
> | | 00:SCAN HDFS [functional.alltypesagg t1]
> |
> | | partitions=11/11 size=814.73KB
> |
> | |
> |
> | 01:SCAN HDFS [functional.alltypesagg t2]
> |
> | partitions=11/11 size=814.73KB
> |
> +--------------------------------------------------------------------------------------------------------------------------------------+
> {code}
> And Postgres's simpler plan, which uses a NLIJ:
> {code}
> ------------
> Aggregate (cost=104609484.78..104609484.79 rows=1 width=4)
> -> Hash Full Join (cost=1.20..104533859.78 rows=30250000 width=4)
> Hash Cond: ((t2.int_col = t3.year) AND (t2.year = t3.int_col))
> Filter: ((NOT (SubPlan 1)) AND (NOT (SubPlan 4)))
> -> Nested Loop (cost=0.00..1513041.50 rows=121000000 width=23)
> -> Seq Scan on alltypesagg t1 (cost=0.00..257.00 rows=11000
> width=9)
> -> Materialize (cost=0.00..312.00 rows=11000 width=14)
> -> Seq Scan on alltypesagg t2 (cost=0.00..257.00
> rows=11000 width=14)
> -> Hash (cost=1.08..1.08 rows=8 width=28)
> -> Seq Scan on alltypestiny t3 (cost=0.00..1.08 rows=8
> width=28)
> SubPlan 1
> -> Nested Loop (cost=13.13..122.84 rows=1 width=4)
> -> Index Scan using functional_alltypes_id on alltypes tt1
> (cost=0.28..8.30 rows=1 width=4)
> Index Cond: (t3.int_col = id)
> -> Bitmap Heap Scan on alltypes tt2 (cost=12.84..114.48
> rows=6 width=4)
> Recheck Cond: (month = tt1.year)
> Filter: ((t3.tinyint_col = tinyint_col) AND
> (t3.bigint_col = bigint_col))
> -> Bitmap Index Scan on functional_alltypes_month
> (cost=0.00..12.84 rows=608 width=0)
> Index Cond: (month = tt1.year)
> SubPlan 4
> -> Seq Scan on alltypestiny ttt1 (cost=775.99..777.21 rows=1
> width=12)
> Filter: ((month IS NULL) AND (month IS NOT NULL) AND
> (t2.smallint_col = bigint_col) AND (NOT (hashed SubPlan 2)) AND (NOT (hashed
> SubPlan 3)) AND (t1.tinyint_col = (month)
> ::numeric))
> SubPlan 2
> -> Hash Right Join (cost=255.25..746.00 rows=11000
> width=4)
> Hash Cond: (tttt1.year = tttt2.id)
> -> Seq Scan on alltypesagg tttt1
> (cost=0.00..257.00 rows=11000 width=4)
> -> Hash (cost=164.00..164.00 rows=7300 width=8)
> -> Seq Scan on alltypes tttt2
> (cost=0.00..164.00 rows=7300 width=8)
> SubPlan 3
> -> Aggregate (cost=2.47..2.48 rows=1 width=30)
> -> Hash Join (cost=1.18..2.37 rows=8 width=30)
> Hash Cond: (tttt1_1.smallint_col =
> tttt2_1.month)
> -> Seq Scan on alltypestiny tttt1_1
> (cost=0.00..1.08 rows=8 width=6)
> -> Hash (cost=1.08..1.08 rows=8 width=28)
> -> Seq Scan on alltypestiny tttt2_1
> (cost=0.00..1.08 rows=8 width=28)
> (35 rows)
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]