[
https://issues.apache.org/jira/browse/IMPALA-1369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16966331#comment-16966331
]
Tim Armstrong commented on IMPALA-1369:
---------------------------------------
[~tozka] it's worth a try with a newer version. AFAIK we didn't change the core
algorithm but there have been changes to the join implementation and planner
since 2.9.
This is the first time I've seen this come up on a production query, I'd be
interested in seeing a query profile if it's something you can share.
> Slow query with NAAJs
> ---------------------
>
> Key: IMPALA-1369
> URL: https://issues.apache.org/jira/browse/IMPALA-1369
> Project: IMPALA
> Issue Type: Improvement
> Components: Perf Investigation
> Affects Versions: Impala 2.0
> Reporter: Ippokratis Pandis
> Priority: Minor
> Labels: performance
>
> With IMPALA-1351 the query below runs correctly, but it takes too much time,
> a few hours in my dev box. I think because essentially the NAAJs become
> something like nested loop joins. Postgres runs it in some low tenths of
> minutes.
> Below it is the query, and the explains of both Impala and Postgres.
> {code}
> SELECT COUNT(t2.month) AS int_col_1
> FROM alltypesagg t1 CROSS
> JOIN alltypesagg t2 FULL
> OUTER JOIN alltypestiny t3 ON t3.year = t2.int_col
> AND t3.int_col = t2.year
> WHERE t1.day NOT IN
> (SELECT tt1.year AS int_col_1
> FROM alltypes tt1
> RIGHT JOIN alltypes tt2 ON tt2.month = tt1.year
> WHERE t3.tinyint_col = tt2.tinyint_col
> AND t3.bigint_col = tt2.bigint_col
> AND t3.int_col = tt1.id)
> AND t1.day NOT IN
> (SELECT ttt1.tinyint_col AS tinyint_col_1
> FROM alltypestiny ttt1
> WHERE t1.tinyint_col = ttt1.month
> AND ttt1.month IS NULL
> AND ttt1.tinyint_col NOT IN
> (SELECT tttt2.int_col AS int_col_1
> FROM alltypesagg tttt1
> RIGHT JOIN alltypes tttt2 ON tttt2.id = tttt1.year)
> AND ttt1.month NOT IN
> (SELECT MIN(tttt2.year + tttt1.smallint_col * tttt1.int_col -
> tttt2.bigint_col) + MAX(tttt2.tinyint_col) AS bigint_col_1
> FROM alltypestiny tttt1
> INNER JOIN alltypestiny tttt2 ON tttt2.month = tttt1.smallint_col)
> AND ttt1.month IS NOT NULL
> AND t2.smallint_col = ttt1.bigint_col)
> {code}
> The Impala explain"
> {code}
> +--------------------------------------------------------------------------------------------------------------------------------------+
> | Explain String
> |
> +--------------------------------------------------------------------------------------------------------------------------------------+
> | Estimated Per-Host Requirements: Memory=1.66GB VCores=5
> |
> |
> |
> | 36:AGGREGATE [FINALIZE]
> |
> | | output: count:merge(t2.month)
> |
> | |
> |
> | 35:EXCHANGE [UNPARTITIONED]
> |
> | |
> |
> | 20:AGGREGATE
> |
> | | output: count(t2.month)
> |
> | |
> |
> | 19:HASH JOIN [NULL AWARE LEFT ANTI JOIN, BROADCAST]
> |
> | | hash predicates: t1.day = ttt1.tinyint_col
> |
> | | other join predicates: t2.smallint_col = ttt1.bigint_col, t1.tinyint_col
> = ttt1.month |
> | |
> |
> | |--34:EXCHANGE [BROADCAST]
> |
> | | |
> |
> | | 15:HASH JOIN [NULL AWARE LEFT ANTI JOIN, BROADCAST]
> |
> | | | hash predicates: ttt1.month = min(tttt2.year + tttt1.smallint_col *
> tttt1.int_col - tttt2.bigint_col) + max(tttt2.tinyint_col) |
> | | |
> |
> | | |--33:EXCHANGE [BROADCAST]
> |
> | | | |
> |
> | | | 32:AGGREGATE [FINALIZE]
> |
> | | | | output: min:merge(tttt2.year + tttt1.smallint_col * tttt1.int_col
> - tttt2.bigint_col), max:merge(tttt2.tinyint_col) |
> | | | |
> |
> | | | 31:EXCHANGE [UNPARTITIONED]
> |
> | | | |
> |
> | | | 13:AGGREGATE
> |
> | | | | output: min(tttt2.year + tttt1.smallint_col * tttt1.int_col -
> tttt2.bigint_col), max(tttt2.tinyint_col) |
> | | | |
> |
> | | | 12:HASH JOIN [INNER JOIN, BROADCAST]
> |
> | | | | hash predicates: tttt2.month = tttt1.smallint_col
> |
> | | | |
> |
> | | | |--30:EXCHANGE [BROADCAST]
> |
> | | | | |
> |
> | | | | 10:SCAN HDFS [functional.alltypestiny tttt1]
> |
> | | | | partitions=4/4 size=460B
> |
> | | | |
> |
> | | | 11:SCAN HDFS [functional.alltypestiny tttt2]
> |
> | | | partitions=4/4 size=460B
> |
> | | |
> |
> | | 14:HASH JOIN [NULL AWARE LEFT ANTI JOIN, BROADCAST]
> |
> | | | hash predicates: ttt1.tinyint_col = tttt2.int_col
> |
> | | |
> |
> | | |--29:EXCHANGE [BROADCAST]
> |
> | | | |
> |
> | | | 09:HASH JOIN [LEFT OUTER JOIN, PARTITIONED]
> |
> | | | | hash predicates: tttt2.id = tttt1.year
> |
> | | | |
> |
> | | | |--28:EXCHANGE [HASH(tttt1.year)]
> |
> | | | | |
> |
> | | | | 07:SCAN HDFS [functional.alltypesagg tttt1]
> |
> | | | | partitions=11/11 size=814.73KB
> |
> | | | |
> |
> | | | 27:EXCHANGE [HASH(tttt2.id)]
> |
> | | | |
> |
> | | | 08:SCAN HDFS [functional.alltypes tttt2]
> |
> | | | partitions=24/24 size=478.45KB
> |
> | | |
> |
> | | 06:SCAN HDFS [functional.alltypestiny ttt1]
> |
> | | partitions=0/4 size=0B
> |
> | |
> |
> | 18:HASH JOIN [NULL AWARE LEFT ANTI JOIN, BROADCAST]
> |
> | | hash predicates: t1.day = tt1.year
> |
> | | other join predicates: t3.int_col = tt1.id, t3.bigint_col =
> tt2.bigint_col, t3.tinyint_col = tt2.tinyint_col |
> | |
> |
> | |--26:EXCHANGE [BROADCAST]
> |
> | | |
> |
> | | 05:HASH JOIN [LEFT OUTER JOIN, PARTITIONED]
> |
> | | | hash predicates: tt2.month = tt1.year
> |
> | | |
> |
> | | |--25:EXCHANGE [HASH(tt1.year)]
> |
> | | | |
> |
> | | | 03:SCAN HDFS [functional.alltypes tt1]
> |
> | | | partitions=24/24 size=478.45KB
> |
> | | |
> |
> | | 24:EXCHANGE [HASH(tt2.month)]
> |
> | | |
> |
> | | 04:SCAN HDFS [functional.alltypes tt2]
> |
> | | partitions=24/24 size=478.45KB
> |
> | |
> |
> | 17:HASH JOIN [FULL OUTER JOIN, PARTITIONED]
> |
> | | hash predicates: t2.int_col = t3.year, t2.year = t3.int_col
> |
> | |
> |
> | |--23:EXCHANGE [HASH(t3.year,t3.int_col)]
> |
> | | |
> |
> | | 02:SCAN HDFS [functional.alltypestiny t3]
> |
> | | partitions=4/4 size=460B
> |
> | |
> |
> | 22:EXCHANGE [HASH(t2.int_col,t2.year)]
> |
> | |
> |
> | 16:CROSS JOIN [BROADCAST]
> |
> | |
> |
> | |--21:EXCHANGE [BROADCAST]
> |
> | | |
> |
> | | 00:SCAN HDFS [functional.alltypesagg t1]
> |
> | | partitions=11/11 size=814.73KB
> |
> | |
> |
> | 01:SCAN HDFS [functional.alltypesagg t2]
> |
> | partitions=11/11 size=814.73KB
> |
> +--------------------------------------------------------------------------------------------------------------------------------------+
> {code}
> And Postgres's simpler plan, which uses a NLIJ:
> {code}
> ------------
> Aggregate (cost=104609484.78..104609484.79 rows=1 width=4)
> -> Hash Full Join (cost=1.20..104533859.78 rows=30250000 width=4)
> Hash Cond: ((t2.int_col = t3.year) AND (t2.year = t3.int_col))
> Filter: ((NOT (SubPlan 1)) AND (NOT (SubPlan 4)))
> -> Nested Loop (cost=0.00..1513041.50 rows=121000000 width=23)
> -> Seq Scan on alltypesagg t1 (cost=0.00..257.00 rows=11000
> width=9)
> -> Materialize (cost=0.00..312.00 rows=11000 width=14)
> -> Seq Scan on alltypesagg t2 (cost=0.00..257.00
> rows=11000 width=14)
> -> Hash (cost=1.08..1.08 rows=8 width=28)
> -> Seq Scan on alltypestiny t3 (cost=0.00..1.08 rows=8
> width=28)
> SubPlan 1
> -> Nested Loop (cost=13.13..122.84 rows=1 width=4)
> -> Index Scan using functional_alltypes_id on alltypes tt1
> (cost=0.28..8.30 rows=1 width=4)
> Index Cond: (t3.int_col = id)
> -> Bitmap Heap Scan on alltypes tt2 (cost=12.84..114.48
> rows=6 width=4)
> Recheck Cond: (month = tt1.year)
> Filter: ((t3.tinyint_col = tinyint_col) AND
> (t3.bigint_col = bigint_col))
> -> Bitmap Index Scan on functional_alltypes_month
> (cost=0.00..12.84 rows=608 width=0)
> Index Cond: (month = tt1.year)
> SubPlan 4
> -> Seq Scan on alltypestiny ttt1 (cost=775.99..777.21 rows=1
> width=12)
> Filter: ((month IS NULL) AND (month IS NOT NULL) AND
> (t2.smallint_col = bigint_col) AND (NOT (hashed SubPlan 2)) AND (NOT (hashed
> SubPlan 3)) AND (t1.tinyint_col = (month)
> ::numeric))
> SubPlan 2
> -> Hash Right Join (cost=255.25..746.00 rows=11000
> width=4)
> Hash Cond: (tttt1.year = tttt2.id)
> -> Seq Scan on alltypesagg tttt1
> (cost=0.00..257.00 rows=11000 width=4)
> -> Hash (cost=164.00..164.00 rows=7300 width=8)
> -> Seq Scan on alltypes tttt2
> (cost=0.00..164.00 rows=7300 width=8)
> SubPlan 3
> -> Aggregate (cost=2.47..2.48 rows=1 width=30)
> -> Hash Join (cost=1.18..2.37 rows=8 width=30)
> Hash Cond: (tttt1_1.smallint_col =
> tttt2_1.month)
> -> Seq Scan on alltypestiny tttt1_1
> (cost=0.00..1.08 rows=8 width=6)
> -> Hash (cost=1.08..1.08 rows=8 width=28)
> -> Seq Scan on alltypestiny tttt2_1
> (cost=0.00..1.08 rows=8 width=28)
> (35 rows)
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]