[
https://issues.apache.org/jira/browse/IMPALA-5504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alexander Behm resolved IMPALA-5504.
------------------------------------
Resolution: Fixed
Fix Version/s: Impala 2.10.0
commit 3e770405e328f883610ce40867f770e588839c13
Author: Alex Behm <[email protected]>
Date: Fri Aug 18 10:36:13 2017 -0700
IMPALA-5504: Fix TupleIsNullPredicate evaluation.
There was a bug in the BE evaluation logic of the
TupleIsNullPredicate which could lead to wrong results
for certain plan shapes.
A TupleIsNullPredicate should evaluate to true only if
all specified tuples are NULL. This was always the intent
of the FE and is also documented in the BE as the required
behavior.
Testing:
- Added regression test
- Core tests passed
Change-Id: Id659f849a68d88cfe22c65dd1747dd6d6a916163
Reviewed-on: http://gerrit.cloudera.org:8080/7737
Reviewed-by: Matthew Jacobs <[email protected]>
Tested-by: Impala Public Jenkins
> wrong results with LEFT JOIN, inline view, and COALESCE()
> ---------------------------------------------------------
>
> Key: IMPALA-5504
> URL: https://issues.apache.org/jira/browse/IMPALA-5504
> Project: IMPALA
> Issue Type: Bug
> Components: Frontend
> Affects Versions: Impala 2.8.0, Impala 2.9.0
> Reporter: Michael Brown
> Assignee: Alexander Behm
> Priority: Blocker
> Labels: correctness, query_generator
> Fix For: Impala 2.10.0
>
> Attachments: profile-coalesce-removed.txt, profile-wrong-results.txt
>
>
> *Summary*
> This query
> {noformat}
> USE tpch;
> SELECT
> COUNT(t1.ps_suppkey)
> FROM partsupp t1
> LEFT JOIN (
> SELECT
> COALESCE(t2.o_custkey, -1) AS coalesce_col
> FROM orders t2
> LEFT JOIN partsupp t3 ON (t2.o_shippriority) = (t3.ps_partkey)
> ) sq ON ((t1.ps_suppkey) = (sq.coalesce_col))
> {noformat}
> produces a different result compared to PostgreSQL. This occurs in both 2.8-
> and 2.9-based builds, so it doesn't seem to be a recent regression.
> The problem is the {{COALESCE()}} function in the inline view and its
> relationship to the {{LEFT JOIN}} in the outer query.
> *Analysis*
> Because {{t2.o_custkey}} is a primary key column and does not contain
> {{NULL}} values:
> {noformat}
> [localhost:21000] > select count(o_custkey) from orders where o_custkey is
> null;
> +------------------+
> | count(o_custkey) |
> +------------------+
> | 0 |
> +------------------+
> Fetched 1 row(s) in 0.64s
> [localhost:21000] >
> {noformat}
> This means I can remove the use of {{COALESCE(t2.o_custkey, -1)}} and just
> replace it with {{t2.o_custkey}}:
> {noformat}
> SELECT
> COUNT(t1.ps_suppkey) c
> FROM partsupp t1
> LEFT JOIN (
> SELECT
> t2.o_custkey AS coalesce_col
> FROM orders t2
> LEFT JOIN partsupp t3 ON (t2.o_shippriority) = (t3.ps_partkey)
> ) sq ON ((t1.ps_suppkey) = (sq.coalesce_col))
> ORDER BY c
> {noformat}
> When I do that, the results end up matching Postgres. This is my grounds for
> justifying this as an Impala bug, not a Postgres bug: either with or without
> the COALESCE(), the results should have been the same.
> Consider the query profiles for the query without the COALESCE, that produces
> the correct results:
> {noformat}
> Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak
> Mem Est. Peak Mem Detail
> -----------------------------------------------------------------------------------------------------------------------------
> 10:AGGREGATE 1 196.528us 196.528us 1 1 24.00
> KB 10.00 MB FINALIZE
> 09:EXCHANGE 1 38.678us 38.678us 2 1
> 0 0 UNPARTITIONED
> 05:AGGREGATE 2 68.671ms 75.556ms 2 1 50.89
> KB 10.00 MB
> 04:HASH JOIN 2 460.980ms 472.610ms 8.23M 800.00K 114.07
> MB 6.71 MB RIGHT OUTER JOIN, PARTITIONED
> |--08:EXCHANGE 2 53.569ms 60.459ms 800.00K 800.00K
> 0 0 HASH((t1.ps_suppkey))
> | 00:SCAN HDFS 1 19.808ms 19.808ms 800.00K 800.00K 33.02
> MB 176.00 MB tpch.partsupp t1
> 07:EXCHANGE 2 91.832ms 101.949ms 1.50M 1.50M
> 0 0 HASH(t2.o_custkey)
> 03:HASH JOIN 2 206.506ms 228.767ms 1.50M 1.50M 178.05
> MB 6.71 MB LEFT OUTER JOIN, BROADCAST
> |--06:EXCHANGE 2 95.342ms 97.771ms 800.00K 800.00K
> 0 0 BROADCAST
> | 02:SCAN HDFS 1 34.058ms 34.058ms 800.00K 800.00K 33.02
> MB 176.00 MB tpch.partsupp t3
> 01:SCAN HDFS 2 93.905ms 101.139ms 1.50M 1.50M 33.25
> MB 176.00 MB tpch.orders t2
> {noformat}
> Now compare this to the profile with the incorrect results:
> {noformat}
> Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak
> Mem Est. Peak Mem Detail
> ------------------------------------------------------------------------------------------------------------------------------
> 10:AGGREGATE 1 198.204us 198.204us 1 1 24.00
> KB 10.00 MB FINALIZE
> 09:EXCHANGE 1 93.555us 93.555us 2 1
> 0 0 UNPARTITIONED
> 05:AGGREGATE 2 3.066ms 3.717ms 2 1 8.04
> MB 10.00 MB
> 04:HASH JOIN 2 140.679ms 236.316ms 800.00K 800.00K 114.06
> MB 6.71 MB RIGHT OUTER JOIN, PARTITIONED
> |--08:EXCHANGE 2 51.184ms 53.408ms 800.00K 800.00K
> 0 0 HASH((t1.ps_suppkey))
> | 00:SCAN HDFS 1 9.601ms 9.601ms 800.00K 800.00K 33.02
> MB 176.00 MB tpch.partsupp t1
> 07:EXCHANGE 2 88.551ms 177.099ms 1.50M 1.50M
> 0 0 HASH(if(TupleIsNull(), NULL...
> 03:HASH JOIN 2 216.025ms 241.128ms 1.50M 1.50M 178.05
> MB 6.71 MB LEFT OUTER JOIN, BROADCAST
> |--06:EXCHANGE 2 93.506ms 105.640ms 800.00K 800.00K
> 0 0 BROADCAST
> | 02:SCAN HDFS 1 112.021ms 112.021ms 800.00K 800.00K 32.63
> MB 176.00 MB tpch.partsupp t3
> 01:SCAN HDFS 2 50.246ms 86.962ms 1.50M 1.50M 33.25
> MB 176.00 MB tpch.orders t2
> {noformat}
> In the second, incorrect results with COALESCE() case, we see that the {{04
> HASH JOIN}} operator processed 800K rows, whereas in the first, without
> COALESCE, it processed 8.23M. It turns out the {{partsupp}} table has exactly
> 800k rows. So in the buggy case, somehow the rows coming in from the
> {{orders}} side are not being processed by the JOIN.
> Note this was originally found by the random query generator; this was the
> original query:
> {noformat}
> USE tpch;
> SELECT
> COALESCE(AVG(a4.o_custkey), 30.0083477187, (COALESCE(a4.o_custkey,
> a1.ps_partkey, a1.ps_suppkey)) * (-65.4342458142)) AS float_col,
> COUNT(a1.ps_suppkey) AS int_col,
> COALESCE(a4.o_custkey, a1.ps_partkey, a1.ps_suppkey) AS int_col_1
> FROM partsupp a1
> LEFT JOIN (
> SELECT
> a2.o_custkey,
> COALESCE(COALESCE(a3.ps_availqty, a3.ps_partkey, a3.ps_suppkey),
> a3.ps_suppkey, COALESCE(a2.o_custkey, a3.ps_availqty)) AS int_col,
> a2.o_orderstatus,
> COALESCE(a2.o_custkey, LEAST(COALESCE(a2.o_custkey, 825),
> COALESCE(COALESCE(a2.o_orderkey, a3.ps_availqty, a2.o_orderkey), 538)),
> a3.ps_availqty) AS int_col_1,
> COALESCE(a2.o_orderdate, a2.o_clerk) AS char_col
> FROM orders a2
> LEFT JOIN partsupp a3 ON (a2.o_shippriority) = (a3.ps_partkey)
> WHERE
> ((a2.o_custkey) IS DISTINCT FROM (a3.ps_partkey)) OR ((a2.o_totalprice) IS
> NULL)
> ) a4 ON (False) OR ((a1.ps_suppkey) = (a4.int_col_1))
> WHERE
> ((a4.int_col_1) IS NOT DISTINCT FROM (a1.ps_suppkey)) AND ((a4.int_col) NOT
> IN (a1.ps_partkey, a1.ps_availqty))
> GROUP BY
> COALESCE(a4.o_custkey, a1.ps_partkey, a1.ps_suppkey)
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)