[jira] [Resolved] (IMPALA-5504) wrong results with LEFT JOIN, inline view, and COALESCE()

Alexander Behm (JIRA) Mon, 21 Aug 2017 15:09:20 -0700

     [ 
https://issues.apache.org/jira/browse/IMPALA-5504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Alexander Behm resolved IMPALA-5504.
------------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.10.0

commit 3e770405e328f883610ce40867f770e588839c13
Author: Alex Behm <[email protected]>
Date:   Fri Aug 18 10:36:13 2017 -0700

    IMPALA-5504: Fix TupleIsNullPredicate evaluation.
    
    There was a bug in the BE evaluation logic of the
    TupleIsNullPredicate which could lead to wrong results
    for certain plan shapes.
    
    A TupleIsNullPredicate should evaluate to true only if
    all specified tuples are NULL. This was always the intent
    of the FE and is also documented in the BE as the required
    behavior.
    
    Testing:
    - Added regression test
    - Core tests passed
    
    Change-Id: Id659f849a68d88cfe22c65dd1747dd6d6a916163
    Reviewed-on: http://gerrit.cloudera.org:8080/7737
    Reviewed-by: Matthew Jacobs <[email protected]>
    Tested-by: Impala Public Jenkins


> wrong results with LEFT JOIN, inline view, and COALESCE()
> ---------------------------------------------------------
>
>                 Key: IMPALA-5504
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5504
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 2.8.0, Impala 2.9.0
>            Reporter: Michael Brown
>            Assignee: Alexander Behm
>            Priority: Blocker
>              Labels: correctness, query_generator
>             Fix For: Impala 2.10.0
>
>         Attachments: profile-coalesce-removed.txt, profile-wrong-results.txt
>
>
> *Summary*
> This query
> {noformat}
> USE tpch;
> SELECT
> COUNT(t1.ps_suppkey)
> FROM partsupp t1
> LEFT JOIN  (
> SELECT
> COALESCE(t2.o_custkey, -1) AS coalesce_col
> FROM orders t2
> LEFT JOIN partsupp t3 ON (t2.o_shippriority) = (t3.ps_partkey)
> ) sq ON ((t1.ps_suppkey) = (sq.coalesce_col))
> {noformat}
> produces a different result compared to PostgreSQL. This occurs in both 2.8- 
> and 2.9-based builds, so it doesn't seem to be a recent regression.
> The problem is the {{COALESCE()}} function in the inline view and its 
> relationship to the {{LEFT JOIN}} in the outer query.
> *Analysis*
> Because {{t2.o_custkey}} is a primary key column and does not contain 
> {{NULL}} values:
> {noformat}
> [localhost:21000] > select count(o_custkey) from orders where o_custkey is 
> null;
> +------------------+
> | count(o_custkey) |
> +------------------+
> | 0                |
> +------------------+
> Fetched 1 row(s) in 0.64s
> [localhost:21000] >
> {noformat}
> This means I can remove the use of {{COALESCE(t2.o_custkey, -1)}} and just 
> replace it with {{t2.o_custkey}}:
> {noformat}
> SELECT
> COUNT(t1.ps_suppkey) c
> FROM partsupp t1
> LEFT JOIN  (
> SELECT
> t2.o_custkey AS coalesce_col
> FROM orders t2
> LEFT JOIN partsupp t3 ON (t2.o_shippriority) = (t3.ps_partkey)
> ) sq ON ((t1.ps_suppkey) = (sq.coalesce_col))
> ORDER BY c
> {noformat}
> When I do that, the results end up matching Postgres. This is my grounds for 
> justifying this as an Impala bug, not a Postgres bug: either with or without 
> the COALESCE(), the results should have been the same.
> Consider the query profiles for the query without the COALESCE, that produces 
> the correct results:
> {noformat}
> Operator          #Hosts   Avg Time   Max Time    #Rows  Est. #Rows   Peak 
> Mem  Est. Peak Mem  Detail                        
> -----------------------------------------------------------------------------------------------------------------------------
> 10:AGGREGATE           1  196.528us  196.528us        1           1   24.00 
> KB       10.00 MB  FINALIZE                      
> 09:EXCHANGE            1   38.678us   38.678us        2           1          
> 0              0  UNPARTITIONED                 
> 05:AGGREGATE           2   68.671ms   75.556ms        2           1   50.89 
> KB       10.00 MB                                
> 04:HASH JOIN           2  460.980ms  472.610ms    8.23M     800.00K  114.07 
> MB        6.71 MB  RIGHT OUTER JOIN, PARTITIONED 
> |--08:EXCHANGE         2   53.569ms   60.459ms  800.00K     800.00K          
> 0              0  HASH((t1.ps_suppkey))         
> |  00:SCAN HDFS        1   19.808ms   19.808ms  800.00K     800.00K   33.02 
> MB      176.00 MB  tpch.partsupp t1              
> 07:EXCHANGE            2   91.832ms  101.949ms    1.50M       1.50M          
> 0              0  HASH(t2.o_custkey)            
> 03:HASH JOIN           2  206.506ms  228.767ms    1.50M       1.50M  178.05 
> MB        6.71 MB  LEFT OUTER JOIN, BROADCAST    
> |--06:EXCHANGE         2   95.342ms   97.771ms  800.00K     800.00K          
> 0              0  BROADCAST                     
> |  02:SCAN HDFS        1   34.058ms   34.058ms  800.00K     800.00K   33.02 
> MB      176.00 MB  tpch.partsupp t3              
> 01:SCAN HDFS           2   93.905ms  101.139ms    1.50M       1.50M   33.25 
> MB      176.00 MB  tpch.orders t2   
> {noformat}
> Now compare this to the profile with the incorrect results:
> {noformat}
> Operator          #Hosts   Avg Time   Max Time    #Rows  Est. #Rows   Peak 
> Mem  Est. Peak Mem  Detail                         
> ------------------------------------------------------------------------------------------------------------------------------
> 10:AGGREGATE           1  198.204us  198.204us        1           1   24.00 
> KB       10.00 MB  FINALIZE                       
> 09:EXCHANGE            1   93.555us   93.555us        2           1          
> 0              0  UNPARTITIONED                  
> 05:AGGREGATE           2    3.066ms    3.717ms        2           1    8.04 
> MB       10.00 MB                                 
> 04:HASH JOIN           2  140.679ms  236.316ms  800.00K     800.00K  114.06 
> MB        6.71 MB  RIGHT OUTER JOIN, PARTITIONED  
> |--08:EXCHANGE         2   51.184ms   53.408ms  800.00K     800.00K          
> 0              0  HASH((t1.ps_suppkey))          
> |  00:SCAN HDFS        1    9.601ms    9.601ms  800.00K     800.00K   33.02 
> MB      176.00 MB  tpch.partsupp t1               
> 07:EXCHANGE            2   88.551ms  177.099ms    1.50M       1.50M          
> 0              0  HASH(if(TupleIsNull(), NULL... 
> 03:HASH JOIN           2  216.025ms  241.128ms    1.50M       1.50M  178.05 
> MB        6.71 MB  LEFT OUTER JOIN, BROADCAST     
> |--06:EXCHANGE         2   93.506ms  105.640ms  800.00K     800.00K          
> 0              0  BROADCAST                      
> |  02:SCAN HDFS        1  112.021ms  112.021ms  800.00K     800.00K   32.63 
> MB      176.00 MB  tpch.partsupp t3               
> 01:SCAN HDFS           2   50.246ms   86.962ms    1.50M       1.50M   33.25 
> MB      176.00 MB  tpch.orders t2   
> {noformat}
> In the second, incorrect results with COALESCE() case, we see that the {{04 
> HASH JOIN}} operator processed 800K rows, whereas in the first, without 
> COALESCE, it processed 8.23M. It turns out the {{partsupp}} table has exactly 
> 800k rows. So in the buggy case, somehow the rows coming in from the 
> {{orders}} side are not being processed by the JOIN.
> Note this was originally found by the random query generator; this was the 
> original query:
> {noformat}
> USE tpch;
> SELECT
> COALESCE(AVG(a4.o_custkey), 30.0083477187, (COALESCE(a4.o_custkey, 
> a1.ps_partkey, a1.ps_suppkey)) * (-65.4342458142)) AS float_col,
> COUNT(a1.ps_suppkey) AS int_col,
> COALESCE(a4.o_custkey, a1.ps_partkey, a1.ps_suppkey) AS int_col_1
> FROM partsupp a1
> LEFT JOIN (
> SELECT
> a2.o_custkey,
> COALESCE(COALESCE(a3.ps_availqty, a3.ps_partkey, a3.ps_suppkey), 
> a3.ps_suppkey, COALESCE(a2.o_custkey, a3.ps_availqty)) AS int_col,
> a2.o_orderstatus,
> COALESCE(a2.o_custkey, LEAST(COALESCE(a2.o_custkey, 825), 
> COALESCE(COALESCE(a2.o_orderkey, a3.ps_availqty, a2.o_orderkey), 538)), 
> a3.ps_availqty) AS int_col_1,
> COALESCE(a2.o_orderdate, a2.o_clerk) AS char_col
> FROM orders a2
> LEFT JOIN partsupp a3 ON (a2.o_shippriority) = (a3.ps_partkey)
> WHERE
> ((a2.o_custkey) IS DISTINCT FROM (a3.ps_partkey)) OR ((a2.o_totalprice) IS 
> NULL)
> ) a4 ON (False) OR ((a1.ps_suppkey) = (a4.int_col_1))
> WHERE
> ((a4.int_col_1) IS NOT DISTINCT FROM (a1.ps_suppkey)) AND ((a4.int_col) NOT 
> IN (a1.ps_partkey, a1.ps_availqty))
> GROUP BY
> COALESCE(a4.o_custkey, a1.ps_partkey, a1.ps_suppkey)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Resolved] (IMPALA-5504) wrong results with LEFT JOIN, inline view, and COALESCE()

Reply via email to