[jira] [Comment Edited] (HIVE-17148) Incorrect result for Hive join query with COALESCE in WHERE condition

Vlad Gudikov (JIRA) Fri, 28 Jul 2017 06:09:33 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-17148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104919#comment-16104919
 ]


Vlad Gudikov edited comment on HIVE-17148 at 7/28/17 1:08 PM:
--------------------------------------------------------------

ROOT-CAUSE:
The problem was with the predicates that were created according to 
HiveJoinAddNotNullRule. This rule is creating predicates from fields that take 
part in join filter, no matter if this fields are used as parameters of 
functions or not.

SOLUTION:
Create predicate based on functions that take part in filters as well as 
fields. The point is to check if left part and right part of the filter is not 
null, not just fields that are part of the join filter. I.e we have to tables 
*test1(a1 int, a2 int)* and *test2(b1)*. When we execute following query 
*select * from ct1 c1 inner join ct2 c2 on (COALESCE(a1,b1)=a2);* we get to 
predicates for filter operator:
b1 is not null --- right part 
a1 is not null and a2 is not null -- left part

Applying predicate for left part of join will result in data loss as we exclude 
rows with null fields. COALESCE is a good example for this case as the main 
purpose of COALESCE function is to get not null values from tables. To fix the 
data loss we need to check that coalesce won't bring us null values as we can't 
join nulls. My fix will check that left part and right part will look like:

b1 is not null -- right part (still checking fields on null condition)
COALESCE(a1,a2) is not null (checking that whole function won't bring us null 
values)

In next patch I'm going to change related failed tests with the fixed stage 
plans.



was (Author: allgoodok):
ROOT-CAUSE:
The problem was with the predicates that were created according to 
HiveJoinAddNotNullRule. This rule is creating predicates from fields that take 
part in join filter, no matter if this fields are used as parameters of 
functions or not.

SOLUTION:
Create predicate based on functions that take part in filters as well as 
fields. The point is to check if left part and right part of the filter is not 
null, not just fields that are part of the join filter. I.e we have to tables 
test1(a1 int, a2 int) and test2(b1). When we execute following query *select * 
from ct1 c1 inner join ct2 c2 on (COALESCE(a1,b1)=a2);* we get to predicates 
for filter operator:
b1 is not null --- right part 
a1 is not null and a2 is not null -- left part

Applying predicate for left part of join will result in data loss as we exclude 
rows with null fields. COALESCE is a good example for this case as the main 
purpose of COALESCE function is to get not null values from tables. To fix the 
data loss we need to check that coalesce won't bring us null values as we can't 
join nulls. My fix will check that left part and right part will look like:

b1 is not null -- right part (still checking fields on null condition)
COALESCE(a1,a2) is not null (checking that whole function won't bring us null 
values)

In next patch I'm going to change related failed tests with the fixed stage 
plans.


> Incorrect result for Hive join query with COALESCE in WHERE condition
> ---------------------------------------------------------------------
>
>                 Key: HIVE-17148
>                 URL: https://issues.apache.org/jira/browse/HIVE-17148
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>    Affects Versions: 2.1.1
>            Reporter: Vlad Gudikov
>            Assignee: Vlad Gudikov
>         Attachments: HIVE-17148.patch
>
>
> The issue exists in Hive-2.1. In Hive-1.2 the query works fine with cbo 
> enabled:
> STEPS TO REPRODUCE:
> {code}
> Step 1: Create a table ct1
> create table ct1 (a1 string,b1 string);
> Step 2: Create a table ct2
> create table ct2 (a2 string);
> Step 3 : Insert following data into table ct1
> insert into table ct1 (a1) values ('1');
> Step 4 : Insert following data into table ct2
> insert into table ct2 (a2) values ('1');
> Step 5 : Execute the following query 
> select * from ct1 c1, ct2 c2 where COALESCE(a1,b1)=a2;
> {code}
> ACTUAL RESULT:
> {code}
> The query returns nothing;
> {code}
> EXPECTED RESULT:
> {code}
> 1       NULL    1
> {code}
> The issue seems to be because of the incorrect query plan. In the plan we can 
> see:
> predicate:(a1 is not null and b1 is not null)
> which does not look correct. As a result, it is filtering out all the rows is 
> any column mentioned in the COALESCE has null value.
> Please find the query plan below:
> {code}
> Plan optimized by CBO.
> Vertex dependency in root stage
> Map 1 <- Map 2 (BROADCAST_EDGE)
> Stage-0
>   Fetch Operator
>     limit:-1
>     Stage-1
>       Map 1
>       File Output Operator [FS_10]
>         Map Join Operator [MAPJOIN_15] (rows=1 width=4)
>           
> Conds:SEL_2.COALESCE(_col0,_col1)=RS_7._col0(Inner),HybridGraceHashJoin:true,Output:["_col0","_col1","_col2"]
>         <-Map 2 [BROADCAST_EDGE]
>           BROADCAST [RS_7]
>             PartitionCols:_col0
>             Select Operator [SEL_5] (rows=1 width=1)
>               Output:["_col0"]
>               Filter Operator [FIL_14] (rows=1 width=1)
>                 predicate:a2 is not null
>                 TableScan [TS_3] (rows=1 width=1)
>                   default@ct2,c2,Tbl:COMPLETE,Col:NONE,Output:["a2"]
>         <-Select Operator [SEL_2] (rows=1 width=4)
>             Output:["_col0","_col1"]
>             Filter Operator [FIL_13] (rows=1 width=4)
>               predicate:(a1 is not null and b1 is not null)
>               TableScan [TS_0] (rows=1 width=4)
>                 default@ct1,c1,Tbl:COMPLETE,Col:NONE,Output:["a1","b1"]
> {code}
> This happens only if join is inner type, otherwise HiveJoinAddNotRule which 
> creates this problem is skipped.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Comment Edited] (HIVE-17148) Incorrect result for Hive join query with COALESCE in WHERE condition

Reply via email to