[
https://issues.apache.org/jira/browse/HIVE-17148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104919#comment-16104919
]
Vlad Gudikov edited comment on HIVE-17148 at 7/28/17 1:08 PM:
--------------------------------------------------------------
ROOT-CAUSE:
The problem was with the predicates that were created according to
HiveJoinAddNotNullRule. This rule is creating predicates from fields that take
part in join filter, no matter if this fields are used as parameters of
functions or not.
SOLUTION:
Create predicate based on functions that take part in filters as well as
fields. The point is to check if left part and right part of the filter is not
null, not just fields that are part of the join filter. I.e we have to tables
*test1(a1 int, a2 int)* and *test2(b1)*. When we execute following query
*select * from ct1 c1 inner join ct2 c2 on (COALESCE(a1,b1)=a2);* we get to
predicates for filter operator:
b1 is not null --- right part
a1 is not null and a2 is not null -- left part
Applying predicate for left part of join will result in data loss as we exclude
rows with null fields. COALESCE is a good example for this case as the main
purpose of COALESCE function is to get not null values from tables. To fix the
data loss we need to check that coalesce won't bring us null values as we can't
join nulls. My fix will check that left part and right part will look like:
b1 is not null -- right part (still checking fields on null condition)
COALESCE(a1,a2) is not null (checking that whole function won't bring us null
values)
In next patch I'm going to change related failed tests with the fixed stage
plans.
was (Author: allgoodok):
ROOT-CAUSE:
The problem was with the predicates that were created according to
HiveJoinAddNotNullRule. This rule is creating predicates from fields that take
part in join filter, no matter if this fields are used as parameters of
functions or not.
SOLUTION:
Create predicate based on functions that take part in filters as well as
fields. The point is to check if left part and right part of the filter is not
null, not just fields that are part of the join filter. I.e we have to tables
test1(a1 int, a2 int) and test2(b1). When we execute following query *select *
from ct1 c1 inner join ct2 c2 on (COALESCE(a1,b1)=a2);* we get to predicates
for filter operator:
b1 is not null --- right part
a1 is not null and a2 is not null -- left part
Applying predicate for left part of join will result in data loss as we exclude
rows with null fields. COALESCE is a good example for this case as the main
purpose of COALESCE function is to get not null values from tables. To fix the
data loss we need to check that coalesce won't bring us null values as we can't
join nulls. My fix will check that left part and right part will look like:
b1 is not null -- right part (still checking fields on null condition)
COALESCE(a1,a2) is not null (checking that whole function won't bring us null
values)
In next patch I'm going to change related failed tests with the fixed stage
plans.
> Incorrect result for Hive join query with COALESCE in WHERE condition
> ---------------------------------------------------------------------
>
> Key: HIVE-17148
> URL: https://issues.apache.org/jira/browse/HIVE-17148
> Project: Hive
> Issue Type: Bug
> Components: CBO
> Affects Versions: 2.1.1
> Reporter: Vlad Gudikov
> Assignee: Vlad Gudikov
> Attachments: HIVE-17148.patch
>
>
> The issue exists in Hive-2.1. In Hive-1.2 the query works fine with cbo
> enabled:
> STEPS TO REPRODUCE:
> {code}
> Step 1: Create a table ct1
> create table ct1 (a1 string,b1 string);
> Step 2: Create a table ct2
> create table ct2 (a2 string);
> Step 3 : Insert following data into table ct1
> insert into table ct1 (a1) values ('1');
> Step 4 : Insert following data into table ct2
> insert into table ct2 (a2) values ('1');
> Step 5 : Execute the following query
> select * from ct1 c1, ct2 c2 where COALESCE(a1,b1)=a2;
> {code}
> ACTUAL RESULT:
> {code}
> The query returns nothing;
> {code}
> EXPECTED RESULT:
> {code}
> 1 NULL 1
> {code}
> The issue seems to be because of the incorrect query plan. In the plan we can
> see:
> predicate:(a1 is not null and b1 is not null)
> which does not look correct. As a result, it is filtering out all the rows is
> any column mentioned in the COALESCE has null value.
> Please find the query plan below:
> {code}
> Plan optimized by CBO.
> Vertex dependency in root stage
> Map 1 <- Map 2 (BROADCAST_EDGE)
> Stage-0
> Fetch Operator
> limit:-1
> Stage-1
> Map 1
> File Output Operator [FS_10]
> Map Join Operator [MAPJOIN_15] (rows=1 width=4)
>
> Conds:SEL_2.COALESCE(_col0,_col1)=RS_7._col0(Inner),HybridGraceHashJoin:true,Output:["_col0","_col1","_col2"]
> <-Map 2 [BROADCAST_EDGE]
> BROADCAST [RS_7]
> PartitionCols:_col0
> Select Operator [SEL_5] (rows=1 width=1)
> Output:["_col0"]
> Filter Operator [FIL_14] (rows=1 width=1)
> predicate:a2 is not null
> TableScan [TS_3] (rows=1 width=1)
> default@ct2,c2,Tbl:COMPLETE,Col:NONE,Output:["a2"]
> <-Select Operator [SEL_2] (rows=1 width=4)
> Output:["_col0","_col1"]
> Filter Operator [FIL_13] (rows=1 width=4)
> predicate:(a1 is not null and b1 is not null)
> TableScan [TS_0] (rows=1 width=4)
> default@ct1,c1,Tbl:COMPLETE,Col:NONE,Output:["a1","b1"]
> {code}
> This happens only if join is inner type, otherwise HiveJoinAddNotRule which
> creates this problem is skipped.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)