[
https://issues.apache.org/jira/browse/HIVE-26659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698761#comment-17698761
]
Aman Sinha commented on HIVE-26659:
-----------------------------------
I found that the wrong results happens if there is a MergeJoin feeding into the
MapJoin(anti-join in this case). If both the joins are MergeJoins or if both
are MapJoins, the results are correct. After a few attempts, I can repro the
issue with the following simplified test case using 3 tables and faking the
stats to force MergeJoin:
{noformat}
create table tt1 (ws_order_number bigint, ws_ext_ship_cost decimal(7, 2));
create table tt2 (ws_order_number bigint);
create table tt3 (wr_order_number bigint);
insert into tt1 values (42, 3093.96), (1041, 299.28), (1378, 85.56), (1378,
719.44), (1395, 145.68);
insert into tt2 values (1378), (1395);
insert into tt3 values (42), (1041);
alter table tt1 update statistics set ('numRows'='10000000');
alter table tt2 update statistics set ('numRows'='10000000');
alter table tt3 update statistics set ('numRows'='2');
select * from tt1;
+----------------------+-----------------------+
| tt1.ws_order_number | tt1.ws_ext_ship_cost |
+----------------------+-----------------------+
| 42 | 3093.96 |
| 1041 | 299.28 |
| 1378 | 85.56 |
| 1378 | 719.44 |
| 1395 | 145.68 |
+----------------------+-----------------------+
5 rows selected (0.138 seconds)
select * from tt2;
+----------------------+
| tt2.ws_order_number |
+----------------------+
| 1378 |
| 1395 |
+----------------------+
2 rows selected (0.105 seconds)
select * from tt3;
+----------------------+
| tt3.wr_order_number |
+----------------------+
| 42 |
| 1041 |
+----------------------+
2 rows selected (0.164 seconds)
{noformat}
Simplified query:
{noformat}
select
sum(ws_ext_ship_cost) as "total shipping cost"
from
tt1 ws1, tt2 ws2 where ws1.ws_order_number = ws2.ws_order_number
and not exists(select * from tt3 wr1 where ws1.ws_order_number =
wr1.wr_order_number)
// wrong results with anti-join enabled
set hive.auto.convert.anti.join=true;
+----------------------+
| total shipping cost |
+----------------------+
| NULL |
+----------------------+
// right results with anti-join disabled
set hive.auto.convert.anti.join=false;
+----------------------+
| total shipping cost |
+----------------------+
| 950.68 |
+----------------------+
{noformat}
Explain plan with anti-join enabled:
{noformat}
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| Plan optimized by CBO. |
| |
| Vertex dependency in root stage |
| Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE), Map 5 (BROADCAST_EDGE)
|
| Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE) |
| |
| Stage-0 |
| Fetch Operator |
| limit:-1 |
| Stage-1 |
| Reducer 3 vectorized |
| File Output Operator [FS_43] |
| Group By Operator [GBY_42] (rows=1 width=112) |
| Output:["_col0"],aggregations:["sum(VALUE._col0)"] |
| <-Reducer 2 [CUSTOM_SIMPLE_EDGE] |
| PARTITION_ONLY_SHUFFLE [RS_19] |
| Group By Operator [GBY_18] (rows=1 width=112) |
| Output:["_col0"],aggregations:["sum(_col1)"] |
| Map Join Operator [MAPJOIN_30] (rows=11495000 width=114) |
| Conds:MERGEJOIN_29._col0=RS_41._col0(Anti),Output:["_col1"] |
| <-Map 5 [BROADCAST_EDGE] vectorized |
| BROADCAST [RS_41] |
| PartitionCols:_col0 |
| Group By Operator [GBY_40] (rows=2 width=8) |
| Output:["_col0"],keys:_col0 |
| Select Operator [SEL_39] (rows=2 width=8) |
| Output:["_col0"] |
| Filter Operator [FIL_38] (rows=2 width=8) |
| predicate:wr_order_number is not null |
| TableScan [TS_6] (rows=2 width=8) |
|
tpcds@tt3,wr1,Tbl:COMPLETE,Col:NONE,Output:["wr_order_number"] |
| <-Merge Join Operator [MERGEJOIN_29] (rows=10450000 width=114) |
|
Conds:RS_34._col0=RS_37._col0(Inner),Output:["_col0","_col1"] |
| <-Map 1 [SIMPLE_EDGE] vectorized |
| SHUFFLE [RS_34] |
| PartitionCols:_col0 |
| Select Operator [SEL_33] (rows=9500000 width=114) |
| Output:["_col0","_col1"] |
| Filter Operator [FIL_32] (rows=9500000 width=114) |
| predicate:ws_order_number is not null |
| TableScan [TS_0] (rows=10000000 width=114) |
|
tpcds@tt1,ws1,Tbl:COMPLETE,Col:NONE,Output:["ws_order_number","ws_ext_ship_cost"]
|
| <-Map 4 [SIMPLE_EDGE] vectorized |
| SHUFFLE [RS_37] |
| PartitionCols:_col0 |
| Select Operator [SEL_36] (rows=9500000 width=7) |
| Output:["_col0"] |
| Filter Operator [FIL_35] (rows=9500000 width=7) |
| predicate:ws_order_number is not null |
| TableScan [TS_3] (rows=10000000 width=7) |
|
tpcds@tt2,ws2,Tbl:COMPLETE,Col:NONE,Output:["ws_order_number"] |
| |
+----------------------------------------------------+
{noformat}
Note that there's a MergeJoin below the MapJoin (antijoin).
> TPC-DS query 16, 69, 94 return wrong results.
> ---------------------------------------------
>
> Key: HIVE-26659
> URL: https://issues.apache.org/jira/browse/HIVE-26659
> Project: Hive
> Issue Type: Sub-task
> Affects Versions: 4.0.0-alpha-2
> Reporter: Sungwoo Park
> Priority: Major
>
> TPC-DS query 16, 69, 94 return wrong results when hive.auto.convert.anti.join
> is set to true.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)