[ 
https://issues.apache.org/jira/browse/HIVE-26659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698761#comment-17698761
 ] 

Aman Sinha commented on HIVE-26659:
-----------------------------------

I found that the wrong results happens if there is a MergeJoin feeding into the 
MapJoin(anti-join in this case).  If both the joins are MergeJoins or if both 
are MapJoins, the results are correct. After a few attempts, I can repro the 
issue with the following simplified test case using 3 tables and faking the 
stats to force MergeJoin:

{noformat}
create table tt1 (ws_order_number bigint, ws_ext_ship_cost decimal(7, 2));
create table tt2 (ws_order_number bigint);
create table tt3 (wr_order_number bigint);

insert into tt1 values (42, 3093.96), (1041, 299.28), (1378, 85.56), (1378, 
719.44), (1395, 145.68);
insert into tt2 values (1378), (1395);
insert into tt3 values (42), (1041);

alter table tt1 update statistics set ('numRows'='10000000');
alter table tt2 update statistics set ('numRows'='10000000');
alter table tt3 update statistics set ('numRows'='2');

select * from tt1;
+----------------------+-----------------------+
| tt1.ws_order_number  | tt1.ws_ext_ship_cost  |
+----------------------+-----------------------+
| 42                   | 3093.96               |
| 1041                 | 299.28                |
| 1378                 | 85.56                 |
| 1378                 | 719.44                |
| 1395                 | 145.68                |
+----------------------+-----------------------+
5 rows selected (0.138 seconds)

select * from tt2;
+----------------------+
| tt2.ws_order_number  |
+----------------------+
| 1378                 |
| 1395                 |
+----------------------+
2 rows selected (0.105 seconds)

select * from tt3;
+----------------------+
| tt3.wr_order_number  |
+----------------------+
| 42                   |
| 1041                 |
+----------------------+
2 rows selected (0.164 seconds)
{noformat}

Simplified query:
{noformat}
select
 sum(ws_ext_ship_cost) as "total shipping cost"
from
tt1 ws1, tt2 ws2 where ws1.ws_order_number = ws2.ws_order_number
and not exists(select * from tt3 wr1 where ws1.ws_order_number = 
wr1.wr_order_number)

// wrong results with anti-join enabled

set hive.auto.convert.anti.join=true;
+----------------------+
| total shipping cost  |
+----------------------+
| NULL                 |
+----------------------+

// right results with anti-join disabled

set hive.auto.convert.anti.join=false;
+----------------------+
| total shipping cost  |
+----------------------+
| 950.68               |
+----------------------+

{noformat}

Explain plan with anti-join enabled:
{noformat}
+----------------------------------------------------+
|                      Explain                       |
+----------------------------------------------------+
| Plan optimized by CBO.                             |
|                                                    |
| Vertex dependency in root stage                    |
| Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE), Map 5 (BROADCAST_EDGE) 
|
| Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE)        |
|                                                    |
| Stage-0                                            |
|   Fetch Operator                                   |
|     limit:-1                                       |
|     Stage-1                                        |
|       Reducer 3 vectorized                         |
|       File Output Operator [FS_43]                 |
|         Group By Operator [GBY_42] (rows=1 width=112) |
|           Output:["_col0"],aggregations:["sum(VALUE._col0)"] |
|         <-Reducer 2 [CUSTOM_SIMPLE_EDGE]           |
|           PARTITION_ONLY_SHUFFLE [RS_19]           |
|             Group By Operator [GBY_18] (rows=1 width=112) |
|               Output:["_col0"],aggregations:["sum(_col1)"] |
|               Map Join Operator [MAPJOIN_30] (rows=11495000 width=114) |
|                 Conds:MERGEJOIN_29._col0=RS_41._col0(Anti),Output:["_col1"] |
|               <-Map 5 [BROADCAST_EDGE] vectorized  |
|                 BROADCAST [RS_41]                  |
|                   PartitionCols:_col0              |
|                   Group By Operator [GBY_40] (rows=2 width=8) |
|                     Output:["_col0"],keys:_col0    |
|                     Select Operator [SEL_39] (rows=2 width=8) |
|                       Output:["_col0"]             |
|                       Filter Operator [FIL_38] (rows=2 width=8) |
|                         predicate:wr_order_number is not null |
|                         TableScan [TS_6] (rows=2 width=8) |
|                           
tpcds@tt3,wr1,Tbl:COMPLETE,Col:NONE,Output:["wr_order_number"] |
|               <-Merge Join Operator [MERGEJOIN_29] (rows=10450000 width=114) |
|                   
Conds:RS_34._col0=RS_37._col0(Inner),Output:["_col0","_col1"] |
|                 <-Map 1 [SIMPLE_EDGE] vectorized   |
|                   SHUFFLE [RS_34]                  |
|                     PartitionCols:_col0            |
|                     Select Operator [SEL_33] (rows=9500000 width=114) |
|                       Output:["_col0","_col1"]     |
|                       Filter Operator [FIL_32] (rows=9500000 width=114) |
|                         predicate:ws_order_number is not null |
|                         TableScan [TS_0] (rows=10000000 width=114) |
|                           
tpcds@tt1,ws1,Tbl:COMPLETE,Col:NONE,Output:["ws_order_number","ws_ext_ship_cost"]
 |
|                 <-Map 4 [SIMPLE_EDGE] vectorized   |
|                   SHUFFLE [RS_37]                  |
|                     PartitionCols:_col0            |
|                     Select Operator [SEL_36] (rows=9500000 width=7) |
|                       Output:["_col0"]             |
|                       Filter Operator [FIL_35] (rows=9500000 width=7) |
|                         predicate:ws_order_number is not null |
|                         TableScan [TS_3] (rows=10000000 width=7) |
|                           
tpcds@tt2,ws2,Tbl:COMPLETE,Col:NONE,Output:["ws_order_number"] |
|                                                    |
+----------------------------------------------------+
{noformat}
Note that there's a MergeJoin below the MapJoin (antijoin). 

> TPC-DS query 16, 69, 94 return wrong results.
> ---------------------------------------------
>
>                 Key: HIVE-26659
>                 URL: https://issues.apache.org/jira/browse/HIVE-26659
>             Project: Hive
>          Issue Type: Sub-task
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Sungwoo Park
>            Priority: Major
>
> TPC-DS query 16, 69, 94 return wrong results when hive.auto.convert.anti.join 
> is set to true.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to