Youjun Yuan created HIVE-26111:
----------------------------------
Summary: FULL JOIN returns incorrect result with Tez engine
Key: HIVE-26111
URL: https://issues.apache.org/jira/browse/HIVE-26111
Project: Hive
Issue Type: Bug
Environment: aws EMR (hive 3.1.2 + Tez 0.10.1)
Reporter: Youjun Yuan
we hit a query which FULL JOINs two tables, hive produces incorrect results,
for a single value of join key, it produces two records, each record has a
valid value for one table and NULL for the other table.
The query is:
{code:java}
SELECT d.id, u.id
FROM (
SELECT id
FROM airflow.tableA rud
WHERE rud.dt = '2022-04-02-1row'
) d
FULL JOIN (
SELECT id
FROM default.tableB
WHERE dt = '2022-04-01' and device_token='blabla'
) u
ON u.id = d.id
; {code}
And produces two records for id=350570497
{code:java}
350570497 NULL
NULL 350570497
Time taken: 62.692 seconds, Fetched: 2 row(s) {code}
I am sure tableB has only one row where device_token='blabla'And we tried: 1,
SET mapreduce.job.reduces=1; then it produces right result;2, SET
hive.execution.engine=mr; then it produces right result;3, JOIN (instead of
FULL JOIN) worked as expected 4, in sub query u, change filter
device_token='blabla' to id=350570497, it worked okBelow is the explain output
of the query:
{code:java}
Plan optimized by CBO.Vertex dependency in root stage
Reducer 3 <- Map 1 (CUSTOM_SIMPLE_EDGE), Map 2 (CUSTOM_SIMPLE_EDGE)Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 3
File Output Operator [FS_10]
Map Join Operator [MAPJOIN_13] (rows=2 width=8)
Conds:RS_6.KEY.reducesinkkey0=RS_7.KEY.reducesinkkey0(Outer),DynamicPartitionHashJoin:true,Output:["_col0","_col1"]
<-Map 1 [CUSTOM_SIMPLE_EDGE]
PARTITION_ONLY_SHUFFLE [RS_6]
PartitionCols:_col0
Select Operator [SEL_2] (rows=1 width=4)
Output:["_col0"]
TableScan [TS_0] (rows=1 width=4)
airflow@rds_users_delta,rud,Tbl:COMPLETE,Col:COMPLETE,Output:["id"]
<-Map 2 [CUSTOM_SIMPLE_EDGE]
PARTITION_ONLY_SHUFFLE [RS_7]
PartitionCols:_col0
Select Operator [SEL_5] (rows=1 width=4)
Output:["_col0"]
Filter Operator [FIL_12] (rows=1 width=110)
predicate:(device_token = 'blabla')
TableScan [TS_3] (rows=215192362 width=109)
default@users,users,Tbl:COMPLETE,Col:COMPLETE,Output:["id","device_token"]
{code}
I can't generate a small enough result set to reproduce the issue, I have
minimized the tableA to only 1 row, tableB has ~200m rows, but if I further
reduce the size of tableB, then the issue can't be reproduced. any suggestion
would be highly appreciated, regarding the root cause of the issue, how to work
around it, or how to reproduce it with small enough dataset.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)