Seonggon Namgung created HIVE-27226:
---------------------------------------

             Summary: FullOuterJoin with filter expressions is not computed 
correctly
                 Key: HIVE-27226
                 URL: https://issues.apache.org/jira/browse/HIVE-27226
             Project: Hive
          Issue Type: Bug
            Reporter: Seonggon Namgung


I tested many OuterJoin queries as an extension of HIVE-27138, and I found that 
Hive returns incorrect result for a query containing FullOuterJoin with filter 
expressions. In a nutshell, all JoinOperators that run on Tez engine return 
incorrect result for OuterJoin queries, and one of the reason for incorrect 
computation comes from CommonJoinOperator, which is the base of all 
JoinOperators. I attached the queries and configuration that I used at the 
bottom of the document. I am still inspecting this problems, and I will share 
an update once when I find out another reason. Also any comments and opinions 
would be appreciated.


First of all, I observed that current Hive ignores filter expressions contained 
in MapJoinOperator. For example, the attached result of query1 shows that 
MapJoinOperator performs inner join, not full outer join. This problem stems 
from removal of filterMap. When converting JoinOperator to MapJoinOperator, 
ConvertJoinMapJoin#convertJoinDynamicPartitionedHashJoin() removes filterMap of 
MapJoinOperator. Because MapJoinOperator does not evaluate filter expressions 
if filterMap is null, this change makes MapJoinOperator ignore filter 
expressions and it always joins tables regardless whether they satisfy filter 
expressions or not. To solve this problem, I disable 
FullOuterMapJoinOptimization and apply path for HIVE-27138, which prevents NPE. 
(The patch is available at the following link: LINK.) The rest of this document 
uses this modified Hive, but most of problems happen to current Hive, too.


The second problem I found is that Hive returns the same left-null or 
right-null rows multiple time when it uses MapJoinOperator or 
CommonMergeJoinOperator. This is caused by the logic of current 
CommonJoinOperator. Both of the two JoinOperators joins tables in 2 steps. 
First, they create RowContainers, each of which is a group of rows from one 
table and has the same key. Second, they call 
CommonJoinOperator#checkAndGenObject() with created RowContainers. This method 
checks filterTag of each row in RowContainers and forwards joined row if they 
meet all filter conditions. For OuterJoin, checkAndGenObject() forwards 
non-matching rows if there is no matching row in RowContainer. The problem 
happens when there are multiple RowContainer for the same key and table. For 
example, suppose that there are two left RowContainers and one right 
RowContainer. If none of the row in two left RowContainers satisfies filter 
condition, then checkAndGenObject() will forward Left-Null row for each right 
row. Because checkAndGenObject() is called with each left RowContainer, there 
will be two duplicated Left-Null rows for every right row.


In the case of MapJoinOperator, it always creates singleton RowContainer for 
big table. Therefore, it always produces duplicated non-matching rows. 
CommonMergeJoinOperator also creates multiple RowContainer for big table, whose 
size is hive.join.emit.interval. In the below experiment, I also set 
hive.join.shortcut.unmatched.rows=false, and hive.exec.reducers.max=1 to 
disable specialized algorithm for OuterJoin of 2 tables and force calling 
checkAndGenObject() before all rows with the same keys are gathered. I didn't 
observe this problem when using VectorMapJoinOperator, and I will inspect 
VectorMapJoinOperator whether we can reproduce the problem with it.


I think the second problem is not limited to FullOuterJoin, but I couldn't find 
such query as of now. This will also be added to this issue if I can write a 
query that reproduces the second problem without FullOuterJoin.


I also found that Hive returns wrong result for query2 even when I used 
VectorMapJoinOperator. I am still inspecting this problem and I will add an 
update on it when I find out the reason.

 

Experiment:

 
{code:java}
---- Configuration
set hive.optimize.shared.work=false;

-- Std MapJoin
set hive.auto.convert.join=true;
set hive.vectorized.execution.enabled=false;

-- Vec MapJoin
set hive.auto.convert.join=true;
set hive.vectorized.execution.enabled=true;

-- MergeJoin
set hive.auto.convert.join=false;
set hive.vectorized.execution.enabled=false;
set hive.join.shortcut.unmatched.rows=false;
set hive.join.emit.interval=1;
set hive.exec.reducers.max=1;
 
---- Queries
-- Query 1
DROP TABLE IF EXISTS a;
CREATE TABLE a (key string, value string);
INSERT INTO a VALUES (1, 1), (1, 2), (2, 1);
SELECT * FROM a FULL OUTER JOIN a b ON a.key = b.key AND a.key < 0;

-- Query 2
DROP TABLE IF EXISTS b;
CREATE TABLE b (key string, value string);
INSERT INTO b VALUES (1, 0), (1, 1);
SELECT * FROM b FULL OUTER JOIN b a ON a.key = b.key AND a.value > 0 AND 
b.value > 0;{code}
 

 

Experiment result:

 
{code:java}
-- PostgresSQL
-- Query1
key | value | key | value
-----+-------+-----+-------
   1 |     1 |     |
   1 |     2 |     |
   2 |     1 |     |
     |       |   1 |     2
     |       |   1 |     1
     |       |   2 |     1
(6 rows)

-- Query2
 key | value | key | value
-----+-------+-----+-------
   1 |     0 |     |      
   1 |     1 |   1 |     1
     |       |   1 |     0
(3 rows){code}
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to