[
https://issues.apache.org/jira/browse/HIVE-21481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HIVE-21481:
----------------------------------
Labels: pull-request-available (was: )
> MERGE correctness issues with null safe equality
> ------------------------------------------------
>
> Key: HIVE-21481
> URL: https://issues.apache.org/jira/browse/HIVE-21481
> Project: Hive
> Issue Type: Bug
> Components: Query Planning
> Reporter: Vineet Garg
> Assignee: Krisztian Kasa
> Priority: Major
> Labels: pull-request-available
>
> The way Hive currently generates plan for MERGE statement can lead to wrong
> results with null safe equality.
> To illustrate consider the following reproducer
> {code:sql}
> create table ttarget(s string, j int, flag string) stored as orc
> tblproperties("transactional"="true");
> truncate table ttarget;
> insert into ttarget values('not_null', 1, 'dont udpate'), (null,2, 'update');
> create table tsource (i int);
> insert into tsource values(null),(2);
> {code}
> Let's say you have the following MERGE statement
> {code:sql}
> explain merge into ttarget using tsource on i<=>j
> when matched THEN
> UPDATE set flag='updated'
> when not matched THEN
> INSERT VALUES('new', 1999, 'true');
> {code}
> With this MERGE {{*ONLY ONE*}} row should match in target which should be
> updated. But currently due to the plan hive generate it will end up matching
> both rows.
> This is because MERGE statement is rewritten into RIGHT OUTER JOIN + FILTER
> corresponding to all branches.
> The part of the plan generated by hive for this statement consist of:
> {noformat}
> Map 2
> Map Operator Tree:
> TableScan
> alias: tsource
> Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE
> Column stats: NONE
> Map Join Operator
> condition map:
> Right Outer Join 0 to 1
> keys:
> 0 j (type: int)
> 1 i (type: int)
> nullSafes: [true]
> outputColumnNames: _col0, _col1, _col5, _col6
> input vertices:
> 0 Map 1
> Statistics: Num rows: 1 Data size: 206 Basic stats:
> COMPLETE Column stats: NONE
> HybridGraceHashJoin: true
> Filter Operator
> predicate: (_col6 IS NOT DISTINCT FROM _col1) (type:
> boolean)
> Statistics: Num rows: 1 Data size: 206 Basic stats:
> COMPLETE Column stats: NONE
> Select Operator
> expressions: _col5 (type:
> struct<writeid:bigint,bucketid:int,rowid:bigint>), _col0 (type: string),
> _col1 (type: int)
> outputColumnNames: _col0, _col1, _col2
> Statistics: Num rows: 1 Data size: 206 Basic stats:
> COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: _col0 (type:
> struct<writeid:bigint,bucketid:int,rowid:bigint>)
> sort order: +
> Map-reduce partition columns: UDFToInteger(_col0)
> (type: int)
> Statistics: Num rows: 1 Data size: 206 Basic stats:
> COMPLETE Column stats: NONE
> value expressions: _col1 (type: string), _col2
> (type: int)
> {noformat}
> Result after JOIN will be :
> {code:sql}
> select s,j,i from ttarget right outer join tsource on i<=>j ;
> NULL NULL NULL
> NULL NULL 2
> {code}
> On this resultset predicate {{(_col6 IS NOT DISTINCT FROM _col1)}} will be
> true for both resulting into both rows matching.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)