[
https://issues.apache.org/jira/browse/HIVE-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042715#comment-17042715
]
JithendhiraKumar commented on HIVE-22098:
-----------------------------------------
[~luguangming] has already mentioned the steps to reproduce *Scenario 1.*
Here are Steps To Reproduce *Scenario 2*: (test_data.csv can be found in the
attachments)
{code:java}
CREATE TABLE `join_test_1`(`id` string, `first` string, `email` string) ROW
FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH
SERDEPROPERTIES ('field.delim'=',', 'serialization.format'=',') STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' TBLPROPERTIES
('bucketing_version'='1');
LOAD DATA LOCAL INPATH '/uploads/test_data.csv' OVERWRITE INTO TABLE
join_test_1;
CREATE TABLE `join_test_2`(`id` string, `first` string, `email` string) ROW
FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH
SERDEPROPERTIES ('field.delim'=',', 'serialization.format'=',') STORED AS
INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' TBLPROPERTIES
('bucketing_version'='2');
LOAD DATA LOCAL INPATH '/uploads/test_data.csv' OVERWRITE INTO TABLE
join_test_2;
Query:
set mapred.reduce.tasks=2;
set hive.auto.convert.join=false;
SELECT * from (SELECT id from join_test_1) as tbl1 LEFT JOIN (SELECT id from
join_test_2) as tbl2 on tbl1.id = tbl2.id;
OutPut: (Wrong Results/Data Loss)
+----------+----------+
| tbl1.id | tbl2.id |
+----------+----------+
| 0 | NULL |
| 2 | NULL |
| 4 | NULL |
| 6 | NULL |
| 8 | 8 |
| 1 | NULL |
| 3 | NULL |
| 5 | 5 |
| 7 | NULL |
| 9 | NULL |
+----------+----------+
Expected Result:
+----------+----------+
| tbl1.id | tbl2.id |
+----------+----------+
| 1 | 1 |
| 3 | 3 |
| 7 | 7 |
| 8 | 8 |
| 9 | 9 |
| 0 | 0 |
| 2 | 2 |
| 4 | 4 |
| 5 | 5 |
| 6 | 6 |
+----------+----------+
{code}
.
> Data loss occurs when multiple tables are join with different bucket_version
> ----------------------------------------------------------------------------
>
> Key: HIVE-22098
> URL: https://issues.apache.org/jira/browse/HIVE-22098
> Project: Hive
> Issue Type: Bug
> Components: Operators
> Affects Versions: 3.1.0, 3.1.2
> Reporter: LuGuangMing
> Assignee: LuGuangMing
> Priority: Blocker
> Labels: data-loss, wrongresults
> Attachments: HIVE-22098.1.patch, image-2019-08-12-18-45-15-771.png,
> join_test.sql, table_a_data.orc, table_b_data.orc, table_c_data.orc
>
>
> When different bucketVersion of tables do join and no of reducers is greater
> than 2, the result is incorrect (*data loss*).
> *Scenario 1*: Three tables join. The temporary result data of table_a in the
> first table and table_b in the second table joins result is recorded as
> tmp_a_b, When it joins with the third table, the bucket_version=2 of the
> table created by default after hive-3.0.0, temporary data tmp_a_b initialized
> the bucketVerison=-1, and then ReduceSinkOperator Verketison=-1 is joined. In
> the init method, the hash algorithm of selecting join column is selected
> according to bucketVersion. If bucketVersion = 2 and is not an acid
> operation, it will acquired the new algorithm of hash. Otherwise, the old
> algorithm of hash is acquired. Because of the inconsistency of the algorithm
> of hash, the partition of data allocation caused are different. At stage of
> Reducer, Data with the same key can not be paired resulting in data loss.
> *Scenario 2*: create two test tables, create table
> table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES
> ('bucketing_version'='1'); table_bucketversion_2(col_1 string, col_2 string)
> TBLPROPERTIES ('bucketing_version'='2');
> when use table_bucketversion_1 to join table_bucketversion_2, partial result
> data will be loss due to bucketVerison is different.
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)