LuGuangMing created HIVE-22098:
----------------------------------
Summary: Data loss occurs when joins occur on tables with
different bucket_version
Key: HIVE-22098
URL: https://issues.apache.org/jira/browse/HIVE-22098
Project: Hive
Issue Type: Bug
Components: Operators
Affects Versions: 3.1.0
Reporter: LuGuangMing
Assignee: LuGuangMing
When tables with different bucketVersion values are joined and the number of
reducers is greater than 2, the result can easily lose data.
*Scenario 1*: Three tables are joined. The intermediate result of joining
table_a (the first table) with table_b (the second table) is recorded as
tmp_a_b. When tmp_a_b is joined with the third table, that table has
bucket_version=2, the default for tables created since hive-3.0.0, while the
temporary data tmp_a_b is initialized with bucketVersion=-1, so it is joined
by a ReduceSinkOperator with bucketVersion=-1. In the init method, the hash
algorithm applied to the join column is selected according to bucketVersion.
If bucketVersion = 2 and the operation is not an ACID operation, the new hash
algorithm is used. Otherwise, the old hash algorithm is used. Because the two
sides use different hash algorithms, their rows are assigned to different
partitions. At the reducer stage, rows with the same key cannot be paired,
resulting in data loss.
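The mechanism can be illustrated with a small simulation. This is only a sketch, not Hive code: old_hash and new_hash are stand-ins (a 31-based byte hash and CRC32) for the real old and new algorithms, and NUM_REDUCERS, pick_hash, partition and reduce_side_join are hypothetical names chosen for this example. It assumes only what is described above: the hash is chosen from bucketVersion, and rows are paired only inside the same reducer partition.

```python
import zlib

NUM_REDUCERS = 4  # hypothetical reducer count, greater than 2


def old_hash(key: str) -> int:
    # Stand-in for the old algorithm: a simple 31-based hash over the bytes.
    h = 0
    for b in key.encode("utf-8"):
        h = (31 * h + b) & 0xFFFFFFFF
    return h


def new_hash(key: str) -> int:
    # Stand-in for the new algorithm (CRC32 here, NOT the hash Hive uses).
    return zlib.crc32(key.encode("utf-8"))


def pick_hash(bucket_version: int, is_acid: bool = False):
    # Selection rule from the report: new hash only for version 2, non-ACID.
    return new_hash if bucket_version == 2 and not is_acid else old_hash


def partition(rows, bucket_version):
    # Assign each (key, value) row to a reducer using the selected hash.
    h = pick_hash(bucket_version)
    parts = [[] for _ in range(NUM_REDUCERS)]
    for key, value in rows:
        parts[h(key) % NUM_REDUCERS].append((key, value))
    return parts


def reduce_side_join(left, left_version, right, right_version):
    # Each reducer only sees its own partition of both inputs.
    out = []
    for lpart, rpart in zip(partition(left, left_version),
                            partition(right, right_version)):
        right_by_key = {}
        for k, v in rpart:
            right_by_key.setdefault(k, []).append(v)
        for k, v in lpart:
            for rv in right_by_key.get(k, []):
                out.append((k, v, rv))
    return out


keys = [f"k{i}" for i in range(100)]
left = [(k, "a") for k in keys]    # plays the role of tmp_a_b
right = [(k, "b") for k in keys]   # plays the role of the third table

consistent = reduce_side_join(left, 2, right, 2)   # same version on both sides
mixed = reduce_side_join(left, -1, right, 2)       # bucketVersion -1 vs 2

print(len(consistent), len(mixed))
```

With the same version on both sides every key is paired. With -1 on one side and 2 on the other, some keys land on different reducers and silently drop out of the result.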
*Scenario 2*: Create two test tables:
create table table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES
('bucketing_version'='1');
create table table_bucketversion_2(col_1 string, col_2 string) TBLPROPERTIES
('bucketing_version'='2');
When table_bucketversion_1 is joined with table_bucketversion_2, part of the
result data is lost because the bucketVersion values differ.
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)