[ https://issues.apache.org/jira/browse/HIVE-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216551#comment-17216551 ]
Zoltan Haindrich commented on HIVE-22098:
-----------------------------------------
I somehow missed this ticket. Note that HIVE-21304 fixed a few issues
related to bucketing_version, so this might already be fixed on master.
> Data loss occurs when multiple tables are join with different bucket_version
> ----------------------------------------------------------------------------
>
> Key: HIVE-22098
> URL: https://issues.apache.org/jira/browse/HIVE-22098
> Project: Hive
> Issue Type: Bug
> Components: Operators
> Affects Versions: 3.1.0, 3.1.2
> Reporter: GuangMing Lu
> Assignee: yongtaoliao
> Priority: Blocker
> Labels: data-loss, wrongresults
> Attachments: HIVE-22098.1.patch, image-2019-08-12-18-45-15-771.png,
> join_test.sql, table_a_data.orc, table_b_data.orc, table_c_data.orc
>
>
> When tables with different bucketVersion values are joined and the number of
> reducers is greater than 2, the result is incorrect (*data loss*).
> *Scenario 1*: Three tables are joined. The intermediate result of joining
> table_a (the first table) with table_b (the second table) is recorded as
> tmp_a_b. When tmp_a_b is joined with the third table, that table has
> bucket_version=2, the default for tables created since hive-3.0.0, while
> the intermediate data tmp_a_b is initialized with bucketVersion=-1, so its
> ReduceSinkOperator enters the join with bucketVersion=-1. In the init
> method, the hash algorithm for the join column is selected according to
> bucketVersion: if bucketVersion = 2 and the operation is not an ACID
> operation, the new hash algorithm is used; otherwise, the old hash
> algorithm is used. Because the two sides use inconsistent hash algorithms,
> they partition their data differently. At the reducer stage, rows with the
> same key cannot be paired, resulting in data loss.
> *Scenario 2*: Create two test tables:
> create table table_bucketversion_1(col_1 string, col_2 string)
> TBLPROPERTIES ('bucketing_version'='1');
> create table table_bucketversion_2(col_1 string, col_2 string)
> TBLPROPERTIES ('bucketing_version'='2');
> When table_bucketversion_1 is joined with table_bucketversion_2, part of the
> result data is lost because the bucketVersion values differ.
>
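The mechanism described in the report can be illustrated with a small
standalone sketch. This is not Hive code: hash_v1 and hash_v2 are stand-in
hash functions (a Java-style string hash and FNV-1a) chosen only because they
differ from each other, and shuffle and reduce_side_join are hypothetical
helpers. The point is only that when the two sides of a join route keys to
reducers with different hash functions, equal keys can land on different
reducers and never meet.

```python
def hash_v1(key: str) -> int:
    """Stand-in for the 'old' algorithm: Java String.hashCode-style hash."""
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def hash_v2(key: str) -> int:
    """Stand-in for the 'new' algorithm: 32-bit FNV-1a (an assumption, not Hive's hash)."""
    h = 0x811C9DC5
    for b in key.encode("utf-8"):
        h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
    return h


def shuffle(keys, hash_fn, num_reducers):
    """Assign each key to a reducer by hash modulo the reducer count."""
    parts = [set() for _ in range(num_reducers)]
    for k in keys:
        parts[hash_fn(k) % num_reducers].add(k)
    return parts


def reduce_side_join(left_parts, right_parts):
    """Each reducer can only pair keys that were routed to it from both sides."""
    matched = set()
    for left, right in zip(left_parts, right_parts):
        matched |= left & right
    return matched


keys = list("abcdefgh")
NUM_REDUCERS = 3

# Both sides use the same hash: every key finds its partner.
consistent = reduce_side_join(
    shuffle(keys, hash_v1, NUM_REDUCERS), shuffle(keys, hash_v1, NUM_REDUCERS)
)

# The sides use different hashes: some keys end up on different reducers.
mixed = reduce_side_join(
    shuffle(keys, hash_v1, NUM_REDUCERS), shuffle(keys, hash_v2, NUM_REDUCERS)
)

# With a single reducer the mismatch is hidden, since every key goes to reducer 0.
single = reduce_side_join(shuffle(keys, hash_v1, 1), shuffle(keys, hash_v2, 1))

lost = sorted(set(keys) - mixed)
print("matched with consistent hashing:", sorted(consistent))
print("matched with mixed hashing:     ", sorted(mixed))
print("lost with mixed hashing:        ", lost)
```

In this sketch the key "b" goes to reducer 2 under hash_v1 and to reducer 1
under hash_v2, so its two rows are never paired. With a single reducer no key
is lost, because both hashes send everything to the same place.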