David Lavati reopened HIVE-22098:

> Data loss occurs when multiple tables are join with different bucket_version
> ----------------------------------------------------------------------------
>                 Key: HIVE-22098
>                 URL: https://issues.apache.org/jira/browse/HIVE-22098
>             Project: Hive
>          Issue Type: Bug
>          Components: Operators
>    Affects Versions: 3.1.0
>            Reporter: LuGuangMing
>            Assignee: LuGuangMing
>            Priority: Major
>         Attachments: HIVE-22098.1.patch, image-2019-08-12-18-45-15-771.png, 
> join_test.sql, table_a_data.orc, table_b_data.orc, table_c_data.orc
> When tables with different bucketing versions are joined and the number of 
> reducers is greater than 2, rows can easily be lost from the result.
> *Scenario 1*: Three tables are joined. The intermediate result of joining 
> table_a (the first table) with table_b (the second table) is referred to as 
> tmp_a_b. When tmp_a_b is joined with the third table, that table has 
> bucket_version=2, the default for tables created since hive-3.0.0, while the 
> intermediate data tmp_a_b is initialized with bucketVersion=-1, so its 
> ReduceSinkOperator also joins with bucketVersion=-1. In the init method, the 
> hash algorithm applied to the join column is chosen according to 
> bucketVersion: if bucketVersion = 2 and the operation is not an ACID 
> operation, the new hash algorithm is used; otherwise, the old hash algorithm 
> is used. Because the two sides use different hash algorithms, rows with the 
> same key are assigned to different partitions. At the reducer stage, rows 
> with the same key therefore cannot be paired, which results in data loss.
> *Scenario 2*: Create two test tables: create table 
> table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES 
> ('bucketing_version'='1'); create table table_bucketversion_2(col_1 string, 
> col_2 string) TBLPROPERTIES ('bucketing_version'='2');
> When table_bucketversion_1 is joined with table_bucketversion_2, part of the 
> result data is lost because their bucketVersion values differ.
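
The mechanism described in the report can be illustrated with a minimal Python sketch. It is not Hive code: hash_v1 and hash_v2 are stand-in hash functions chosen only because they differ from each other, and NUM_REDUCERS, partition and reduce_side_join are hypothetical names introduced for this example. The sketch routes each row to reducer hash(key) % NUM_REDUCERS and joins only rows that land on the same reducer, which is the assumption behind the data loss described above.

```python
import zlib

NUM_REDUCERS = 3  # hypothetical reducer count (greater than 2, as in the report)


def hash_v1(key: str) -> int:
    """Stand-in for the old hash: a Java-style 31-based string hash."""
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0x7FFFFFFF
    return h


def hash_v2(key: str) -> int:
    """Stand-in for the new hash (CRC32 here, NOT Hive's actual function)."""
    return zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF


def partition(rows, hash_fn, num_reducers=NUM_REDUCERS):
    """Route each (key, value) row to reducer hash_fn(key) % num_reducers."""
    reducers = [dict() for _ in range(num_reducers)]
    for key, value in rows:
        reducers[hash_fn(key) % num_reducers][key] = value
    return reducers


def reduce_side_join(left_parts, right_parts):
    """Pair rows by key, but only within the same reducer."""
    joined = {}
    for left, right in zip(left_parts, right_parts):
        for key in left.keys() & right.keys():
            joined[key] = (left[key], right[key])
    return joined


keys = [f"key_{i}" for i in range(100)]
left_rows = [(k, "left") for k in keys]
right_rows = [(k, "right") for k in keys]

# Both sides use the same hash: every key meets its partner.
consistent = reduce_side_join(partition(left_rows, hash_v1),
                              partition(right_rows, hash_v1))

# The sides use different hashes: keys whose reducers differ are never paired.
mixed = reduce_side_join(partition(left_rows, hash_v1),
                         partition(right_rows, hash_v2))

print("matched with one hash on both sides:", len(consistent))
print("matched with different hashes:", len(mixed))
```

With the same hash on both sides, all 100 keys are matched. With different hashes, only the keys that happen to map to the same reducer under both functions survive, so the join silently returns fewer rows.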

This message was sent by Atlassian Jira
