[jira] [Updated] (HIVE-25836) Tez union all operation may cause duplicate data

Yao Guangdong (Jira) Wed, 29 Dec 2021 19:10:04 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-25836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yao Guangdong updated HIVE-25836:
---------------------------------
    Target Version/s: 2.3.10

> Tez union all operation may cause duplicate data
> ------------------------------------------------
>
>                 Key: HIVE-25836
>                 URL: https://issues.apache.org/jira/browse/HIVE-25836
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 2.3.0, 2.3.8
>            Reporter: Yao Guangdong
>            Assignee: Yao Guangdong
>            Priority: Critical
>         Attachments: HIVE-25836.0001.patch
>
>
> A Tez UNION ALL operation can produce duplicate data in some cases. The 
> operation can create subdirectories under the table or partition directory. 
> Each subdirectory is named with a number, and the result data files are 
> stored inside it. If speculative execution runs a task attempt, the 
> speculative attempt's result file is also written to that subdirectory. 
> When the job finishes, the Hive client deletes the files left by duplicate 
> task attempts, but it only checks one directory level for them. Because the 
> files are inside subdirectories, the duplicate attempts' files there are 
> not deleted, and the result contains duplicate data.
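
The behaviour described above can be illustrated with a small sketch of a
cleanup pass that descends into subdirectories instead of checking a single
level. This is not Hive's code and not the attached patch. The class name
DedupSketch, the method names, the "<taskId>_<attemptId>" file naming (for
example 000000_0 and 000000_1), and the policy of keeping the first attempt
in name order are all assumptions made for illustration. It uses the local
filesystem through java.nio.file rather than the Hadoop FileSystem API.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, for illustration only.
class DedupSketch {

    // Assumed naming: "<taskId>_<attemptId>"; the task id is everything
    // before the last underscore.
    static String taskIdOf(String fileName) {
        int idx = fileName.lastIndexOf('_');
        return idx < 0 ? fileName : fileName.substring(0, idx);
    }

    // Deletes all but one file per task id in dir and, recursively, in every
    // subdirectory. Returns the paths that were deleted.
    static List<Path> removeDuplicates(Path dir) throws IOException {
        List<Path> entries = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                entries.add(p);
            }
        }
        // Sort so that the choice of which attempt to keep is deterministic.
        Collections.sort(entries);

        List<Path> deleted = new ArrayList<>();
        Set<String> seenTaskIds = new HashSet<>();
        for (Path entry : entries) {
            if (Files.isDirectory(entry)) {
                // A single-level check would skip this entry; the sketch
                // descends into it, each directory deduplicated on its own.
                deleted.addAll(removeDuplicates(entry));
                continue;
            }
            String taskId = taskIdOf(entry.getFileName().toString());
            if (!seenTaskIds.add(taskId)) {
                // Another attempt of the same task was already kept.
                Files.delete(entry);
                deleted.add(entry);
            }
        }
        return deleted;
    }

    public static void main(String[] args) throws IOException {
        // Build a partition directory with a numbered subdirectory that
        // holds two attempts of the same task.
        Path partition = Files.createTempDirectory("partition");
        Path sub = Files.createDirectory(partition.resolve("1"));
        Files.write(sub.resolve("000000_0"), new byte[] {1});
        Files.write(sub.resolve("000000_1"), new byte[] {1});
        System.out.println("deleted: " + removeDuplicates(partition));
    }
}
```

A check that only listed the files directly under the partition directory
would see the subdirectory "1" as a single entry and leave both attempt
files in place, which is the situation the report describes.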



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
