[
https://issues.apache.org/jira/browse/HIVE-25836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yao Guangdong updated HIVE-25836:
---------------------------------
Target Version/s: 2.3.10
> Tez union all operation may cause duplicate data
> ------------------------------------------------
>
> Key: HIVE-25836
> URL: https://issues.apache.org/jira/browse/HIVE-25836
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 2.3.0, 2.3.8
> Reporter: Yao Guangdong
> Assignee: Yao Guangdong
> Priority: Critical
> Attachments: HIVE-25836.0001.patch
>
>
> A Tez UNION ALL operation can produce duplicate data in some cases. Tez
> UNION ALL can create subdirectories, named with numbers, under the table or
> partition directory, and the result data files are stored in those
> subdirectories. If a speculative task attempt runs, its result file is
> written to the same subdirectory. When the job finishes, the Hive client
> deletes the files of duplicate task attempts, but it only checks one
> directory level. Because the files sit inside subdirectories, the duplicate
> attempts' files there are not deleted, and duplicate data results.
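
The cleanup described above can be sketched as a recursive pass over the
output directory. This is only an illustration on a local file system, not
Hive's actual code and not the attached patch. It assumes output files are
named <taskId>_<attemptId> (for example 000000_0 and 000000_1), and it keeps
the largest file per task, which is a policy chosen for this sketch. The class
and method names (DuplicateAttemptCleaner, removeDuplicates, taskIdOf) are
hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: removes duplicate task-attempt files at every
// directory level, not only directly under the table or partition directory.
class DuplicateAttemptCleaner {

    // Assumption: file names look like <taskId>_<attemptId>, e.g. 000000_1.
    // Names without such a suffix are treated as their own task ID.
    static String taskIdOf(Path file) {
        String name = file.getFileName().toString();
        int idx = name.lastIndexOf('_');
        return idx <= 0 ? name : name.substring(0, idx);
    }

    // Returns the list of files that were deleted.
    static List<Path> removeDuplicates(Path dir) {
        List<Path> children;
        try (Stream<Path> stream = Files.list(dir)) {
            children = stream.sorted().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        List<Path> deleted = new ArrayList<>();
        Map<String, Path> keptByTask = new HashMap<>();
        for (Path child : children) {
            if (Files.isDirectory(child)) {
                // Descend into subdirectories such as 1/ and 2/. Skipping
                // this step is what leaves speculative attempts behind.
                deleted.addAll(removeDuplicates(child));
                continue;
            }
            String taskId = taskIdOf(child);
            Path kept = keptByTask.get(taskId);
            if (kept == null) {
                keptByTask.put(taskId, child);
                continue;
            }
            // Two attempts of the same task in the same directory:
            // keep the larger file (sketch policy), delete the other.
            Path loser = size(child) > size(kept) ? kept : child;
            if (loser.equals(kept)) {
                keptByTask.put(taskId, child);
            }
            delete(loser);
            deleted.add(loser);
        }
        return deleted;
    }

    private static long size(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void delete(Path file) {
        try {
            Files.delete(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling DuplicateAttemptCleaner.removeDuplicates on a partition directory
whose subdirectory 1/ holds both 000000_0 and 000000_1 would leave a single
file for task 000000 in that subdirectory. Duplicates are detected per
directory, so the same task ID under 1/ and 2/ is not treated as a duplicate.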
--
This message was sent by Atlassian Jira
(v8.20.1#820001)