[
https://issues.apache.org/jira/browse/HIVE-27985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chenyu Zheng updated HIVE-27985:
--------------------------------
Attachment: how tez examples commit.png
> Avoid duplicate files.
> ----------------------
>
> Key: HIVE-27985
> URL: https://issues.apache.org/jira/browse/HIVE-27985
> Project: Hive
> Issue Type: Bug
> Components: Tez
> Reporter: Chenyu Zheng
> Assignee: Chenyu Zheng
> Priority: Major
> Attachments: how tez examples commit.png
>
>
> 1 Background
> Hive on Tez occasionally produces duplicate files, especially when speculative
> execution is enabled. Hive identifies and removes duplicate files through
> removeTempOrDuplicateFiles. However, this logic often does not take effect.
> For example, a killed task attempt may commit files while this method is
> running, or the files under HIVE_UNION_SUBDIR_X may not be recognized during
> union all. Several existing issues address these problems, mainly by improving
> how duplicate files are identified. **This issue instead solves the problem by
> avoiding the generation of duplicate files in the first place.**
> 2 How does Tez avoid duplicate files?
> After testing, I found that the Hadoop MapReduce examples and the Tez examples
> do not have this problem. A properly designed OutputCommitter can avoid
> duplicate files. Let's analyze how Tez avoids them.
> > Compared with Tez, Hadoop MapReduce has one additional step, commitPending,
> > which is not critical, so only Tez is analyzed here.
>
>
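The idea of avoiding duplicates at commit time, rather than cleaning them up afterwards, can be illustrated with a minimal sketch. It is not the Tez or Hive implementation: every class and method name below (CommitSketch, CommitArbiter, write, finish) is hypothetical, and the file system is replaced by in-memory maps. The assumption is only that each task attempt writes to a private temporary path and that a single coordinator grants commit permission to at most one attempt per task.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; none of these names come from Tez or Hive. */
final class CommitSketch {

    /** Plays the role of the coordinator that arbitrates commits. */
    static final class CommitArbiter {
        // taskId -> the single attempt that is allowed to commit
        private final Map<String, Integer> winner = new HashMap<>();

        /** Grants permission to the first attempt that asks, and only to it. */
        synchronized boolean canCommit(String taskId, int attempt) {
            Integer granted = winner.putIfAbsent(taskId, attempt);
            return granted == null || granted == attempt;
        }
    }

    // In-memory stand-ins for the temporary and final output directories.
    static final Map<String, String> tempDir = new TreeMap<>();
    static final Map<String, String> finalDir = new TreeMap<>();

    /** An attempt writes only to its own temporary path. */
    static void write(String taskId, int attempt, String data) {
        tempDir.put("_temporary/" + taskId + "_" + attempt, data);
    }

    /**
     * Promotes the attempt's temporary output if the arbiter permits it;
     * otherwise the output is discarded. Returns true if it was committed.
     */
    static boolean finish(CommitArbiter arbiter, String taskId, int attempt) {
        String data = tempDir.remove("_temporary/" + taskId + "_" + attempt);
        if (data == null || !arbiter.canCommit(taskId, attempt)) {
            return false; // nothing written, or another attempt already won
        }
        finalDir.put(taskId, data);
        return true;
    }
}
```

With two speculative attempts of the same task, whichever calls finish first is promoted and the other is discarded, so the final directory holds one file per task and no later deduplication pass is needed.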