[
https://issues.apache.org/jira/browse/HIVE-27985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chenyu Zheng updated HIVE-27985:
--------------------------------
Description:
1 Background
Hive on Tez occasionally produces duplicate files, especially when speculative
execution is enabled. Hive identifies and removes duplicate files through
removeTempOrDuplicateFiles. However, this logic often does not take effect. For
example, a killed task attempt may commit files while this method is running,
or the files under HIVE_UNION_SUBDIR_X may not be recognized during a union
all. Many existing issues address these problems, mainly by focusing on how to
identify duplicate files. **This issue instead solves the problem by avoiding
the generation of duplicate files.**
2 How does Tez avoid duplicate files?
After testing, I found that the Hadoop MapReduce examples and the Tez examples
do not have this problem. With a properly designed OutputCommitter, duplicate
files can be avoided. Let's analyze how Tez avoids duplicate files.
> Compared with Tez, Hadoop MapReduce has one additional step, commitPending,
> which is not critical here, so only Tez is analyzed.
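
The idea of avoiding duplicates through a commit step can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: CommitSketch, CommitArbiter, and the helper methods are hypothetical names, not Tez, Hadoop, or Hive APIs, and the local filesystem stands in for HDFS. Each task attempt writes into its own temporary directory, and only the one attempt per task that is granted permission moves its file into the final output directory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CommitSketch {
    /** Hypothetical coordinator: grants commit permission to one attempt per task. */
    static final class CommitArbiter {
        private final ConcurrentHashMap<String, String> winners = new ConcurrentHashMap<>();

        /** True only for the first attempt of a task that asks; other attempts are refused. */
        boolean canCommit(String taskId, String attemptId) {
            return attemptId.equals(winners.computeIfAbsent(taskId, t -> attemptId));
        }
    }

    /** Attempt-private directory, so concurrent attempts never write to the same path. */
    static Path attemptDir(Path outputDir, String attemptId) {
        return outputDir.resolve("_temporary").resolve(attemptId);
    }

    /** Writes the attempt's output under its private directory only. */
    static Path writeAttemptOutput(Path outputDir, String taskId, String attemptId, String data)
            throws IOException {
        Path dir = Files.createDirectories(attemptDir(outputDir, attemptId));
        return Files.writeString(dir.resolve("part-" + taskId), data);
    }

    /** Promotes the attempt's file if permission is granted; otherwise discards it. */
    static boolean commitOrAbort(CommitArbiter arbiter, Path outputDir, String taskId, String attemptId)
            throws IOException {
        Path tmp = attemptDir(outputDir, attemptId).resolve("part-" + taskId);
        if (arbiter.canCommit(taskId, attemptId)) {
            Files.move(tmp, outputDir.resolve("part-" + taskId), StandardCopyOption.ATOMIC_MOVE);
            return true;
        }
        Files.deleteIfExists(tmp); // losing attempt: its output never becomes visible
        return false;
    }

    /** Lists regular files directly under the output directory (the committed results). */
    static List<String> committedFiles(Path outputDir) throws IOException {
        try (Stream<Path> files = Files.list(outputDir)) {
            return files.filter(Files::isRegularFile)
                    .map(p -> p.getFileName().toString())
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("commit-sketch");
        CommitArbiter arbiter = new CommitArbiter();
        // Two attempts of the same task, as with speculative execution.
        writeAttemptOutput(out, "000000", "attempt_0", "rows");
        writeAttemptOutput(out, "000000", "attempt_1", "rows");
        // The speculative attempt asks first and wins; the original is refused.
        System.out.println("attempt_1 committed: " + commitOrAbort(arbiter, out, "000000", "attempt_1"));
        System.out.println("attempt_0 committed: " + commitOrAbort(arbiter, out, "000000", "attempt_0"));
        System.out.println("final files: " + committedFiles(out));
    }
}
```

In this model a duplicate cannot appear in the output directory, because the decision about which attempt may commit is made before any file is moved there, rather than by cleaning up afterwards.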
> Avoid duplicate files.
> ----------------------
>
> Key: HIVE-27985
> URL: https://issues.apache.org/jira/browse/HIVE-27985
> Project: Hive
> Issue Type: Bug
> Components: Tez
> Reporter: Chenyu Zheng
> Assignee: Chenyu Zheng
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)