[jira] [Updated] (HIVE-27985) Avoid duplicate files.

Chenyu Zheng (Jira) Sun, 07 Jan 2024 20:45:11 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-27985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Chenyu Zheng updated HIVE-27985:
--------------------------------
    Description: 
1 Background
Hive on Tez occasionally produces duplicate files, especially when speculative 
execution is enabled. Hive identifies and removes duplicate files through 
removeTempOrDuplicateFiles, but this logic often fails to take effect. For 
example, a killed task attempt may commit files while this method is running, 
or files under HIVE_UNION_SUBDIR_X may not be recognized during a union all. 
Many existing issues address these problems, mainly by improving how duplicate 
files are identified. **This issue instead solves the problem by avoiding the 
generation of duplicate files in the first place.**


2 How Tez avoids duplicate files?

After testing, I found that the Hadoop MapReduce examples and the Tez examples 
do not have this problem. With a properly designed OutputCommitter, duplicate 
files can be avoided. Let's analyze how Tez avoids duplicate files.

> Compared with Tez, Hadoop MapReduce has one additional step, commitPending, 
> which is not critical, so only Tez is analyzed here.
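
To make the idea of avoiding duplicates at commit time concrete, here is a 
minimal sketch, assuming that a central arbiter grants commit permission to at 
most one attempt per task. The class CommitArbiterSketch, its methods, and the 
taskId_attemptId file naming are hypothetical and chosen for illustration only; 
this is not the Tez or Hive implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: at most one attempt per task is allowed to commit.
public class CommitArbiterSketch {
    // task id -> attempt id that won the right to commit
    private final Map<String, Integer> winners = new ConcurrentHashMap<>();
    // task id -> name of the single committed output file
    private final Map<String, String> committed = new ConcurrentHashMap<>();

    // The first attempt of a task to ask becomes the winner.
    // Repeated calls by the winner keep returning true; all others get false.
    public boolean canCommit(String taskId, int attemptId) {
        Integer winner = winners.computeIfAbsent(taskId, k -> attemptId);
        return winner == attemptId;
    }

    // An attempt publishes its output only if it holds commit permission.
    // A losing attempt (for example a speculative or killed one) publishes nothing.
    public boolean tryCommit(String taskId, int attemptId) {
        if (!canCommit(taskId, attemptId)) {
            return false; // loser: its temporary output would be discarded
        }
        committed.put(taskId, taskId + "_" + attemptId); // assumed naming scheme
        return true;
    }

    public Map<String, String> committedFiles() {
        return committed;
    }

    public static void main(String[] args) {
        CommitArbiterSketch arbiter = new CommitArbiterSketch();
        // The speculative attempt 1 finishes first and wins the commit.
        System.out.println("attempt 1 committed: " + arbiter.tryCommit("000000", 1));
        // The original attempt 0 finishes later and is refused.
        System.out.println("attempt 0 committed: " + arbiter.tryCommit("000000", 0));
        System.out.println("files: " + arbiter.committedFiles());
    }
}
```

Because the permission check and the publish step are tied to a single winner 
per task, a late or killed attempt cannot add a second file for the same task, 
so no cleanup pass is needed afterwards.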

 

 

> Avoid duplicate files.
> ----------------------
>
>                 Key: HIVE-27985
>                 URL: https://issues.apache.org/jira/browse/HIVE-27985
>             Project: Hive
>          Issue Type: Bug
>          Components: Tez
>            Reporter: Chenyu Zheng
>            Assignee: Chenyu Zheng
>            Priority: Major
>
> 1 Background
> Hive on Tez occasionally produces duplicate files, especially when 
> speculative execution is enabled. Hive identifies and removes duplicate files 
> through removeTempOrDuplicateFiles, but this logic often fails to take 
> effect. For example, a killed task attempt may commit files while this method 
> is running, or files under HIVE_UNION_SUBDIR_X may not be recognized during a 
> union all. Many existing issues address these problems, mainly by improving 
> how duplicate files are identified. **This issue instead solves the problem 
> by avoiding the generation of duplicate files in the first place.**
> 2 How Tez avoids duplicate files?
> After testing, I found that the Hadoop MapReduce examples and the Tez 
> examples do not have this problem. With a properly designed OutputCommitter, 
> duplicate files can be avoided. Let's analyze how Tez avoids duplicate files.
> > Compared with Tez, Hadoop MapReduce has one additional step, commitPending, 
> > which is not critical, so only Tez is analyzed here.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
