[
https://issues.apache.org/jira/browse/HIVE-27985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chenyu Zheng updated HIVE-27985:
--------------------------------
Description:
*1 Introduction*
Hive on Tez occasionally produces duplicated files, especially when speculative
execution is enabled. Hive identifies and removes duplicate files through
removeTempOrDuplicateFiles, but this logic often does not take effect. For
example, a killed task attempt may commit files while this method is running,
or the files under HIVE_UNION_SUBDIR_X may not be recognized during a union
all. Several issues address these problems, mainly focusing on how to
identify duplicate files. *This issue instead solves the problem by avoiding
the generation of duplicate files in the first place.*
*2 How does Tez avoid duplicate files?*
After testing, I found that the Hadoop MapReduce examples and Tez examples do
not have this problem: with a properly designed OutputCommitter, duplicate
files can be avoided. Let's analyze how Tez avoids duplicate files.
{color:#172b4d} _Note: Compared with Tez, Hadoop MapReduce has one extra
commitPending step, which is not critical, so only Tez is analyzed._{color}
!how tez examples commit.png|width=778,height=483!
Let's analyze these steps:
* (1) {*}process records{*}: Process the input records.
* (2) {*}send canCommit request{*}: After all records are processed, call
canCommit remotely on the AM.
* (3) {*}update commitAttempt{*}: When the AM receives the canCommit request,
it checks whether another task attempt of the same task has already called
canCommit. If no other attempt has done so, it returns true; otherwise it
returns false. This ensures that only one task attempt per task commits.
* (4) {*}return canCommit response{*}: The task receives the AM's response. If
true, the attempt may commit. If false, another task attempt has already
started committing, and this attempt must not commit; it loops back to (2) and
keeps calling canCommit until it is killed or the other attempt fails.
* (5) {*}output.commit{*}: Execute the commit, i.e. rename the generated
temporary file to the final file.
* (6) {*}notify succeeded{*}: Although the task has produced the final file,
the AM still needs to be told that the work is done, so the task reports
completion of the current task attempt to the AM through the heartbeat.
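Steps (2)-(4) can be sketched as a minimal AM-side check. This is a
hypothetical simplification (class and method names are illustrative, not
Tez's actual API): the first task attempt to ask wins the commit slot, and
every later attempt is refused.

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of the AM-side canCommit semantics described in step (3).
// One "slot" per task holds the attempt id that has been granted the commit.
public class CanCommitDemo {
    private final AtomicReference<String> commitAttempt = new AtomicReference<>();

    // Grants the commit to the first attempt that calls in; a retry by the
    // same attempt is also allowed, but any other attempt is refused.
    public boolean canCommit(String attemptId) {
        return commitAttempt.compareAndSet(null, attemptId)
                || attemptId.equals(commitAttempt.get());
    }

    public static void main(String[] args) {
        CanCommitDemo am = new CanCommitDemo();
        System.out.println(am.canCommit("attempt_0")); // true: first caller wins
        System.out.println(am.canCommit("attempt_1")); // false: slot already taken
        System.out.println(am.canCommit("attempt_0")); // true: same attempt may retry
    }
}
```

The compare-and-set makes the decision atomic, which is why two speculative
attempts racing to commit cannot both get a true answer.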
There is still a gap in the steps above: if the task fails after (5) but
before (6), the AM does not know that the task attempt has completed, so it
starts a new task attempt, and the new attempt generates a new file, causing
duplication. I added code that randomly throws exceptions between (5) and (6),
and found that the Tez examples still did not produce duplicated data. Why?
Because the final file name is the same no matter which task attempt generates
it. When a new task attempt commits and finds that the final file already
exists (left by a previous attempt), it deletes it first and then renames its
own temporary file. So regardless of whether a previous attempt committed
normally, the last successful attempt clears any earlier, possibly broken,
result.
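The delete-then-rename commit can be sketched with local files (a toy model
using java.nio.file rather than Hadoop's FileSystem; paths and names are
illustrative). Because every attempt targets the same final path, a re-run
overwrites a half-committed earlier result instead of adding a second file:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the idempotent commit in step (5): every attempt renames its
// temporary file to the SAME final path, deleting any copy left behind by a
// previous attempt first.
public class IdempotentCommit {
    public static void commit(Path tmp, Path finalPath) throws IOException {
        Files.deleteIfExists(finalPath); // clear a stale commit from an earlier attempt
        Files.move(tmp, finalPath);      // rename the temporary file into place
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("commit-demo");
        Path finalPath = dir.resolve("000000_0");

        // Attempt 0 commits, then (hypothetically) dies before notifying the AM.
        commit(Files.writeString(dir.resolve("tmp_0"), "attempt 0"), finalPath);

        // Attempt 1 re-runs the task and commits over the same final name.
        commit(Files.writeString(dir.resolve("tmp_1"), "attempt 1"), finalPath);

        System.out.println(Files.readString(finalPath)); // attempt 1: one file, last writer wins
    }
}
```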
To summarize, tez-examples uses two mechanisms to avoid duplicate files:
* (1) Avoid repeated commits through canCommit. This is particularly effective
for tasks with speculative execution turned on.
* (2) Different task attempts generate the same final file name. Combined with
canCommit, this guarantees that only one file is generated in the end, and
only by a successful task attempt.
*3 Why can't Hive on Tez avoid duplicate files?*
Hive on Tez has neither of the two mechanisms used by the Tez examples.
First, Hive on Tez does not call canCommit. TezProcessor inherits from
AbstractLogicalIOProcessor, while the canCommit logic in the Tez examples
lives in SimpleMRProcessor.
Second, the file names generated by different attempts of the same task are
not the same: the first attempt of a task writes 000000_0, while the second
attempt writes 000000_1.
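The naming difference can be illustrated with a small sketch (the format
strings are illustrative, not Hive's actual implementation): Hive folds the
attempt number into the file name, whereas the Tez/MapReduce examples derive
the final name from the task id alone, so every attempt targets the same file.

```java
// Illustrative contrast between attempt-dependent and attempt-independent
// final file names for the same task.
public class FileNames {
    // Hive-style: the attempt number is part of the name, so each attempt
    // writes a different file (duplicate risk if two attempts survive).
    static String hiveStyle(int taskId, int attemptId) {
        return String.format("%06d_%d", taskId, attemptId);
    }

    // Example-style: the attempt id is ignored, so every attempt of a task
    // renames onto the same final name.
    static String exampleStyle(int taskId, int attemptId) {
        return String.format("part-%05d", taskId);
    }

    public static void main(String[] args) {
        System.out.println(hiveStyle(0, 0));    // 000000_0
        System.out.println(hiveStyle(0, 1));    // 000000_1 -> a second file appears
        System.out.println(exampleStyle(0, 0)); // part-00000
        System.out.println(exampleStyle(0, 1)); // part-00000 -> same file, no duplicate
    }
}
```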
*4 How to improve?*
Use canCommit to ensure that speculative task attempts will not commit at the
same time. (HIVE-27899)
Make the different task attempts of each task generate the same final file
name. (HIVE-27986)
> Avoid duplicate files.
> ----------------------
>
> Key: HIVE-27985
> URL: https://issues.apache.org/jira/browse/HIVE-27985
> Project: Hive
> Issue Type: Bug
> Components: Tez
> Reporter: Chenyu Zheng
> Assignee: Chenyu Zheng
> Priority: Major
> Attachments: how tez examples commit.png
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)