[
https://issues.apache.org/jira/browse/HIVE-27985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chenyu Zheng updated HIVE-27985:
--------------------------------
Description:
*1 Introduction*
Hive on Tez occasionally produces duplicated files, especially when speculative
execution is enabled. Hive identifies and removes duplicate files through
removeTempOrDuplicateFiles, but this logic often does not take effect. For
example, a killed task attempt may commit files while this method is running,
or the files under HIVE_UNION_SUBDIR_X may not be recognized during a union
all. Several issues address these problems, mainly focusing on how to
identify duplicate files. *This issue instead solves the problem by avoiding
the generation of duplicate files in the first place.*
*2 How does Tez avoid duplicate files?*
After testing, I found that the Hadoop MapReduce examples and Tez examples do
not have this problem: with a properly designed OutputCommitter, duplicate
files can be avoided. Let's analyze how Tez avoids duplicate files.
{color:#172b4d} _Note: Compared with Tez, Hadoop MapReduce has one extra
commitPending step, which is not critical, so only Tez is analyzed._{color}
!how tez examples commit.png|width=778,height=483!
Let's analyze these steps:
* (1) {*}process records{*}: Process the input records.
* (2) {*}send canCommit request{*}: After all records are processed, call
canCommit remotely on the AM.
* (3) {*}update commitAttempt{*}: When the AM receives the canCommit request,
it checks whether another task attempt of the same task has already called
canCommit. If no other attempt has done so, it returns true; otherwise it
returns false. This ensures that only one task attempt per task commits.
* (4) {*}return canCommit response{*}: The task receives the AM's response. If
true, the attempt may commit. If false, another task attempt has already
started committing, and this attempt must not commit; it loops back to (2) and
keeps calling canCommit until it is killed or the other attempt fails.
* (5) {*}output.commit{*}: Execute the commit, i.e. rename the generated
temporary file to the final file.
* (6) {*}notify succeeded{*}: Although the task has produced the final file,
the AM still needs to be told that the work is done, so the task reports
completion of the current task attempt to the AM through the heartbeat.
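Steps (2)-(4) can be sketched as a minimal AM-side check. This is a
hypothetical simplification (class and method names are illustrative, not
Tez's actual API): the first task attempt to ask wins the commit slot, and
every later attempt is refused.

```java
import java.util.concurrent.atomic.AtomicReference;

// Minimal sketch of the AM-side canCommit semantics described in step (3).
// One "slot" per task holds the attempt id that has been granted the commit.
public class CanCommitDemo {
    private final AtomicReference<String> commitAttempt = new AtomicReference<>();

    // Grants the commit to the first attempt that calls in; a retry by the
    // same attempt is also allowed, but any other attempt is refused.
    public boolean canCommit(String attemptId) {
        return commitAttempt.compareAndSet(null, attemptId)
                || attemptId.equals(commitAttempt.get());
    }

    public static void main(String[] args) {
        CanCommitDemo am = new CanCommitDemo();
        System.out.println(am.canCommit("attempt_0")); // true: first caller wins
        System.out.println(am.canCommit("attempt_1")); // false: slot already taken
        System.out.println(am.canCommit("attempt_0")); // true: same attempt may retry
    }
}
```

The compare-and-set makes the decision atomic, which is why two speculative
attempts racing to commit cannot both get a true answer.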
There is still a gap in the steps above: if the task fails after (5) but
before (6), the AM does not know that the task attempt has completed, so it
starts a new task attempt, and the new attempt generates a new file, causing
duplication. I added code that randomly throws exceptions between (5) and (6),
and found that the Tez examples still did not produce duplicated data. Why?
Because the final file name is the same no matter which task attempt generates
it. When a new task attempt commits and finds that the final file already
exists (left by a previous attempt), it deletes it first and then renames its
own temporary file. So regardless of whether a previous attempt committed
normally, the last successful attempt clears any earlier, possibly broken,
result.
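The delete-then-rename commit can be sketched with local files (a toy model
using java.nio.file rather than Hadoop's FileSystem; paths and names are
illustrative). Because every attempt targets the same final path, a re-run
overwrites a half-committed earlier result instead of adding a second file:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the idempotent commit in step (5): every attempt renames its
// temporary file to the SAME final path, deleting any copy left behind by a
// previous attempt first.
public class IdempotentCommit {
    public static void commit(Path tmp, Path finalPath) throws IOException {
        Files.deleteIfExists(finalPath); // clear a stale commit from an earlier attempt
        Files.move(tmp, finalPath);      // rename the temporary file into place
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("commit-demo");
        Path finalPath = dir.resolve("000000_0");

        // Attempt 0 commits, then (hypothetically) dies before notifying the AM.
        commit(Files.writeString(dir.resolve("tmp_0"), "attempt 0"), finalPath);

        // Attempt 1 re-runs the task and commits over the same final name.
        commit(Files.writeString(dir.resolve("tmp_1"), "attempt 1"), finalPath);

        System.out.println(Files.readString(finalPath)); // attempt 1: one file, last writer wins
    }
}
```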
To summarize, tez-examples uses two mechanisms to avoid duplicate files:
* (1) Avoid repeated commits through canCommit. This is particularly effective
for tasks with speculative execution turned on.
* (2) Different task attempts generate the same final file name. Combined with
canCommit, this guarantees that only one file is generated in the end, and
only by a successful task attempt.
*3 Why can't Hive on Tez avoid duplicate files?*
Hive on Tez has neither of the two mechanisms used by the Tez examples.
First, Hive on Tez does not call canCommit. TezProcessor inherits from
AbstractLogicalIOProcessor, while the canCommit logic in the Tez examples
lives in SimpleMRProcessor.
Second, the file names generated by different attempts of the same task are
not the same: the first attempt of a task writes 000000_0, while the second
attempt writes 000000_1.
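The naming difference can be illustrated with a small sketch (the format
strings are illustrative, not Hive's actual implementation): Hive folds the
attempt number into the file name, whereas the Tez/MapReduce examples derive
the final name from the task id alone, so every attempt targets the same file.

```java
// Illustrative contrast between attempt-dependent and attempt-independent
// final file names for the same task.
public class FileNames {
    // Hive-style: the attempt number is part of the name, so each attempt
    // writes a different file (duplicate risk if two attempts survive).
    static String hiveStyle(int taskId, int attemptId) {
        return String.format("%06d_%d", taskId, attemptId);
    }

    // Example-style: the attempt id is ignored, so every attempt of a task
    // renames onto the same final name.
    static String exampleStyle(int taskId, int attemptId) {
        return String.format("part-%05d", taskId);
    }

    public static void main(String[] args) {
        System.out.println(hiveStyle(0, 0));    // 000000_0
        System.out.println(hiveStyle(0, 1));    // 000000_1 -> a second file appears
        System.out.println(exampleStyle(0, 0)); // part-00000
        System.out.println(exampleStyle(0, 1)); // part-00000 -> same file, no duplicate
    }
}
```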
*4 How to improve?*
Use canCommit to ensure that speculative task attempts will not commit at the
same time. (HIVE-27899)
Make the different task attempts of each task generate the same final file
name. (HIVE-27986)
> Avoid duplicate files.
> ----------------------
>
> Key: HIVE-27985
> URL: https://issues.apache.org/jira/browse/HIVE-27985
> Project: Hive
> Issue Type: Bug
> Components: Tez
> Reporter: Chenyu Zheng
> Assignee: Chenyu Zheng
> Priority: Major
> Attachments: how tez examples commit.png
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)