[
https://issues.apache.org/jira/browse/HIVE-27985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805046#comment-17805046
]
Stamatis Zampetakis commented on HIVE-27985:
--------------------------------------------
Thanks for raising this ticket [~zhengchenyu]. I went over the description and
at a high level the proposal seems reasonable. However, I am not that familiar
with these parts of the code, so I don't think I am the best person to review this.
Moreover, I am not sure if HIVE-27986 is completely safe to merge. Aren't we
risking breaking backward compatibility by renaming the final files? The PR
under HIVE-27986 has quite a few failures. Have you looked over those? Is there
something worrisome there? Can they be addressed easily?
> Avoid duplicate files.
> ----------------------
>
> Key: HIVE-27985
> URL: https://issues.apache.org/jira/browse/HIVE-27985
> Project: Hive
> Issue Type: Bug
> Components: Tez
> Reporter: Chenyu Zheng
> Assignee: Chenyu Zheng
> Priority: Major
> Attachments: how tez examples commit.png
>
>
> *1 Introduction*
> Hive on Tez occasionally produces duplicate files, especially when speculative
> execution is enabled. Hive identifies and removes duplicate files through
> removeTempOrDuplicateFiles, but this logic often does not take effect. For
> example, a killed task attempt may commit files while this method is running,
> or the files under HIVE_UNION_SUBDIR_X may not be recognized during a union
> all. Many issues try to solve these problems by focusing on how to identify
> duplicate files. *This issue instead solves the problem by avoiding the
> generation of duplicate files in the first place.*
> *2 How does Tez avoid duplicate files?*
> After testing, I found that the Hadoop MapReduce examples and the Tez examples
> do not have this problem. With a properly designed OutputCommitter, duplicate
> files can be avoided. Let's analyze how Tez avoids duplicate files.
> {color:#172b4d} _Note: Compared with Tez, Hadoop MapReduce has one extra
> commitPending step, which is not critical, so only Tez is analyzed._{color}
> !how tez examples commit.png|width=778,height=483!
>
> Let’s analyze the steps:
> * (1) {*}process records{*}: Process the input records.
> * (2) {*}send canCommit request{*}: After all records are processed, the task
> remotely calls canCommit on the AM.
> * (3) {*}update commitAttempt{*}: When the AM receives the canCommit request,
> it checks whether another attempt of the same task has already called
> canCommit. If not, it returns true; otherwise it returns false. This ensures
> that only one task attempt per task commits.
> * (4) {*}return canCommit response{*}: The task receives the AM's response.
> True means it may commit; false means another task attempt has already started
> committing and this one must not. The task loops back to (2), calling
> canCommit until it is killed or the other attempt fails.
> * (5) {*}output.commit{*}: Execute the commit, i.e. rename the generated
> temporary file to the final file.
> * (6) {*}notify succeeded{*}: Although the task has produced the final file,
> the AM still needs to know that the work is complete, so the task reports
> completion through its heartbeat.
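The commit handshake in steps (2)-(5) can be sketched as a polling loop. This is an illustrative sketch, not the real Tez API: AmClient, Output, and commitWhenAllowed are hypothetical names standing in for the task-to-AM umbilical and the output committer.

```java
// Hypothetical sketch of the canCommit loop from steps (2)-(5).
// AmClient and Output are illustrative stand-ins, not real Tez interfaces.
public class CommitLoopSketch {
    interface AmClient {
        // Step (3) happens on the AM side: return true only for the first
        // attempt of a task that asks to commit.
        boolean canCommit(String attemptId) throws InterruptedException;
    }

    interface Output {
        // Step (5): rename the temporary file to the final file.
        void commit() throws Exception;
    }

    static void commitWhenAllowed(AmClient am, Output out, String attemptId)
            throws Exception {
        // Steps (2) and (4): keep asking the AM until this attempt is the
        // chosen committer; in the real protocol the loop ends when the
        // attempt is killed or the committing attempt fails.
        while (!am.canCommit(attemptId)) {
            Thread.sleep(10);
        }
        out.commit(); // step (5); step (6) is a separate heartbeat afterwards
    }
}
```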
> There is a gap in the above steps: if the task fails after (5) but before (6),
> the AM does not know that the task attempt has completed, so it starts a new
> task attempt, which generates a new file and causes duplication. I added code
> that randomly throws exceptions between (5) and (6), and found that the Tez
> example still did not produce duplicate data. Why? Because every task attempt
> generates the same final file name. When a new task attempt commits and finds
> that the final file already exists (generated by a previous attempt), it
> deletes the file first and then renames. Regardless of whether the previous
> attempt committed normally, the last successful attempt overwrites any earlier
> result.
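The delete-then-rename commit described above can be sketched as follows. This is a simplified illustration using java.nio.file rather than the Hadoop FileSystem API; finalName and commit are hypothetical helpers, and the part-%05d naming is an assumption borrowed from MapReduce-style outputs, not Hive's actual naming.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class IdempotentCommit {
    // The final name depends only on the task index, never on the attempt
    // number, so every attempt of the same task targets the same file.
    static String finalName(int taskIndex) {
        return String.format("part-%05d", taskIndex);
    }

    static void commit(Path tempFile, Path outputDir, int taskIndex)
            throws IOException {
        Path finalFile = outputDir.resolve(finalName(taskIndex));
        // If an earlier attempt already produced the final file, delete it
        // first, then rename: the last successful attempt wins and no
        // duplicate is left behind.
        Files.deleteIfExists(finalFile);
        Files.move(tempFile, finalFile);
    }
}
```

Because the target name is attempt-independent, committing twice leaves exactly one final file, written by the last successful attempt.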
> To summarize, tez-examples uses two mechanisms to avoid duplicate files:
> * (1) Avoid repeated commits through canCommit. This is particularly
> effective when speculative execution is turned on.
> * (2) Different task attempts generate the same final file name. Combined
> with canCommit, this guarantees that only one file is generated in the end,
> and only by a successful task attempt.
> *3 Why can't Hive on Tez avoid duplicate files?*
> Hive on Tez has neither of the two mechanisms used by the Tez examples.
> First, Hive on Tez does not call canCommit. TezProcessor inherits from
> AbstractLogicalIOProcessor, while the canCommit logic of the Tez examples
> lives mainly in SimpleMRProcessor.
> Second, the file names generated by different task attempts in Hive on Tez
> are not the same: the first attempt of a task generates 000000_0, and the
> second attempt generates 000000_1.
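By contrast with the attempt-independent naming above, a Hive-style name encodes the attempt id, which is why two attempts of the same task can each leave a file behind. A small illustrative sketch (hiveStyleName is a hypothetical helper; only the %06d_%d pattern is taken from the 000000_0 / 000000_1 names mentioned here):

```java
// Illustrative: encoding the attempt id in the final name means different
// attempts of the same task write to different files, so both can survive
// unless a later cleanup pass identifies and removes the duplicate.
public class HiveStyleNames {
    static String hiveStyleName(int taskIndex, int attemptId) {
        return String.format("%06d_%d", taskIndex, attemptId);
    }
}
```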
> *4 How to improve?*
> Use canCommit to ensure that speculative task attempts do not commit at the
> same time. (HIVE-27899)
> Make different attempts of the same task generate the same final file name.
> (HIVE-27986)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)