[ 
https://issues.apache.org/jira/browse/MAPREDUCE-7448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17748946#comment-17748946
 ] 

Steve Loughran commented on MAPREDUCE-7448:
-------------------------------------------

process: can you tag with the version you are reporting on?

After a quick glance at the source, I have to agree. Sort of.

Job IDs aren't used in the paths, so the output of job attempt 0 goes into 
_temporary/0, job attempt 1 into _temporary/1, etc.
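
To make the layout concrete, here is a minimal sketch of the path rule described above. It is a simplified model in plain Java, not the Hadoop source; the class name, the output directory and the two job IDs are made up for illustration.

```java
// Simplified model of the V1 pending-directory layout: the path depends on
// the job attempt number only, never on the job ID. Not the Hadoop source.
final class PendingPaths {

    // Hypothetical helper: builds <output>/_temporary/<jobAttemptId>.
    static String jobAttemptPath(String outputDir, int jobAttemptId) {
        return outputDir + "/_temporary/" + jobAttemptId;
    }

    public static void main(String[] args) {
        // Two unrelated jobs (illustrative IDs) writing to the same destination.
        String[] jobIds = {"job_0001", "job_0002"};
        for (String jobId : jobIds) {
            // The job ID never reaches the path, so both lines print the same directory.
            System.out.println(jobId + " -> " + jobAttemptPath("/data/out", 0));
        }
    }
}
```

Both jobs resolve to the same directory for attempt 0, which is the overlap discussed below.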

If a job with a different job ID is run against the same final destination path 
*and you have configured that job to add files to the existing path*, then the 
new job will use the same _temporary path. It could pick up leftover files from 
the earlier job, except that if the earlier job's commit succeeded, its files 
will already have been renamed into the destination. So only the new files from 
the new job should be picked up.
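
A rough way to see this, using the local filesystem through java.nio rather than any Hadoop API: the sketch below "commits" by moving everything under _temporary/0 into the destination and never deletes _temporary, mimicking skipped cleanup. All class, method and file names are hypothetical, and the real committer's merge logic is considerably more involved.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Local-filesystem model of two commits sharing <dest>/_temporary/0.
final class RepeatCommitSketch {

    // Writes one pending output file under <dest>/_temporary/0.
    static void writePending(Path dest, String name) throws IOException {
        Path pending = dest.resolve("_temporary").resolve("0");
        Files.createDirectories(pending);
        Files.write(pending.resolve(name), name.getBytes(StandardCharsets.UTF_8));
    }

    // Moves every pending file into dest and returns the names moved.
    // _temporary itself is left in place, as when cleanup is skipped.
    static List<String> commit(Path dest) throws IOException {
        Path pending = dest.resolve("_temporary").resolve("0");
        List<String> moved = new ArrayList<>();
        if (!Files.isDirectory(pending)) {
            return moved;
        }
        List<Path> files;
        try (Stream<Path> listing = Files.list(pending)) {
            files = listing.sorted().collect(Collectors.toList());
        }
        for (Path file : files) {
            Files.move(file, dest.resolve(file.getFileName()));
            moved.add(file.getFileName().toString());
        }
        return moved;
    }

    // Runs two "jobs" against the same destination; returns what each commit moved.
    static List<List<String>> demo() {
        try {
            Path dest = Files.createTempDirectory("out");
            writePending(dest, "part-a");
            List<String> first = commit(dest);
            writePending(dest, "part-b");
            List<String> second = commit(dest);
            return Arrays.asList(first, second);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this model the second commit succeeds and moves only part-b, because part-a was already renamed out of _temporary/0 by the first commit.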

Even so, it's not particularly safe. Putting the job ID under _temporary would 
appear to be a solution, but it requires job IDs to be unique, which 
HADOOP-17318/SPARK-33230 show is not always true.

And we are scared of making changes to that committer because it is such a 
critical piece of code.

The option to disable cleanup is only there to speed up cleanup on GCS storage 
with O(dir) deletion -I'd recommend a documentation patch warning about it.

Now
# which version of hadoop are you using?
# what filesystem? HDFS or something else?

If you are on hadoop 3.3.5, can you try the manifest committer instead? It does 
work for HDFS, even though it is optimised for cloud stores where list, rename 
and delete are a lot slower.
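
For reference, a sketch of the setting involved, written as a plain Java map so it runs without Hadoop on the classpath. The key pattern and factory class name are what I recall from the manifest committer documentation, where they are shown for the abfs and gs schemes; applying the same pattern to the hdfs scheme is an assumption here, so please check it against the docs for your release. The class and method names are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Builds the per-scheme committer factory binding as key/value pairs.
final class ManifestCommitterSettings {

    // Factory class as named in the manifest committer docs; verify for your release.
    static final String FACTORY =
        "org.apache.hadoop.mapreduce.lib.output.committer.manifest.ManifestCommitterFactory";

    // Hypothetical helper: returns the binding for one filesystem scheme.
    static Map<String, String> forScheme(String scheme) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("mapreduce.outputcommitter.factory.scheme." + scheme, FACTORY);
        return conf;
    }

    public static void main(String[] args) {
        // Assumption: the hdfs scheme is bound the same way as the documented cloud schemes.
        forScheme("hdfs").forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

The printed pair is what would go into the job configuration.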






> Inconsistent Behavior for FileOutputCommitter V1 to commit successfully many 
> times
> ----------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-7448
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7448
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: ConfX
>            Priority: Critical
>         Attachments: reproduce.sh
>
>
> h2. What happened
> I set {{mapreduce.fileoutputcommitter.cleanup.skipped=true}}, after which 
> version 1 of {{FileOutputCommitter}} can commit several times, which is 
> unexpected.
> h2. Where's the problem
> In {{{}FileOutputCommitter.commitJobInternal{}}},
> {noformat}
> if (algorithmVersion == 1) {
>   for (FileStatus stat: getAllCommittedTaskPaths(context)) {
>     mergePaths(fs, stat, finalOutput, context);
>   }
> }
>
> if (skipCleanup) {
>   LOG.info("Skip cleanup the _temporary folders under job's output " +
>       "directory in commitJob.");
> ...{noformat}
> Here, if cleanup is skipped, the _temporary folder is not deleted and the 
> _SUCCESS file is not created either, so the next {{mergePaths}} call does 
> not fail.
> h2. How to reproduce
>  # set {{{}mapreduce.fileoutputcommitter.cleanup.skipped{}}}={{{}true{}}}
>  # run 
> {{org.apache.hadoop.mapred.TestFileOutputCommitter#testCommitterWithDuplicatedCommitV1}}
> you should observe
> {noformat}
> java.lang.AssertionError: Duplicate commit successful: wrong behavior for 
> version 1.
>     at org.junit.Assert.fail(Assert.java:89)
>     at 
> org.apache.hadoop.mapred.TestFileOutputCommitter.testCommitterWithDuplicatedCommitInternal(TestFileOutputCommitter.java:295)
>     at 
> org.apache.hadoop.mapred.TestFileOutputCommitter.testCommitterWithDuplicatedCommitV1(TestFileOutputCommitter.java:269){noformat}
> For an easy reproduction, run the reproduce.sh in the attachment.
> We are happy to provide a patch if this issue is confirmed.


