[ https://issues.apache.org/jira/browse/HADOOP-9184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13721655#comment-13721655 ]
Joydeep Sen Sarma commented on HADOOP-9184:
-------------------------------------------
One question: what the patch seems to do is assert that if a path was listed
initially (in the parent moveTaskOutputs() call), then it must be listable in
the child call (if it is neither a dir nor a file, it is effectively not
listable). What I don't understand is what guarantees that the FileSystem
accurately lists the child paths in the parent call in the first place,
assuming this failure is caused by eventual consistency issues with S3.
On a different note: changing the moveTaskOutputs signature to take a
FileStatus (of the output path) instead of a Path removes the need to throw an
exception. The parent call just passes the status down to the child, so the
two can never disagree.
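To make the suggestion concrete, here is a rough sketch of the status-passing
shape. It is not the patch and does not use the real Hadoop classes: Status and
Lister are hypothetical stand-ins for FileStatus and the one FileSystem call the
recursion needs, and the paths are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StatusPassingSketch {
    /** Hypothetical stand-in for FileStatus: a path plus its type as seen at listing time. */
    static final class Status {
        final String path;
        final boolean isDir;

        Status(String path, boolean isDir) {
            this.path = path;
            this.isDir = isDir;
        }
    }

    /** Hypothetical stand-in for the single FileSystem call the recursion relies on. */
    interface Lister {
        List<Status> listStatus(String dir);
    }

    /**
     * Recurses on the status returned by the parent listing. The child call
     * never looks the path up again, so it cannot disagree with the parent.
     */
    static void moveTaskOutputs(Lister fs, Status taskOutput, List<String> moved) {
        if (taskOutput.isDir) {
            for (Status child : fs.listStatus(taskOutput.path)) {
                moveTaskOutputs(fs, child, moved);
            }
        } else {
            // Real code would rename the file to its final location here;
            // the sketch only records which paths would be moved.
            moved.add(taskOutput.path);
        }
    }

    /** Builds a small in-memory tree and returns the paths that would be moved. */
    static List<String> demo() {
        Map<String, List<Status>> tree = new HashMap<>();
        tree.put("/tmp/attempt_0", Arrays.asList(
                new Status("/tmp/attempt_0/part-00000", false),
                new Status("/tmp/attempt_0/sub", true)));
        tree.put("/tmp/attempt_0/sub", Arrays.asList(
                new Status("/tmp/attempt_0/sub/part-00001", false)));
        Lister fs = dir -> tree.getOrDefault(dir, Collections.emptyList());

        List<String> moved = new ArrayList<>();
        moveTaskOutputs(fs, new Status("/tmp/attempt_0", true), moved);
        return moved;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Note that this only removes the parent/child disagreement. If the parent
listing itself omits a file, the sketch would skip it silently, which is the
open question above.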
> Some reducers failing to write final output file to s3.
> -------------------------------------------------------
>
> Key: HADOOP-9184
> URL: https://issues.apache.org/jira/browse/HADOOP-9184
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 0.20.2
> Reporter: Jeremy Karn
> Attachments: example.pig, HADOOP-9184-branch-0.20.patch,
> hadoop-9184.patch, task_log.txt
>
>
> We had a Hadoop job that was running 100 reducers with most of the reducers
> expected to write out an empty file. When the final output was to an S3
> bucket we were finding that sometimes we were missing a final part file.
> This was happening approximately 1 job in 3 (so approximately 1 reducer out
> of 300 was failing to output the data properly). I've attached the pig script
> we were using to reproduce the bug.
> After an in-depth look and instrumenting the code, we traced the problem to
> moveTaskOutputs in FileOutputCommitter.
> The code there looked like:
> {code}
> if (fs.isFile(taskOutput)) {
>   … do stuff …
> } else if (fs.getFileStatus(taskOutput).isDir()) {
>   … do stuff …
> }
> {code}
> And what we saw happening is that for the problem jobs neither path was being
> exercised. I've attached the task log of our instrumented code. In this
> version we added an else statement and printed out the line "THIS SEEMS LIKE
> WE SHOULD NEVER GET HERE …".
> The root cause of this seems to be an eventual consistency issue with S3.
> You can see in the log that the first time moveTaskOutputs is called it finds
> that the taskOutput is a directory. It goes into the isDir() branch and
> successfully retrieves the list of files in that directory from S3 (in this
> case just one file). This triggers a recursive call to moveTaskOutputs for
> the file found in the directory. But in this pass through moveTaskOutputs the
> temporary output file can't be found, so both branches of the above
> if statement are skipped and the temporary file is never moved to the
> final output location.
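The failure mode described in the quoted report, and the fail-loudly behaviour
attributed to the patch in the comment above, can be sketched roughly as
follows. This is an illustration under assumptions, not the attached patch:
LaggingStore is a hypothetical in-memory stand-in for a store whose lookups can
lag behind its listings, and classify is a hypothetical helper that mirrors the
two-branch check with an explicit third branch.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

class MissingBranchSketch {
    /** Hypothetical store: a path is only visible once it has been added here. */
    static final class LaggingStore {
        private final Set<String> visibleFiles = new HashSet<>();
        private final Set<String> visibleDirs = new HashSet<>();

        void addFile(String path) { visibleFiles.add(path); }
        void addDir(String path) { visibleDirs.add(path); }

        boolean isFile(String path) { return visibleFiles.contains(path); }
        boolean isDir(String path) { return visibleDirs.contains(path); }
    }

    /**
     * Mirrors the file/directory check from the report, but throws when
     * neither branch matches instead of falling through silently.
     */
    static String classify(LaggingStore fs, String taskOutput) throws IOException {
        if (fs.isFile(taskOutput)) {
            return "file";
        } else if (fs.isDir(taskOutput)) {
            return "dir";
        } else {
            // The path came from a parent listing but cannot be seen now.
            throw new IOException(
                    "Listed path is neither a file nor a directory: " + taskOutput);
        }
    }

    public static void main(String[] args) {
        LaggingStore fs = new LaggingStore();
        fs.addDir("/out");
        try {
            // "/out/part-00000" was never made visible, so this call throws.
            System.out.println(classify(fs, "/out/part-00000"));
        } catch (IOException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

Throwing here fails the task attempt rather than silently dropping a part file.
Whether a retry would then see a consistent view is not something the sketch
addresses.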