Jeremy Karn created HADOOP-9184:
-----------------------------------

             Summary: Some reducers failing to write final output file to s3.
                 Key: HADOOP-9184
                 URL: https://issues.apache.org/jira/browse/HADOOP-9184
             Project: Hadoop Common
          Issue Type: Bug
    Affects Versions: 0.20.2
            Reporter: Jeremy Karn
         Attachments: task_log.txt

We had a Hadoop job running 100 reducers, most of which were expected to 
write out an empty file. When the final output went to an S3 bucket, we found 
that a final part file was sometimes missing. This happened in approximately 
1 job in 3 (so approximately 1 reducer out of 300 failed to write its output 
properly). I've attached the Pig script we were using to reproduce the bug.

After an in-depth look and instrumenting the code, we traced the problem to 
moveTaskOutputs in FileOutputCommitter.

The code there looked like:

{code}
    if (fs.isFile(taskOutput)) {
        … do stuff …       
    } else if (fs.getFileStatus(taskOutput).isDir()) {
        … do stuff … 
    }
{code}

What we saw is that for the problem jobs neither branch was being taken. I've 
attached the task log from our instrumented code. In this version we added an 
else statement that printed out the line "THIS SEEMS LIKE WE SHOULD NEVER GET 
HERE …".

The root cause seems to be an eventual consistency issue with S3. You can see 
in the log that the first time moveTaskOutputs is called, it finds that 
taskOutput is a directory. It goes into the isDir() branch and successfully 
retrieves the list of files in that directory from S3 (in this case just one 
file). This triggers a recursive call to moveTaskOutputs for the file found in 
the directory. But in this pass through moveTaskOutputs the temporary output 
file can't be found, so both branches of the above if statement are skipped 
and the temporary file is never moved to the final output location.
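
The sequence described above can be modeled with a small, self-contained Java 
sketch. It does not use Hadoop's FileSystem API: FakeStore, its methods, the 
paths, and the strict flag are hypothetical names made up for illustration. It 
assumes, as the log suggests, that a directory listing can return a file whose 
own lookup then reports neither a file nor a directory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class MoveTaskOutputsSketch {

    /** Hypothetical stand-in for a file system; NOT Hadoop's FileSystem API. */
    static class FakeStore {
        // directory path -> children returned by a listing
        final Map<String, List<String>> listings = new HashMap<>();
        // files that a direct lookup can currently see
        final Set<String> visibleFiles = new HashSet<>();

        boolean isFile(String path) { return visibleFiles.contains(path); }
        boolean isDir(String path) { return listings.containsKey(path); }
        List<String> list(String path) { return listings.get(path); }
    }

    /** Same branch structure as the snippet in the report, plus an optional else. */
    static void moveTaskOutputs(FakeStore fs, String taskOutput,
                                List<String> moved, boolean strict) {
        if (fs.isFile(taskOutput)) {
            moved.add(taskOutput); // stands in for the real rename
        } else if (fs.isDir(taskOutput)) {
            for (String child : fs.list(taskOutput)) {
                moveTaskOutputs(fs, child, moved, strict);
            }
        } else if (strict) {
            // The branch the instrumented build logged as "should never get here".
            throw new IllegalStateException("neither file nor directory: " + taskOutput);
        }
        // With strict == false the path is skipped without any error.
    }

    /** Builds the inconsistent state: the file is listed but not yet visible. */
    static List<String> run(boolean strict) {
        FakeStore fs = new FakeStore();
        fs.listings.put("tmp/attempt_0", Arrays.asList("tmp/attempt_0/part-r-00000"));
        // The part file is deliberately NOT added to visibleFiles.
        List<String> moved = new ArrayList<>();
        moveTaskOutputs(fs, "tmp/attempt_0", moved, strict);
        return moved;
    }

    public static void main(String[] args) {
        System.out.println("moved (lenient): " + run(false));
        try {
            run(true);
        } catch (IllegalStateException e) {
            System.out.println("strict mode: " + e.getMessage());
        }
    }
}
```

In the lenient run nothing is moved and no error is raised, which matches the 
missing part file we observed. In the strict run the same state fails loudly 
instead of being skipped.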


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
