Jason Lowe created MAPREDUCE-4852:
-------------------------------------

             Summary: Reducer should not signal fetch failures for disk errors on the reducer's side
                 Key: MAPREDUCE-4852
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4852
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: mrv2
            Reporter: Jason Lowe


Ran across a case where a reducer ran on a node where the disks were full, 
leading to an exception like this during the shuffle fetch:

{noformat}
2012-12-05 09:07:28,749 INFO [fetcher#25] org.apache.hadoop.mapreduce.task.reduce.MergeManager: attempt_1352354913026_138167_m_000654_0: Shuffling to disk since 235056188 is greater than maxSingleShuffleLimit (155104064)
2012-12-05 09:07:28,755 INFO [fetcher#25] org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#25 failed to read map headerattempt_1352354913026_138167_m_000654_0 decomp: 235056188, 101587629
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/attempt_1352354913026_138167_r_000189_0/map_654.out
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
        at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
        at org.apache.hadoop.mapred.YarnOutputFiles.getInputFileForWrite(YarnOutputFiles.java:213)
        at org.apache.hadoop.mapreduce.task.reduce.MapOutput.<init>(MapOutput.java:81)
        at org.apache.hadoop.mapreduce.task.reduce.MergeManager.reserve(MergeManager.java:245)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:348)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:283)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:155)
2012-12-05 09:07:28,755 WARN [fetcher#25] org.apache.hadoop.mapreduce.task.reduce.Fetcher: copyMapOutput failed for tasks [attempt_1352354913026_138167_m_000654_0]
2012-12-05 09:07:28,756 INFO [fetcher#25] org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: Reporting fetch failure for attempt_1352354913026_138167_m_000654_0 to jobtracker.
{noformat}

Even though the error was local to the reducer, it reported the error as a fetch failure to the AM rather than failing the reducer itself.  It then proceeded to hit the same error for many other maps, causing them to be relaunched because of the reported fetch failures.  In this case it would have been better to fail the reducer and try another node rather than blame the mappers for an error that was on the reducer's side.
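
A rough sketch of the proposed behavior is below.  This is not a patch against the actual Fetcher/ShuffleScheduler code: the reserveMapOutput, reportFetchFailure, and failReducer helpers are hypothetical stand-ins, and only DiskChecker$DiskErrorException is the real Hadoop class (hadoop-common must be on the classpath).  The point is simply to treat a local DiskErrorException as fatal to the reduce attempt instead of charging a fetch failure against the map:

{code:java}
// Sketch only: not actual Hadoop internals.  DiskErrorException is the real
// org.apache.hadoop.util.DiskChecker.DiskErrorException (requires hadoop-common);
// everything else here is a hypothetical stand-in.
import org.apache.hadoop.util.DiskChecker.DiskErrorException;

import java.io.IOException;

public class FetchErrorHandlingSketch {

  /** Hypothetical stand-in for MergeManager.reserve(), which can throw
   *  DiskErrorException when no local dir has room for the map output. */
  private void reserveMapOutput(String mapAttemptId) throws IOException {
    throw new DiskErrorException(
        "Could not find any valid local directory for " + mapAttemptId);
  }

  /** Hypothetical: what the reducer does today -- blames the map output's host. */
  private void reportFetchFailure(String mapAttemptId) {
    System.out.println("Reporting fetch failure for " + mapAttemptId + " to the AM");
  }

  /** Hypothetical: proposed behavior -- fail this reduce attempt so the AM can
   *  reschedule it on a node with healthy disks. */
  private void failReducer(Throwable cause) {
    throw new RuntimeException("Local disk error on the reduce side", cause);
  }

  public void copyMapOutput(String mapAttemptId) {
    try {
      reserveMapOutput(mapAttemptId);
      // ... shuffle and merge the map output ...
    } catch (DiskErrorException e) {
      // The disk problem is on the reducer's node, not the mapper's, so do not
      // report a fetch failure (which would needlessly relaunch healthy maps).
      failReducer(e);
    } catch (IOException e) {
      // Genuine remote read/connect errors still count against the mapper.
      reportFetchFailure(mapAttemptId);
    }
  }
}
{code}

Whether the right fix is to fail the whole attempt or something more targeted is open for discussion; the sketch only illustrates distinguishing local disk errors from real fetch failures.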
