[
https://issues.apache.org/jira/browse/MAPREDUCE-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe resolved MAPREDUCE-4852.
-----------------------------------
Resolution: Duplicate
This was fixed by MAPREDUCE-5251.
> Reducer should not signal fetch failures for disk errors on the reducer's side
> ------------------------------------------------------------------------------
>
> Key: MAPREDUCE-4852
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4852
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Reporter: Jason Lowe
>
> Ran across a case where a reducer ran on a node where the disks were full,
> leading to an exception like this during the shuffle fetch:
> {noformat}
> 2012-12-05 09:07:28,749 INFO [fetcher#25]
> org.apache.hadoop.mapreduce.task.reduce.MergeManager:
> attempt_1352354913026_138167_m_000654_0: Shuffling to disk since 235056188 is
> greater than maxSingleShuffleLimit (155104064)
> 2012-12-05 09:07:28,755 INFO [fetcher#25]
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: fetcher#25 failed to read
> map headerattempt_1352354913026_138167_m_000654_0 decomp: 235056188, 101587629
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any
> valid local directory for
> output/attempt_1352354913026_138167_r_000189_0/map_654.out
> at
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
> at
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
> at
> org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
> at
> org.apache.hadoop.mapred.YarnOutputFiles.getInputFileForWrite(YarnOutputFiles.java:213)
> at
> org.apache.hadoop.mapreduce.task.reduce.MapOutput.<init>(MapOutput.java:81)
> at
> org.apache.hadoop.mapreduce.task.reduce.MergeManager.reserve(MergeManager.java:245)
> at
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:348)
> at
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:283)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:155)
> 2012-12-05 09:07:28,755 WARN [fetcher#25]
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: copyMapOutput failed for
> tasks [attempt_1352354913026_138167_m_000654_0]
> 2012-12-05 09:07:28,756 INFO [fetcher#25]
> org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler: Reporting fetch
> failure for attempt_1352354913026_138167_m_000654_0 to jobtracker.
> {noformat}
> Even though the error was local to the reducer, it was reported as a
> fetch failure to the AM rather than failing the reducer itself. The reducer
> then hit the same disk error for many other maps, causing those maps to be
> relaunched because of the reported fetch failures. In this case it would have
> been better to fail the reducer and retry it on another node rather than
> blame the mappers for an error on the reducer's side.
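A minimal sketch of the idea, not the actual MAPREDUCE-5251 patch: before reporting a fetch failure, classify the exception as reducer-local (a DiskErrorException from the local dir allocator) or genuinely remote. The class and method names below are hypothetical; the stand-in exception mirrors org.apache.hadoop.util.DiskChecker.DiskErrorException.

```java
import java.io.IOException;

public class ShuffleErrorClassifier {

    /** Stand-in for org.apache.hadoop.util.DiskChecker.DiskErrorException. */
    static class DiskErrorException extends IOException {
        DiskErrorException(String msg) { super(msg); }
    }

    /**
     * Hypothetical check: a disk error (e.g. "Could not find any valid
     * local directory") means the reducer's own disks are bad, so the
     * reduce attempt should fail and retry on another node instead of
     * blaming healthy mappers with a fetch-failure report.
     */
    static boolean isReducerLocalError(IOException e) {
        return e instanceof DiskErrorException;
    }

    public static void main(String[] args) {
        IOException diskFull =
            new DiskErrorException("Could not find any valid local directory");
        IOException badFetch =
            new IOException("Connection reset by peer");

        // Local error: fail the reducer attempt itself.
        System.out.println(isReducerLocalError(diskFull));
        // Remote error: report a fetch failure for the map to the AM.
        System.out.println(isReducerLocalError(badFetch));
    }
}
```

With this split, only the second case would feed into the ShuffleScheduler's fetch-failure reporting, so full reducer-side disks would no longer cause healthy map outputs to be relaunched.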
--
This message was sent by Atlassian JIRA
(v6.2#6252)