Hi,

If this is the problem you are getting, I think the solution below will help you; I ran into the same thing earlier and am sharing my experience.
Issue: While running a job, the map tasks complete properly. However, when
the reduce phase begins, it works for some time, up to some percentage, but
then, while copying the map outputs from the other machines in the shuffle
phase, it throws an exception saying there was a shuffle error because the
connection was refused.

In brief: the reduce tasks have to collect the output of the map tasks
before they can sort that output and run your reduce class. This is called
the fetch. The JobTracker passes the hostnames of the machines that ran the
map tasks to the reduce task. These hostnames must resolve to the correct
IP address of the machine that ran each map task, and the reduce task must
be able to connect to that machine on the TaskTracker's HTTP port (50060 by
default) to request the data stream.

From the log file:

org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2
        at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:124)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:362)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:217)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:742)
        at org.apache.hadoop.mapred.Child.main(Child.java:211)
Caused by: java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
        at org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler.checkReducerHealth(ShuffleScheduler.java:253)
        at org.apache.hadoop.mapreduce.task.reduce.ShuffleScheduler.copyFailed(ShuffleScheduler.java:187)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:227)
        at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:149)

Fix or solution: add an entry for every hostname in the cluster to the
/etc/hosts file on each node, so that all of them resolve correctly. (A
sketch of what those entries might look like, and a quick way to verify
resolution, is at the end of this message.)

Thanks & Regards,
Rajesh Putta
M Tech CSE
IIIT-H

On Tue, Jul 19, 2011 at 4:30 AM, Arun C Murthy <a...@hortonworks.com> wrote:
>
> On Jul 18, 2011, at 3:02 PM, Geoffry Roberts wrote:
>
> > All,
> >
> > I am getting the following errors during my MR jobs (see below).
> > Ultimately the jobs finish well enough, but these errors do slow things
> > down. I've done some reading and I understand that this is all caused
> > by failures in my network. Is there a way of determining which node(s)
> > in my cluster are causing the problem?
>
> The TT running on 'localhost' ran attempt_201107180916_0030_m_000003_0
> whose output couldn't be fetched. Take a look at the TT logs and see what
> you find.
>
> Arun
>
> > Thanks
> >
> > 11/07/18 14:53:06 INFO mapreduce.Job:  map 99% reduce 28%
> > 11/07/18 14:53:10 INFO mapreduce.Job:  map 100% reduce 28%
> > 11/07/18 14:53:15 INFO mapreduce.Job: Task Id :
> > attempt_201107180916_0030_m_000003_0, Status : FAILED
> > Too many fetch-failures
> > 11/07/18 14:53:15 WARN mapreduce.Job: Error reading task output
> > http://localhost:50060/tasklog?plaintext=true&attemptid=attempt_201107180916_0030_m_000003_0&filter=stdout
> > 11/07/18 14:53:15 WARN mapreduce.Job: Error reading task output
> > http://localhost:50060/tasklog?plaintext=true&attemptid=attempt_201107180916_0030_m_000003_0&filter=stderr
> > 11/07/18 14:53:17 INFO mapreduce.Job:  map 100% reduce 29%
> > 11/07/18 14:53:19 INFO mapreduce.Job:  map 96% reduce 29%
> > 11/07/18 14:53:25 INFO mapreduce.Job:  map 98% reduce 29%
> >
> >
> > --
> > Geoffry Roberts
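P.S. To make the fix above concrete: a minimal sketch of /etc/hosts entries
for each node, assuming a three-node cluster. The hostnames and addresses
below are made up for illustration; substitute your own.

    192.168.1.10    master.hadoop.local    master
    192.168.1.11    slave1.hadoop.local    slave1
    192.168.1.12    slave2.hadoop.local    slave2

One related trap worth checking: if a node's own hostname is mapped to
127.0.0.1 or 127.0.1.1 (some distributions do this by default), the
TaskTracker can end up advertising itself as 'localhost', which other
machines cannot fetch from; note the 'localhost' in the log lines quoted
above.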
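And on Geoffry's question of finding which node is at fault: one rough way
is to confirm that every cluster hostname resolves to the same, correct IP
from every node, and to look for fetch errors in the TT logs as Arun
suggests. A sketch, where hosts.txt and the log path are assumptions for
illustration, not something from the original thread:

    # hosts.txt: an assumed file listing one cluster hostname per line
    for h in $(cat hosts.txt); do
        echo "== $h =="
        getent hosts "$h"    # the IP the local resolver returns for this host
    done

    # on each node, search the TaskTracker log for fetch-related errors
    # (the log directory and file name vary by install; this is a guess)
    grep -i fetch /var/log/hadoop/hadoop-*-tasktracker-*.log

A hostname that resolves to different addresses on different nodes, or to a
loopback address, is a likely culprit.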