Take a look at slide 25: http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera
It describes a similar error so hopefully this will help you.

~ Minh

On Tue, Jun 19, 2012 at 10:27 AM, Ellis H. Wilson III <el...@cse.psu.edu> wrote:

> Hi all,
>
> This is my first email to the list, so feel free to be candid in your
> complaints if I'm doing something canonically uncouth in my requests for
> assistance.
>
> I'm using Hadoop 0.23 on 50 machines, each connected with gigabit Ethernet
> and each having only a single hard disk. I am getting the following error
> repeatably for the TeraSort benchmark. TeraGen runs without error, but
> TeraSort runs predictably until this error pops up between 64% and 70%
> completion. It doesn't occur on every execution: about one out of four
> times that I run the benchmark, it does run to completion (TeraValidate
> included).
>
> Error at the CLI:
>
> "12/06/10 11:17:50 INFO mapreduce.Job: map 100% reduce 64%
> 12/06/10 11:20:45 INFO mapreduce.Job: Task Id :
> attempt_1339331790635_0002_m_004337_0, Status : FAILED
> Container killed by the ApplicationMaster.
>
> Too Many fetch failures.Failing the attempt
> 12/06/10 11:21:45 WARN mapreduce.Job: Error reading task output Read timed out
> 12/06/10 11:23:06 WARN mapreduce.Job: Error reading task output Read timed out
> 12/06/10 11:23:07 INFO mapreduce.Job: Task Id :
> attempt_1339331790635_0002_m_004613_0, Status : FAILED"
>
> I am still warming up to YARN, so I'm not yet deft at collecting all the
> log files I need. But from closer inspection of the logs I could find, and
> of the machines themselves, this appears to be related to a large number
> of sockets being open concurrently, which at some point prevents further
> connections from being made from the requesting reducer to the mapper
> holding the desired data, leading the reducer to believe there is some
> error in fetching that data. These errors continue to be emitted roughly
> every 3 minutes for about 45 minutes, until at last the job dies
> completely.
>
> I have attached my *-site.xml files so that a better idea of my
> configuration is evident, and any and all suggestions or queries for more
> info are welcome. Things I have tried already, per the document I found at
> http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera:
>
> mapred.reduce.slowstart.completed.maps = 0.80 (seems to help, but it hurts
> performance since I'm the only person running on the cluster, and it
> doesn't cure the problem -- it just raises the chance of completion from
> about 1/4 to 1/3 at best)
>
> tasktracker.http.threads = 80 (the default is 40, I think, and I've tried
> this and even much higher values to no avail)
>
> Best, and Thanks in Advance,
>
> ellis
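For anyone following along, the two workarounds mentioned above would look roughly like this in mapred-site.xml. This is just a sketch with the values Ellis reports trying; note that on 0.23/YARN the old names are deprecated in favor of mapreduce.job.reduce.slowstart.completedmaps and mapreduce.tasktracker.http.threads, and the HTTP-threads knob largely stops applying once the shuffle is served by YARN's ShuffleHandler rather than a TaskTracker:

```xml
<!-- Sketch of the tuning tried in this thread; adjust property names
     to match your Hadoop version. -->
<configuration>
  <!-- Delay reducer launch until 80% of maps have finished, which reduces
       the number of concurrent shuffle connections (default is 0.05). -->
  <property>
    <name>mapreduce.job.reduce.slowstart.completedmaps</name>
    <value>0.80</value>
  </property>
  <!-- Pre-YARN knob: HTTP threads serving map output (default 40). -->
  <property>
    <name>tasktracker.http.threads</name>
    <value>80</value>
  </property>
</configuration>
```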
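If the socket-exhaustion theory above is right, a quick check on a worker node during the reduce phase should show the per-process file-descriptor limit being approached. A minimal sketch (the "NodeManager" process name is an assumption here; substitute whatever your shuffle-serving daemon is called):

```shell
# Compare the per-process open-file limit against how many descriptors the
# shuffle-serving daemon actually holds. A limit near the common default of
# 1024 would support the "too many concurrent fetches" theory.
ulimit -n                                   # open-file limit for this user

# Count fds held by the NodeManager (hypothetical process name -- adjust);
# prints 0 if no such process is running.
pid=$(pgrep -f NodeManager | head -n 1)
ls "/proc/${pid}/fd" 2>/dev/null | wc -l

# Kernel-wide socket counts; a large "TCP: inuse" figure during the shuffle
# points the same way.
cat /proc/net/sockstat
```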