Take a look at slide 25: http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera
It describes a similar error so hopefully this will help you.

~ Minh

On Tue, Jun 19, 2012 at 10:27 AM, Ellis H. Wilson III <el...@cse.psu.edu> wrote:

> Hi all,
>
> This is my first email to the list, so feel free to be candid in your
> complaints if I'm doing something canonically uncouth in my requests for
> assistance.
>
> I'm using Hadoop 0.23 on 50 machines, each connected with gigabit Ethernet
> and each having only a single hard disk. I am getting the following error
> repeatably for the TeraSort benchmark. TeraGen runs without error, but
> TeraSort runs predictably until this error pops up between 64% and 70%
> completion. It doesn't occur on every execution: about one out of four
> times that I run the benchmark, it does run to completion (TeraValidate
> included).
>
> Error at the CLI:
>
> "12/06/10 11:17:50 INFO mapreduce.Job: map 100% reduce 64%
> 12/06/10 11:20:45 INFO mapreduce.Job: Task Id :
> attempt_1339331790635_0002_m_004337_0, Status : FAILED
> Container killed by the ApplicationMaster.
>
> Too Many fetch failures.Failing the attempt
> 12/06/10 11:21:45 WARN mapreduce.Job: Error reading task output Read timed out
> 12/06/10 11:23:06 WARN mapreduce.Job: Error reading task output Read timed out
> 12/06/10 11:23:07 INFO mapreduce.Job: Task Id :
> attempt_1339331790635_0002_m_004613_0, Status : FAILED"
>
> I am still warming up to YARN, so I'm not yet deft at collecting all the
> log files I need. But from closer inspection of the logs I could find, and
> of the machines themselves, this appears to be related to a large number
> of sockets being open concurrently, which at some point prevents further
> connections from being made from the requesting reducer to the mapper
> holding the desired data, leading the reducer to believe there is some
> error in fetching that data. These errors continue to be emitted roughly
> every 3 minutes for about 45 minutes, until at last the job dies
> completely.
>
> I have attached my *-site.xml files so that a better idea of my
> configuration is evident, and any and all suggestions or queries for more
> info are welcome. Things I have tried already, per the document I found at
> http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera:
>
> mapred.reduce.slowstart.completed.maps = 0.80 (seems to help, but it hurts
> performance since I'm the only person running on the cluster, and it
> doesn't cure the problem -- it just raises the chance of completion from
> about 1/4 to 1/3 at best)
>
> tasktracker.http.threads = 80 (the default is 40, I think, and I've tried
> this and even much higher values to no avail)
>
> Best, and Thanks in Advance,
>
> ellis
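For anyone following along, the two workarounds mentioned above would look roughly like this in mapred-site.xml. This is just a sketch with the values Ellis reports trying; note that on 0.23/YARN the old names are deprecated in favor of mapreduce.job.reduce.slowstart.completedmaps and mapreduce.tasktracker.http.threads, and the HTTP-threads knob largely stops applying once the shuffle is served by YARN's ShuffleHandler rather than a TaskTracker:

```xml
<!-- Sketch of the tuning tried in this thread; adjust property names
     to match your Hadoop version. -->
<configuration>
  <!-- Delay reducer launch until 80% of maps have finished, which reduces
       the number of concurrent shuffle connections (default is 0.05). -->
  <property>
    <name>mapreduce.job.reduce.slowstart.completedmaps</name>
    <value>0.80</value>
  </property>
  <!-- Pre-YARN knob: HTTP threads serving map output (default 40). -->
  <property>
    <name>tasktracker.http.threads</name>
    <value>80</value>
  </property>
</configuration>
```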
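If the socket-exhaustion theory above is right, a quick check on a worker node during the reduce phase should show the per-process file-descriptor limit being approached. A minimal sketch (the "NodeManager" process name is an assumption here; substitute whatever your shuffle-serving daemon is called):

```shell
# Compare the per-process open-file limit against how many descriptors the
# shuffle-serving daemon actually holds. A limit near the common default of
# 1024 would support the "too many concurrent fetches" theory.
ulimit -n                                   # open-file limit for this user

# Count fds held by the NodeManager (hypothetical process name -- adjust);
# prints 0 if no such process is running.
pid=$(pgrep -f NodeManager | head -n 1)
ls "/proc/${pid}/fd" 2>/dev/null | wc -l

# Kernel-wide socket counts; a large "TCP: inuse" figure during the shuffle
# points the same way.
cat /proc/net/sockstat
```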