On 06/19/12 14:11, Minh Duc Nguyen wrote:
Take a look at slide 25:
http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera

It describes a similar error so hopefully this will help you.

I appreciate your prompt response, Minh, but as you will notice at the end of my original email below, I mentioned that I had already seen this slide and tried two of its suggestions, to no avail. I should also note that I added entries for every node to /etc/hosts on each machine, so if this were a DNS issue that should have taken care of it. The only other proposed solution was upgrading Jetty, but I wasn't sure (sorry for the naiveté) how to tell which version of Jetty is actually in use. Any ideas? Or is this no longer an issue with Hadoop 2.0?
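
The closest I could come up with myself was to look for the Jetty jars bundled under the Hadoop install and read the version off the file names, e.g.:

  # List the Jetty jars shipped with the distribution; the version is part of
  # the file name (e.g. jetty-6.1.26.jar would suggest Jetty 6.1.26), though
  # I'm not sure this is authoritative for what the daemons actually load.
  find $HADOOP_HOME -name 'jetty*.jar'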

Best,

ellis


On Tue, Jun 19, 2012 at 10:27 AM, Ellis H. Wilson III <el...@cse.psu.edu> wrote:
Hi all,

This is my first email to the list, so feel free to be candid in your
complaints if I'm doing something canonically uncouth in my requests for
assistance.

I'm using Hadoop 0.23 on 50 machines, each connected with gigabit ethernet
and each having only a single hard disk.  I am getting the following error
repeatedly for the TeraSort benchmark.  TeraGen runs without error, but
TeraSort runs predictably until this error pops up between 64% and 70%
completion.  It doesn't occur on every execution of the benchmark: about one
run in four completes successfully (TeraValidate included).
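
For context, I am driving the benchmark with the stock examples jar, roughly
like the following; the row count and paths are placeholders rather than my
exact values, and the jar path assumes the standard 0.23 tarball layout:

  # Placeholder row count (10 billion 100-byte rows, roughly 1 TB) and HDFS
  # paths, not my exact values.
  EXAMPLES_JAR=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar
  hadoop jar $EXAMPLES_JAR teragen 10000000000 /benchmarks/tera-in
  hadoop jar $EXAMPLES_JAR terasort /benchmarks/tera-in /benchmarks/tera-out
  hadoop jar $EXAMPLES_JAR teravalidate /benchmarks/tera-out /benchmarks/tera-report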

Error at the CLI:
"12/06/10 11:17:50 INFO mapreduce.Job:  map 100% reduce 64%
12/06/10 11:20:45 INFO mapreduce.Job: Task Id :
attempt_1339331790635_0002_m_004337_0, Status : FAILED
Container killed by the ApplicationMaster.

Too Many fetch failures.Failing the attempt
12/06/10 11:21:45 WARN mapreduce.Job: Error reading task output Read timed
out
12/06/10 11:23:06 WARN mapreduce.Job: Error reading task output Read timed
out
12/06/10 11:23:07 INFO mapreduce.Job: Task Id :
attempt_1339331790635_0002_m_004613_0, Status : FAILED"

I am still warming up to YARN, so I am not yet deft at collecting all the
logfiles I need, but from closer inspection of the logs I could find and of
the machines themselves, it seems this is related to a large number of
sockets being open concurrently, which at some point prevents further
connections from being made from the requesting reduce task to the map task
holding the desired data, leading the reducer to believe there is some error
in fetching that data.  These errors continue to be spewed about once every
3 minutes for roughly 45 minutes, until at last the job dies completely.
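
For what it's worth, this is roughly how I have been eyeballing it on a
worker node while the shuffle is running; I am not certain these are the
right counters to watch:

  # Open file descriptor limit for the user the daemons run as
  ulimit -n
  # Rough count of established TCP connections on the node
  netstat -ant | grep -c ESTABLISHED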

I have attached my -site.xml files to give a better idea of my
configuration, and any and all suggestions or queries for more info are
welcome.  Things I have already tried, per the document I found at
http://www.slideshare.net/cloudera/hadoop-troubleshooting-101-kate-ting-cloudera:

mapred.reduce.slowstart.completed.maps = 0.80 (seems to help, but it hurts
performance since I'm the only person running on the cluster, and it doesn't
cure the problem -- it just increases the chance of completion from 1/4 to
1/3 at best)

tasktracker.http.threads = 80 (the default is 40, I think, and I've tried
this and even much higher values to no avail)
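
Both of those are in the attached -site.xml files.  I believe the job-side
settings, like the slowstart fraction, can also be overridden per run rather
than cluster-wide, e.g. (same placeholder paths as above):

  EXAMPLES_JAR=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar
  # The -D generic option assumes terasort goes through ToolRunner, which I
  # believe it does in 0.23.
  hadoop jar $EXAMPLES_JAR terasort \
      -Dmapred.reduce.slowstart.completed.maps=0.80 \
      /benchmarks/tera-in /benchmarks/tera-out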

Best, and Thanks in Advance,

ellis

