You probably have a very low somaxconn parameter (the CentOS default is 128, 
if I remember correctly).  You can check the value under 
/proc/sys/net/core/somaxconn
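
For example, something like this (1024 is just an illustrative value; tune it 
to your connection load):

    # check the current value
    cat /proc/sys/net/core/somaxconn
    # raise it for the running kernel; add "net.core.somaxconn = 1024"
    # to /etc/sysctl.conf to make it persist across reboots
    sysctl -w net.core.somaxconn=1024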

Can you also check the value of ulimit -n?  It could be too low.
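
If it is, something along these lines raises it (65536 and the "hadoop" user 
are only examples; use whatever account runs your daemons):

    # check the current per-process open-file limit
    ulimit -n
    # make a higher limit persistent by adding these lines to
    # /etc/security/limits.conf and re-logging in:
    #   hadoop  soft  nofile  65536
    #   hadoop  hard  nofile  65536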

Raj



>________________________________
> From: Ellis H. Wilson III <el...@cse.psu.edu>
>To: common-user@hadoop.apache.org 
>Sent: Tuesday, June 19, 2012 12:32 PM
>Subject: Re: Error: Too Many Fetch Failures
> 
>On 06/19/12 13:38, Vinod Kumar Vavilapalli wrote:
>> 
>> Replies/more questions inline.
>> 
>> 
>>> I'm using Hadoop 0.23 on 50 machines, each connected with gigabit ethernet 
>>> and each having solely a single hard disk.  I am getting the following 
>>> error repeatably for the TeraSort benchmark.  TeraGen runs without error, 
>>> but TeraSort runs predictably until this error pops up between 64% and 70% 
>>> completion.  This doesn't occur on every execution, though: about one out 
>>> of four runs does complete (TeraValidate included).
>> 
>> 
>> How many containers are you running per node?
>
>Per my attached config files, I specify yarn.nodemanager.resource.memory-mb = 
>3072, and the default /seems/ to be 1024MB for both maps and reducers, so I 
>have 3 containers running per node.  I have verified in the web client that 
>this is indeed the case.  Three of these 1GB "slots" in the cluster appear to 
>be occupied by something else during the execution of TeraSort, so I specify 
>that TeraGen create 0.5TB using 441 maps (3 waves * (50 nodes * 3 container 
>slots - 3 occupied slots)), and that TeraSort use 147 reducers.  This seems 
>to give me the guarantees I had with Hadoop 1.0 that each node gets an equal 
>number of reducers, so my job doesn't drag on due to straggler reducers.
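>
>For concreteness, the invocations look roughly like this (the examples jar 
>name and HDFS paths are illustrative stand-ins; TeraGen's numeric argument 
>is the row count, and 0.5TB at 100 bytes per row is 5 billion rows):
>
>    hadoop jar hadoop-mapreduce-examples.jar teragen \
>        -Dmapreduce.job.maps=441 5000000000 /user/ellis/teragen
>    hadoop jar hadoop-mapreduce-examples.jar terasort \
>        -Dmapreduce.job.reduces=147 /user/ellis/teragen /user/ellis/terasort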
>
>> Clearly maps are getting killed because of fetch failures.  Can you look at 
>> the logs of the NodeManager where this particular map task ran?  Those may 
>> show why reducers are not able to fetch map-outputs.  It is possible that, 
>> because you have only one disk per node, some of these nodes have bad or 
>> non-functional disks, thereby causing fetch failures.
>
>I will rerun and report the exact error messages from the NodeManagers.  Can 
>you give me more specific advice on collecting logs of this sort, since, as I 
>mentioned, I'm new to doing so with the new version of Hadoop?  I have been 
>looking in /tmp/logs and hadoop/logs, but perhaps there is somewhere else to 
>look as well?
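>
>In the meantime, here is roughly how I have been grepping (paths are the two 
>directories I mentioned; adjust if yarn.nodemanager.log-dirs points 
>elsewhere):
>
>    # NodeManager daemon logs
>    grep -i "fetch\|shuffle" hadoop/logs/*nodemanager*
>    # per-container task attempt logs (stdout/stderr/syslog)
>    grep -ri "fetch" /tmp/logs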
>
>Last, I am certain this is not related to failing disks, as this exact error 
>occurs at a much higher frequency when I run Hadoop against a NAS box, which 
>is the core of my research at the moment.  Nevertheless, I posted to this 
>list instead of Dev because this run was on vanilla CentOS-5.5 machines using 
>just the HDDs within each, which should be a highly typical setup.  In 
>particular, I see these errors coming from numerous nodes all at once, and 
>the subset of nodes causing problems is not repeatable from one run to the 
>next, though the resulting error is.
>
>> If that is the case, you can either take these nodes offline or bump up 
>> mapreduce.reduce.shuffle.maxfetchfailures to tolerate the failures (the 
>> default is 10).  There are some other tweaks I can suggest if you can find 
>> more details in your logs.
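>
>For reference, I take it bumping that would just be a per-job override along 
>these lines (30 being an arbitrary example value, assuming the property is 
>read from the job configuration):
>
>    hadoop jar hadoop-mapreduce-examples.jar terasort \
>        -Dmapreduce.reduce.shuffle.maxfetchfailures=30 \
>        /user/ellis/teragen /user/ellis/terasort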
>
>I'd prefer not to bump up maxfetchfailures, and would rather fix whatever is 
>causing the fetches to fail in the first place.  This isn't a large cluster, 
>having only 50 nodes, nor are the links (1gig) or storage (1 SATA drive per 
>node) unusual relative to any normal installation.  I have to assume here 
>that I've mis-configured something :(.
>
>Best,
>
>ellis
>