You probably have a very low somaxconn value (the default on CentOS is 128, if I remember correctly). You can check it under /proc/sys/net/core/somaxconn. Can you also check the value of ulimit -n? It could be low.
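For example, something like this shows (and, if needed, raises) both limits. The 1024 and 8192 below are just placeholder values to illustrate, not tuned recommendations, and the sysctl needs root:

    cat /proc/sys/net/core/somaxconn    # listen-backlog cap; CentOS default is 128
    sysctl -w net.core.somaxconn=1024   # raise it until the next reboot
    ulimit -n                           # per-process open-file limit
    ulimit -n 8192                      # raise it for the current shell (hard limit permitting)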
Raj

>________________________________
> From: Ellis H. Wilson III <el...@cse.psu.edu>
>To: common-user@hadoop.apache.org
>Sent: Tuesday, June 19, 2012 12:32 PM
>Subject: Re: Error: Too Many Fetch Failures
>
>On 06/19/12 13:38, Vinod Kumar Vavilapalli wrote:
>>
>> Replies/more questions inline.
>>
>>
>>> I'm using Hadoop 0.23 on 50 machines, each connected with gigabit ethernet
>>> and each having only a single hard disk. I am getting the following
>>> error repeatedly for the TeraSort benchmark. TeraGen runs without error,
>>> but TeraSort runs predictably until this error pops up between 64% and 70%
>>> completion. This doesn't occur on every execution of the benchmark;
>>> about one out of four times that I run it, the benchmark does run to
>>> completion (TeraValidate included).
>>
>>
>> How many containers are you running per node?
>
>Per my attached config files, I specify yarn.nodemanager.resource.memory-mb =
>3072, and the default /seems/ to be 1024 MB for maps and reducers, so I have
>3 containers running per node. I have verified in the web client that this is
>indeed the case. Three of these 1 GB "slots" in the cluster appear to be
>occupied by something else during the execution of TeraSort, so I specify that
>TeraGen create 0.5 TB using 441 maps (3 waves * (50 nodes * 3 container slots
>- 3 occupied slots)), and that TeraSort use 147 reducers. This seems to give
>me the guarantee I had with Hadoop 1.0 that each node gets an equal number of
>reducers, so my job doesn't drag on due to straggler reducers.
>
>> Clearly maps are getting killed because of fetch failures. Can you look at
>> the logs of the NodeManager where this particular map task ran? That may
>> have logs related to why reducers are not able to fetch map outputs. It is
>> possible that, because you have only one disk per node, some of these nodes
>> have bad or nonfunctional disks, thereby causing fetch failures.
>
>I will rerun and report the exact error messages from the NodeManagers. Can
>you give me more specific advice on collecting logs of this sort? As I
>mentioned, I'm new to doing so with the new version of Hadoop. I have been
>looking in /tmp/logs and hadoop/logs, but perhaps there is somewhere else to
>look as well?
>
>Last, I am certain this is not related to failing disks, as this exact error
>occurs at much higher frequency when I run Hadoop on a NAS box, which is the
>core of my research at the moment. Nevertheless, I posted to this list
>instead of Dev because this happened on vanilla CentOS 5.5 machines using
>just the HDDs within each, and therefore should be a highly typical setup.
>In particular, I see these errors coming from numerous nodes all at once, and
>the subset of nodes causing the problems is not repeatable from one run to
>the next, though the resulting error is.
>
>> If that is the case, you can either offline these nodes or bump up
>> mapreduce.reduce.shuffle.maxfetchfailures to tolerate these failures; the
>> default is 10. There are some other tweaks I can suggest if you can find
>> more details in your logs.
>
>I'd prefer not to bump up maxfetchfailures, and would rather fix the
>underlying issue that is causing the fetches to fail in the first place.
>This isn't a large cluster, having only 50 nodes, nor are the links (1 gig)
>or storage capabilities (1 SATA drive per node) unusual relative to any
>normal installation. I have to assume here that I've misconfigured
>something :(.
>
>Best,
>
>ellis
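For reference, a run shaped like the one described above would look roughly as follows. The examples jar name and HDFS paths are placeholders (use whatever your 0.23 build ships), and the maxfetchfailures value is purely illustrative of the per-job override Vinod mentions:

    # TeraGen: 0.5 TB = 5,000,000,000 100-byte rows; 441 maps = 3 waves * (50*3 - 3) free slots
    hadoop jar hadoop-mapreduce-examples.jar teragen \
        -Dmapreduce.job.maps=441 5000000000 /user/ellis/teragen-out

    # TeraSort: 147 reducers = one wave across the free container slots
    hadoop jar hadoop-mapreduce-examples.jar terasort \
        -Dmapreduce.job.reduces=147 /user/ellis/teragen-out /user/ellis/terasort-out

    # Diagnostic variant: tolerate more fetch failures per map (default is 10)
    hadoop jar hadoop-mapreduce-examples.jar terasort \
        -Dmapreduce.reduce.shuffle.maxfetchfailures=30 \
        -Dmapreduce.job.reduces=147 /user/ellis/teragen-out /user/ellis/terasort-out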