Probably unrelated to your problem, but one extreme case I've seen: a user's job had large gzip inputs (non-splittable), with 20 mappers and 800 reducers. Each map output around 20 GB, so too many reducers were hitting a single node as soon as a mapper finished.
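One way to relieve that is to throttle the shuffle in mapred-site.xml. A minimal sketch using the old (pre-0.21) property names; the values are illustrative for this scenario, not recommended defaults:

```xml
<!-- mapred-site.xml: illustrative shuffle-throttling values -->
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>1</value> <!-- one copier thread per reducer, instead of the default 5 -->
</property>
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>1.0</value> <!-- don't launch reducers until all maps have finished -->
</property>
```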
I think we tried something like

  mapred.reduce.parallel.copies=1 (to reduce the number of reducer copier threads)
  mapred.reduce.slowstart.completed.maps=1.0 (so that reducers would have 20 finished mappers to pull from, instead of 800 reducers hitting one mapper node as soon as it finishes)

Koji

On 8/19/09 11:59 PM, "Jason Venner" <jason.had...@gmail.com> wrote:
> The number one cause of this is something that causes a connection to a
> map output to fail. I have seen:
> 1) a firewall
> 2) misconfigured IP addresses (i.e., the tasktracker attempting the fetch
> received an incorrect IP address when it looked up the name of the
> tasktracker holding the map segment)
> 3) rarely, the HTTP server on the serving tasktracker is overloaded due to
> insufficient threads or listen backlog; this can happen if the number of
> fetches per reduce is large and the number of reduces or the number of maps
> is very large
>
> There are probably other cases. This recently happened to me when I had 6000
> maps and 20 reducers on a 10-node cluster, which I believe was case 3 above.
> Since I didn't actually need the reduce (I got my summary data via counters
> in the map phase), I never re-tuned the cluster.
>
> On Wed, Aug 19, 2009 at 11:25 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> I think that the problem I am remembering was due to poor recovery from
>> this problem. The underlying fault is likely poor connectivity
>> between your machines. Test that all members of your cluster can reach all
>> others on all ports used by Hadoop.
>>
>> See here for hints: http://markmail.org/message/lgafou6d434n2dvx
>>
>> On Wed, Aug 19, 2009 at 10:39 PM, yang song <hadoop.ini...@gmail.com> wrote:
>>
>>> Thank you, Ted. Upgrading the current cluster is a huge piece of work;
>>> we don't want to do that. Could you tell me in detail how 0.19.1 causes
>>> certain failures? Thanks again.
>>>
>>> 2009/8/20 Ted Dunning <ted.dunn...@gmail.com>
>>>
>>>> I think I remember something about 19.1 in which certain failures would
>>>> cause this. Consider using an updated 19 or moving to 20 as well.
>>>>
>>>> On Wed, Aug 19, 2009 at 5:19 AM, yang song <hadoop.ini...@gmail.com> wrote:
>>>>
>>>>> I'm sorry, the version is 0.19.1
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
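To rule out Jason's cases 1 and 2 (and follow Ted's advice to test connectivity between all cluster members), a rough sketch of a port sweep; the hostnames and port are placeholders to replace with your own cluster's values:

```shell
#!/bin/sh
# Rough connectivity sweep: from this node, check that every other cluster
# member's tasktracker HTTP port is reachable. 50060 is the usual default
# (mapred.task.tracker.http.address); node1..node3 are placeholders.
PORT=50060
for host in node1 node2 node3; do
  if nc -z -w 5 "$host" "$PORT" 2>/dev/null; then
    echo "$host:$PORT reachable"
  else
    echo "$host:$PORT UNREACHABLE"
  fi
done
```

Run it from each node in turn, since one-way failures (firewalls, bad DNS) are exactly what cases 1 and 2 look like. For Jason's case 3, the relevant knob is tasktracker.http.threads (default 40) in mapred-site.xml.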