Probably unrelated to your problem, but one extreme case I've seen: a user's job had large gzip inputs (non-splittable), with 20 mappers and 800 reducers. Each map output around 20 GB, so too many reducers were hitting a single node as soon as a mapper finished.
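One way to relieve that is to throttle the shuffle in mapred-site.xml. A minimal sketch using the old (pre-0.21) property names; the values are illustrative for this scenario, not recommended defaults:

```xml
<!-- mapred-site.xml: illustrative shuffle-throttling values -->
<property>
  <name>mapred.reduce.parallel.copies</name>
  <value>1</value> <!-- one copier thread per reducer, instead of the default 5 -->
</property>
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>1.0</value> <!-- don't launch reducers until all maps have finished -->
</property>
```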
I think we tried something like

  mapred.reduce.parallel.copies=1 (to reduce the number of reducer copier threads)
  mapred.reduce.slowstart.completed.maps=1.0 (so that reducers would have 20 finished mappers to pull from, instead of 800 reducers hitting one mapper node as soon as it finishes)

Koji

On 8/19/09 11:59 PM, "Jason Venner" <jason.had...@gmail.com> wrote:
> The number one cause of this is something that causes a connection to a
> map output to fail. I have seen:
> 1) a firewall
> 2) misconfigured IP addresses (i.e., the tasktracker attempting the fetch
> received an incorrect IP address when it looked up the name of the
> tasktracker holding the map segment)
> 3) rarely, the HTTP server on the serving tasktracker is overloaded due to
> insufficient threads or listen backlog; this can happen if the number of
> fetches per reduce is large and the number of reduces or the number of maps
> is very large
>
> There are probably other cases. This recently happened to me when I had 6000
> maps and 20 reducers on a 10-node cluster, which I believe was case 3 above.
> Since I didn't actually need the reduce (I got my summary data via counters
> in the map phase), I never re-tuned the cluster.
>
> On Wed, Aug 19, 2009 at 11:25 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> I think that the problem I am remembering was due to poor recovery from
>> this problem. The underlying fault is likely poor connectivity
>> between your machines. Test that all members of your cluster can reach all
>> others on all ports used by Hadoop.
>>
>> See here for hints: http://markmail.org/message/lgafou6d434n2dvx
>>
>> On Wed, Aug 19, 2009 at 10:39 PM, yang song <hadoop.ini...@gmail.com> wrote:
>>
>>> Thank you, Ted. Upgrading the current cluster is a huge piece of work;
>>> we don't want to do that. Could you tell me in detail how 0.19.1 causes
>>> certain failures? Thanks again.
>>>
>>> 2009/8/20 Ted Dunning <ted.dunn...@gmail.com>
>>>
>>>> I think I remember something about 19.1 in which certain failures would
>>>> cause this. Consider using an updated 19 or moving to 20 as well.
>>>>
>>>> On Wed, Aug 19, 2009 at 5:19 AM, yang song <hadoop.ini...@gmail.com> wrote:
>>>>
>>>>> I'm sorry, the version is 0.19.1
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
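To rule out Jason's cases 1 and 2 (and follow Ted's advice to test connectivity between all cluster members), a rough sketch of a port sweep; the hostnames and port are placeholders to replace with your own cluster's values:

```shell
#!/bin/sh
# Rough connectivity sweep: from this node, check that every other cluster
# member's tasktracker HTTP port is reachable. 50060 is the usual default
# (mapred.task.tracker.http.address); node1..node3 are placeholders.
PORT=50060
for host in node1 node2 node3; do
  if nc -z -w 5 "$host" "$PORT" 2>/dev/null; then
    echo "$host:$PORT reachable"
  else
    echo "$host:$PORT UNREACHABLE"
  fi
done
```

Run it from each node in turn, since one-way failures (firewalls, bad DNS) are exactly what cases 1 and 2 look like. For Jason's case 3, the relevant knob is tasktracker.http.threads (default 40) in mapred-site.xml.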