The patch for hadoop-400, which is committed and will make it into hadoop-0.6.0, addresses this issue: a task is rescheduled on the node it failed on only if it has already run on every other node (i.e. only on a very small cluster). Also, hadoop-442 requests an option to exclude 'bad' nodes in the slaves file so they cannot interfere with a job; it hasn't been addressed yet.

Yoram
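Roughly, the hadoop-400 rule amounts to a check like the one below. This is only a minimal sketch, not the committed patch; TaskRetryPolicy, mayScheduleOn, and the parameter names are made up for illustration, and it assumes the task keeps a record of the trackers it has already failed on.

import java.util.Set;

// Sketch only, not actual JobTracker code: re-use a node the task has
// already failed on only when every other live node has been tried.
class TaskRetryPolicy {
    /**
     * Decide whether a task may be handed to the given tracker.
     *
     * @param failedTrackers trackers this task has already failed on
     * @param candidate      tracker currently asking for work
     * @param liveTrackers   number of live task trackers in the cluster
     */
    static boolean mayScheduleOn(Set<String> failedTrackers,
                                 String candidate,
                                 int liveTrackers) {
        if (!failedTrackers.contains(candidate)) {
            return true;                        // the task never failed here
        }
        // Only fall back to a node the task failed on when the task has
        // already been tried everywhere else, i.e. on a very small cluster.
        return failedTrackers.size() >= liveTrackers;
    }
}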
-----Original Message-----
From: Gian Lorenzo Thione [mailto:[EMAIL PROTECTED]
Sent: Friday, August 18, 2006 6:07 AM
To: [email protected]
Subject: Bad tracker...

If a task tracker is alive and keeps sending heartbeats, but the network falls into a state in which the job tracker cannot contact the task tracker, the node remains on the list of clients, yet every attempt to assign a task to that tracker fails. Unfortunately, it seems that Hadoop doesn't really avoid scheduling the same task over and over to that same client, even when the vast majority of nodes in the cluster are alive and kicking, and after a task fails 5 times the entire job fails.

Is there any way that a bad tracker can be removed from the list of clients if its rate of failure is above a certain threshold (maybe consecutive errors, even), despite it still sending heartbeats to the job tracker? I noticed that the total number of errors is tracked, and the machine is even highlighted as having a high number of errors on the machine list page of the webserver...

Thanks,
Lorenzo Thione
Powerset, Inc.
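The behaviour Lorenzo asks for boils down to counting consecutive failures per tracker and skipping a tracker once a threshold is crossed, even while it still heartbeats. A rough sketch of that bookkeeping follows; TrackerBlacklist and its methods are invented names, not an existing Hadoop API, and the threshold is chosen arbitrarily for illustration.

import java.util.HashMap;
import java.util.Map;

// Sketch only: count consecutive task failures per tracker and stop
// offering tasks once the count crosses an assumed threshold.
class TrackerBlacklist {
    private static final int MAX_CONSECUTIVE_FAILURES = 4;  // arbitrary
    private final Map<String, Integer> consecutiveFailures = new HashMap<>();

    void taskSucceeded(String tracker) {
        consecutiveFailures.put(tracker, 0);    // any success clears the streak
    }

    void taskFailed(String tracker) {
        consecutiveFailures.merge(tracker, 1, Integer::sum);
    }

    /** True once the tracker should no longer be offered tasks. */
    boolean shouldSkip(String tracker) {
        return consecutiveFailures.getOrDefault(tracker, 0) >= MAX_CONSECUTIVE_FAILURES;
    }
}

Because any success resets the count, a node that recovers would drift back into the rotation without manual intervention, while a tracker that accepts nothing (as in the scenario above) would quickly stop receiving task attempts.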
