The patch for hadoop-400, which is committed and will make it into hadoop-0.6.0, addresses this issue: a task is rescheduled on the node it failed on only if it has already run on every other node (i.e. only on a very small cluster). Also, hadoop-442 requests an option to exclude 'bad' nodes in the slaves file so they cannot interfere with a job; it hasn't been addressed yet.

Yoram
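Roughly, the hadoop-400 rule amounts to a check like the one below. This is only a minimal sketch, not the committed patch; TaskRetryPolicy, mayScheduleOn, and the parameter names are made up for illustration, and it assumes the task keeps a record of the trackers it has already failed on.

import java.util.Set;

// Sketch only, not actual JobTracker code: re-use a node the task has
// already failed on only when every other live node has been tried.
class TaskRetryPolicy {
    /**
     * Decide whether a task may be handed to the given tracker.
     *
     * @param failedTrackers trackers this task has already failed on
     * @param candidate      tracker currently asking for work
     * @param liveTrackers   number of live task trackers in the cluster
     */
    static boolean mayScheduleOn(Set<String> failedTrackers,
                                 String candidate,
                                 int liveTrackers) {
        if (!failedTrackers.contains(candidate)) {
            return true;                        // the task never failed here
        }
        // Only fall back to a node the task failed on when the task has
        // already been tried everywhere else, i.e. on a very small cluster.
        return failedTrackers.size() >= liveTrackers;
    }
}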
-----Original Message-----
From: Gian Lorenzo Thione [mailto:[EMAIL PROTECTED]
Sent: Friday, August 18, 2006 6:07 AM
To: [email protected]
Subject: Bad tracker...

If a task tracker is alive and keeps sending heartbeats, but the network falls into a state in which the job tracker cannot contact the task tracker, the node remains on the list of clients, yet every attempt to assign a task to that tracker fails. Unfortunately, it seems that Hadoop doesn't really avoid scheduling the same task over and over to that same client, even when the vast majority of nodes in the cluster are alive and kicking, and after a task fails 5 times the entire job fails.

Is there any way that a bad tracker can be removed from the list of clients if its rate of failure is above a certain threshold (maybe consecutive errors, even), despite it still sending heartbeats to the job tracker? I noticed that the total number of errors is tracked, and the machine is even highlighted as having a high number of errors on the machine list page of the webserver...

Thanks,
Lorenzo Thione
Powerset, Inc.
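The behaviour Lorenzo asks for boils down to counting consecutive failures per tracker and skipping a tracker once a threshold is crossed, even while it still heartbeats. A rough sketch of that bookkeeping follows; TrackerBlacklist and its methods are invented names, not an existing Hadoop API, and the threshold is chosen arbitrarily for illustration.

import java.util.HashMap;
import java.util.Map;

// Sketch only: count consecutive task failures per tracker and stop
// offering tasks once the count crosses an assumed threshold.
class TrackerBlacklist {
    private static final int MAX_CONSECUTIVE_FAILURES = 4;  // arbitrary
    private final Map<String, Integer> consecutiveFailures = new HashMap<>();

    void taskSucceeded(String tracker) {
        consecutiveFailures.put(tracker, 0);    // any success clears the streak
    }

    void taskFailed(String tracker) {
        consecutiveFailures.merge(tracker, 1, Integer::sum);
    }

    /** True once the tracker should no longer be offered tasks. */
    boolean shouldSkip(String tracker) {
        return consecutiveFailures.getOrDefault(tracker, 0) >= MAX_CONSECUTIVE_FAILURES;
    }
}

Because any success resets the count, a node that recovers would drift back into the rotation without manual intervention, while a tracker that accepts nothing (as in the scenario above) would quickly stop receiving task attempts.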
