If a task tracker is alive and continues sending heartbeats, but the network gets into a state in which the job tracker is unable to contact the task tracker, the node remains on the list of clients but every attempt to assign a task to that tracker will fail.
Unfortunately, it seems that Hadoop doesn't really avoid scheduling the same task over and over to that same client, even if the vast majority of nodes in the cluster are alive and kicking. After a task fails 5 times, the entire job fails. Is there any way that a bad tracker can be removed from the list of clients if its rate of failure is above a certain threshold (maybe even a number of consecutive errors), even if it is still sending heartbeats to the job tracker? I noticed that the total number of errors is tracked, and the machine is even highlighted as having a high number of errors on the machine list page of the web server...

Thanks,
Lorenzo Thione
Powerset, Inc.
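
P.S. To make the question concrete, here is a rough sketch of the kind of policy I have in mind. It is only an illustration: TrackerBlacklist, recordFailure, recordSuccess and isBlacklisted are hypothetical names, not existing Hadoop classes or methods, and the threshold value is arbitrary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names are Hadoop classes or APIs.
// Tracks consecutive task failures per tracker, independently of heartbeats.
class TrackerBlacklist {
    private final int maxConsecutiveFailures;
    private final Map<String, Integer> consecutiveFailures = new HashMap<>();

    TrackerBlacklist(int maxConsecutiveFailures) {
        if (maxConsecutiveFailures < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    // Called whenever a task assigned to this tracker fails.
    synchronized void recordFailure(String tracker) {
        consecutiveFailures.merge(tracker, 1, Integer::sum);
    }

    // Any success resets the streak, so a briefly flaky node recovers.
    synchronized void recordSuccess(String tracker) {
        consecutiveFailures.remove(tracker);
    }

    // True once the tracker has failed the threshold number of times in a row.
    synchronized boolean isBlacklisted(String tracker) {
        return consecutiveFailures.getOrDefault(tracker, 0) >= maxConsecutiveFailures;
    }
}
```

The idea would be for the scheduler to check something like isBlacklisted before handing a task to a tracker, so that a node that keeps heartbeating but cannot actually be reached stops receiving work until it succeeds again or is cleared by hand.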
