Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1360#issuecomment-53163562

@markhamstra In our cluster, this usually happens when one or more executors are in a bad state: either there is insufficient disk to finish a task, or the executor is in the process of cleaning up and exiting. When a task fails, it typically gets re-assigned to the same executor it just failed on because of the locality match. This repeated reschedule-and-fail loop on the failing executor causes the application to fail once it hits either the maximum number of failed tasks for the application or the maximum number of failed attempts for a specific task (iirc there are two params).

We alleviate this by setting the blacklist timeout to a non-trivial, appropriate value: this prevents the rapid reschedule of a failing task on that executor, and usually some other executor picks up the task (the timeout is chosen so that this is possible). If the executor is healthy but cannot execute this specific task, the blacklist works fine. If the executor is unhealthy and going to exit, we will still see rapid task failures until the executor notifies the master when it exits, but the per-task failure count is not hit (iirc the number of failed tasks allowed for the application is much higher than the number of failed attempts allowed per task).

Of course, I am not sure whether this is completely applicable in this case.
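For concreteness, a minimal sketch of the mitigation described above, assuming the Spark 1.x-era config names: `spark.scheduler.executorTaskBlacklistTime` for the per-task, per-executor blacklist timeout (my recollection of the exact name may be off) and `spark.task.maxFailures` for the per-task attempt cap. The values are illustrative, not recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("blacklist-timeout-example")
  // Assumed Spark 1.x config: keep a failed task off the executor it
  // just failed on for 10s, long enough for another executor to pick
  // it up despite the locality preference.
  .set("spark.scheduler.executorTaskBlacklistTime", "10000")
  // Per-task attempt cap; the job aborts once any single task has
  // failed this many times.
  .set("spark.task.maxFailures", "8")

val sc = new SparkContext(conf)
```

The timeout only needs to be long enough that, on a busy cluster, some other executor becomes free and takes the task before the blacklist expires.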