Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/1360#issuecomment-53163562
  
    @markhamstra In our cluster, this usually happens when one or more 
executors are in a bad state: either there is insufficient disk to finish a 
task, or the executor is in the process of cleaning up and exiting.
    When a task fails, it usually gets re-assigned to the executor where it 
just failed because of the locality match. This repeated reschedule-and-fail 
loop on the failing executor causes the application to fail once it hits the 
maximum number of failed tasks for the application, or the maximum number of 
failures for a specific task (iirc there are two params).
    
    We alleviate this by setting the blacklist timeout to a non-trivial, 
appropriate value: this prevents the rapid reschedule of a failing task on the 
same executor (and usually some other executor picks up the task - the timeout 
is chosen so that this is possible).
    If the executor is healthy but can't execute this specific task, then the 
blacklist works fine.
    If the executor is unhealthy and about to exit, then we will still see 
rapid task failures until the executor notifies the master on exit - but the 
per-task failure limit is not hit (iirc the allowed number of failed tasks for 
the app is much higher than the allowed number of failed attempts per task).
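
    As a minimal sketch of the mitigation (assuming the per-executor task 
blacklist timeout exposed as spark.scheduler.executorTaskBlacklistTime and the 
standard per-task limit spark.task.maxFailures; the values here are 
illustrative, not a recommendation):

        // Sketch: delay re-scheduling a failed task on the executor where it
        // just failed, so another executor can pick it up despite locality.
        import org.apache.spark.{SparkConf, SparkContext}

        val conf = new SparkConf()
          .setAppName("blacklist-example")
          // Keep a failing (executor, task) pair blacklisted for 60s
          // (config name assumed; the value is in milliseconds).
          .set("spark.scheduler.executorTaskBlacklistTime", "60000")
          // Allowed failed attempts per task before the job is aborted.
          .set("spark.task.maxFailures", "8")
        val sc = new SparkContext(conf)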
    
    
    Of course, I am not sure whether this is completely applicable in this case.

