Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/10045#issuecomment-161061306
  
    After thinking about this (and some of the corner cases) more I would 
propose two steps here:
    
    (1) For blacklisting executors per-task, I think @davies's proposal makes 
the most sense: allow a task to run twice on a particular executor, and if it 
fails both times on that executor, blacklist the executor for that task 
permanently.  The benefit of that proposal is that it works without requiring 
the user to configure a blacklist timeout; a good timeout is very 
workload-dependent, and choosing one puts an unnecessary burden on the user.  
I think this handles the transient-failure issue relatively well, since the 
executor is only blacklisted for the particular task that failed twice, so 
there are plenty of future opportunities (with other tasks) for the executor 
to keep running things.  I do think it makes sense to have a per-task 
blacklist that's more aggressive than the more general executor/host blacklist 
mechanism.  @mridulm does this seem reasonable?
    
    (2) We still need a solution for handling executors / hosts that are more 
permanently problematic, in which case the above approach of blacklisting only 
for a particular task is too inefficient.  Let's have that discussion on #8760, 
to separate the two issues.
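
    To illustrate proposal (1), here is a minimal Python sketch (not Spark's 
actual implementation; all names and the threshold of 2 are assumptions from 
the discussion above) of a per-task executor blacklist keyed on 
(task, executor) failure counts:

    ```python
    from collections import defaultdict

    # Assumed threshold from the proposal: a task may fail twice on one
    # executor before that executor is blacklisted for that task.
    MAX_FAILURES_PER_EXECUTOR = 2

    class PerTaskBlacklist:
        """Hypothetical sketch of per-task executor blacklisting."""

        def __init__(self):
            # (task_id, executor_id) -> number of failures seen
            self.failures = defaultdict(int)

        def record_failure(self, task_id, executor_id):
            self.failures[(task_id, executor_id)] += 1

        def is_blacklisted(self, task_id, executor_id):
            # Blacklisted only for this particular task; the executor can
            # still be offered other tasks, which handles transient failures.
            return self.failures[(task_id, executor_id)] >= MAX_FAILURES_PER_EXECUTOR

    bl = PerTaskBlacklist()
    bl.record_failure("task-7", "exec-1")
    print(bl.is_blacklisted("task-7", "exec-1"))  # False: one retry still allowed
    bl.record_failure("task-7", "exec-1")
    print(bl.is_blacklisted("task-7", "exec-1"))  # True: blacklisted for task-7 only
    print(bl.is_blacklisted("task-8", "exec-1"))  # False: other tasks unaffected
    ```

    Note how no timeout configuration is needed: the blacklist entry is 
permanent for that one (task, executor) pair, while the executor stays 
available to every other task.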

