Github user squito commented on the pull request:

    https://github.com/apache/spark/pull/8760#issuecomment-142683197
  
    Hi @mwws, I've been through it in more detail now; sorry if my comments 
have been a bit scattered as I worked through understanding this.  I have one 
more high-level thought about the approach:
    
    Its great that we can update yarn on the blacklisted nodes.  However, do 
the strategies also need someway of *immediately* killing the blacklisted 
executors?  Eg., say you request 5 nodes on a 100 node cluster, and in the 
early phase of your app you discover that one of the executors is failing 
repeatedly.  The change proposed lets us go to scheduling tasks directly on the 
4 good executors, but wouldn't you rather tell yarn you've got a bad executor, 
kill it, and request another one?  This could be left to the strategy to 
choose, but seems like it should be possible and in one of the provided 
implementations.  In that case it wouldn't really make sense to have an expiry 
time for the executors, since you just completely kill them, but I suppose you 
would still have an expiry time for the node blacklist.  Then you also 
naturally escalate from assuming you've got one bad executor, and then if you 
get another bad executor on the same node, you'd blacklist the entire node.
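
    To make that concrete, here is a rough sketch of what such a strategy 
could look like.  The names here (`ClusterManagerClient`, 
`KillAndEscalateStrategy`) are just illustrative, not interfaces from this 
PR; in Spark the kill/request calls would presumably go through something 
like `ExecutorAllocationClient`:

    ```scala
    import scala.collection.mutable

    // Illustrative stand-in for the cluster-manager side (kill an executor,
    // ask for a replacement); not the actual interface used by this PR.
    trait ClusterManagerClient {
      def killExecutor(executorId: String): Unit
      def requestExecutors(num: Int): Unit
    }

    // Sketch: kill a blacklisted executor immediately and request a
    // replacement; only the node-level blacklist keeps an expiry time, and a
    // second bad executor on the same host escalates to blacklisting the node.
    class KillAndEscalateStrategy(
        client: ClusterManagerClient,
        nodeBlacklistExpiryMs: Long) {

      private val badExecutorsPerHost =
        mutable.Map.empty[String, Int].withDefaultValue(0)
      private val nodeBlacklistExpiry = mutable.Map.empty[String, Long]

      def onExecutorBlacklisted(executorId: String, host: String, now: Long): Unit = {
        // No executor-level expiry needed: the bad executor is killed outright
        // and a replacement is requested from the cluster manager.
        client.killExecutor(executorId)
        client.requestExecutors(1)

        badExecutorsPerHost(host) += 1
        if (badExecutorsPerHost(host) >= 2) {
          // Second bad executor on this host: assume the node itself is bad,
          // but only blacklist it for a limited time.
          nodeBlacklistExpiry(host) = now + nodeBlacklistExpiryMs
        }
      }

      def currentNodeBlacklist(now: Long): Set[String] = {
        // Drop node entries whose expiry has passed before reporting.
        nodeBlacklistExpiry.retain { case (_, expiry) => expiry > now }
        nodeBlacklistExpiry.keySet.toSet
      }
    }
    ```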

