Github user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/10045#issuecomment-160879820
  
    @kayousterhout I would have preferred if this "feature" had actually been 
fixed properly by now - it was added as a temporary workaround (and so 
intentionally left undocumented, so that we could remove it later) to an 
immediate problem for which we did not have a principled solution at the time.
    
    @davies The proposed change (to infinitely blacklist an executor) is not 
sound, IMO. Task failures are usually due to transient causes - temporary 
disk or memory pressure, etc. - while infinite blacklisting bakes in assumptions 
about locality-level timeouts, the cost-benefit of retrying versus moving an 
RDD partition to another node, and so on.
    
    On the other hand, I do fairly commonly see users complaining about needing 
to blacklist nodes as opposed to executors - which is something we have seen to 
be an issue too (bad memory or disk, for example, causing most tasks on a node 
to fail).
    
    
    It would be great if the current hack were removed entirely and we had a 
better, more principled solution to this problem that handles executor, node, 
and rack (required?) blacklisting - rather than duct-taping it further. But 
since I have not been an active contributor of late, I might be oversimplifying 
the effort required :-) Something similar to what I did for task scheduling, 
perhaps?


