GitHub user kayousterhout commented on the issue:

    https://github.com/apache/spark/pull/15249
  
    @tgravescs @mridulm To avoid being stuck in analysis paralysis for this 
feature, I'd propose the following:
    
    (1) We merge this PR.  I think we're mostly in agreement that the behavior 
here is, for the most part, a big step in the right direction.  Mridul brought 
up some cases where there is a concern about being too eager in permanently 
blacklisting hosts / executors, but we don't know whether (and I don't think!) 
the scenarios where this is problematic are very common.
    
    (2) Someone (Imran?) email the dev list describing the new functionality 
(and change from the old behavior) and asking for feedback from folks who are 
running large clusters where failures are more common.  I think having folks 
try this out will be more helpful than us guessing when it will and won't be 
useful.  We can use the feedback to refine the behavior and decide which of the 
many proposed improvements are most important.
    
    We've been discussing pros and cons of this approach for quite a while, and 
I don't think we're going to get to a point where we think the approach is 
*perfect*.  I think we should expect to iterate on this more in the future as 
folks use it.
    
    Thoughts?

