Github user mridulm commented on the issue:

    https://github.com/apache/spark/pull/15249
  
    
    Thinking more, and based on what @squito mentioned, I was considering the following:
    
    Since we are primarily dealing with executors or nodes which are genuinely 'bad', as opposed to recoverable failures due to resource contention, the degenerate corner cases the existing blacklist is meant for, etc.:
    
    Can we assume that a successful task execution on a node implies a healthy node?
    What about at the executor level?
    
    The proposal is to keep the PR as-is for the most part, but:
    - Clear nodeToExecsWithFailures when a task succeeds on a node; same for nodeToBlacklistedTaskIndexes (a sketch of this follows the list).
    - I am not sure we want to also reset execToFailures for an executor (not clearing it would mean we handle the resource-starvation case implicitly, imo).
    - If possible, allow speculative tasks to be scheduled on blacklisted nodes/executors, provided countTowardsTaskFailures can be overridden to false in those cases (if not, ignore this, since it would add to the number of failures per app).
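
    A minimal sketch of the first two bullets, assuming maps shaped like the ones in this PR (class and member names here are illustrative, not the actual fields):

    ```scala
    import scala.collection.mutable

    // Hypothetical sketch only; not the PR's actual blacklist-tracking code.
    class NodeHealthSketch {
      // node -> executors that have had task failures on that node
      private val nodeToExecsWithFailures =
        mutable.HashMap.empty[String, mutable.HashSet[String]]
      // node -> task indexes currently blacklisted on that node
      private val nodeToBlacklistedTaskIndexes =
        mutable.HashMap.empty[String, mutable.HashSet[Int]]
      // executor -> failure count; deliberately NOT cleared on success, so the
      // resource-starvation case is still handled implicitly
      private val execToFailures = mutable.HashMap.empty[String, Int]

      def onTaskSuccess(node: String): Unit = {
        // A success on the node suggests earlier failures there were transient,
        // so drop the node-level blacklist state.
        nodeToExecsWithFailures.remove(node)
        nodeToBlacklistedTaskIndexes.remove(node)
      }
    }
    ```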
     
    The rationale is that successful tasks indicate past failures were transient rather than a sign of a bad node/executor. Speculative tasks would also, in effect, act as probe tasks to determine whether a blacklisted node/executor has recovered and is healthy.
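
    To make the probe idea concrete, a hypothetical decision sketch (none of these names are from the PR; countTowardsTaskFailures here is a stand-in for however the override would be plumbed through):

    ```scala
    object ProbeSketch {
      // A speculative attempt may be offered a blacklisted node; normal
      // attempts may not.
      def canOffer(node: String, speculative: Boolean, blacklisted: Set[String]): Boolean =
        speculative || !blacklisted.contains(node)

      // A failed speculative probe on a blacklisted target would not count
      // toward the app's max task failures.
      def countTowardsTaskFailures(speculative: Boolean, ranOnBlacklisted: Boolean): Boolean =
        !(speculative && ranOnBlacklisted)
    }
    ```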
    
    I hope I am not missing anything - any thoughts, @squito, @kayousterhout, @tgravescs?

