Github user jerryshao commented on the issue:

    https://github.com/apache/spark/pull/17113
  
    @tgravescs, thanks a lot for your comments.
    
    Actually, the issue here is a simulated one from my test cluster; I haven't received an issue report from real customers.
    
    Yes, in most cases a shuffle fetch failure is transient and can be recovered through retry, but for some small jobs the gap between failure and recovery is long enough for the job to fail even with retries.
    
    The difficult thing here is that on a fetch failure, Spark immediately aborts the stage without retrying the tasks, so unlike normal task failures we may have no chance to observe several fetch failures before deciding to blacklist. That's why in this PR I blacklist the executors/nodes at the application level immediately after a fetch failure. This solution is strict in that it will also blacklist executors whose failures are only transient, which is what I'm mainly concerned about.
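
    To make the trade-off concrete, here is a minimal standalone Scala sketch of the strict policy described above: a single fetch failure blacklists the executor and its host for the rest of the application, with no failure counting or expiry. The class, method, and demo names are made up for illustration; this is not the actual code in this PR or Spark's blacklist tracker.

    ```scala
    import scala.collection.mutable

    // Illustrative only: application-level blacklist keyed on fetch failures.
    final class FetchFailureBlacklist {
      private val blacklistedExecutors = mutable.Set.empty[String]
      private val blacklistedNodes = mutable.Set.empty[String]

      // One fetch failure is enough to blacklist for the whole application;
      // there is no counting threshold or timeout, which is exactly why
      // executors with transient failures also get blacklisted.
      def onFetchFailure(executorId: String, host: String): Unit = {
        blacklistedExecutors += executorId
        blacklistedNodes += host
      }

      def isExecutorBlacklisted(executorId: String): Boolean =
        blacklistedExecutors.contains(executorId)

      def isNodeBlacklisted(host: String): Boolean =
        blacklistedNodes.contains(host)
    }

    object FetchFailureBlacklistDemo extends App {
      val blacklist = new FetchFailureBlacklist
      // Simulate a FetchFailed result pointing at this executor/host.
      blacklist.onFetchFailure(executorId = "exec-3", host = "node-17")
      assert(blacklist.isExecutorBlacklisted("exec-3"))
      assert(blacklist.isNodeBlacklisted("node-17"))
      println("exec-3 and node-17 stay blacklisted for the rest of the application")
    }
    ```

    A real scheduler-side implementation would of course also have to unregister the lost shuffle output and decide whether to kill the blacklisted executors; the sketch only shows the zero-tolerance policy whose downside (blacklisting transient failures) is discussed above.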
    
    So I'm looking for any comments here; greatly appreciated.

