Github user sitalkedia commented on the issue:

    https://github.com/apache/spark/pull/17088
  
    >> The ResultTasks (let's call them reducers), say in Stage 1.0, are running. 
One of them gets a fetch failure. This restarts the ShuffleMapTasks for that 
executor in Stage 0.1. If, while those maps are running, other reducers 
fetch-fail, those get taken into account and will be rerun in the same stage 0.1.
    
    As per my understanding of the code, that is not the case. Currently, the 
task scheduler does not allow multiple concurrent attempts of a particular stage 
(see - 
https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L172).
 So we wait until stage 0.1 finishes and then rerun the failed maps in another 
retry, stage 0.2. This adds significant latency to the job run.
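    To make that constraint concrete, here is a minimal sketch (my own simplification, not the actual TaskSchedulerImpl code) of a guard that rejects a second live attempt of the same stage:

```scala
import scala.collection.mutable

// A task-set attempt for a stage, as the scheduler sees it.
case class TaskSet(stageId: Int, attempt: Int)

class StageGuard {
  // stageId -> attempt ids that are still running (non-zombie)
  private val active = mutable.Map.empty[Int, mutable.Set[Int]]

  def submit(ts: TaskSet): Unit = {
    val attempts = active.getOrElseUpdate(ts.stageId, mutable.Set.empty[Int])
    // Refuse a second concurrent attempt: failed maps must wait for
    // attempt N to finish before they can rerun in attempt N+1.
    require(attempts.isEmpty,
      s"more than one active attempt for stage ${ts.stageId}")
    attempts += ts.attempt
  }

  def finish(ts: TaskSet): Unit =
    active.get(ts.stageId).foreach(_ -= ts.attempt)
}
```

    Under this constraint, a fetch failure arriving while stage 0.1 is still running cannot trigger stage 0.2 immediately, which is where the extra latency comes from.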
    
    
    >> Generally we see more intermittent issues with the nodemanager rather than 
it going down fully. If they are going down, is it due to them crashing or is 
the entire node going down? If the entire node is going down, are we handling 
that wrong, or not as well as we could?
    
    In our case, we see fetch failures for multiple reasons: node reboots, 
disks going bad, or network issues. It is very difficult for the cluster 
manager to detect these kinds of issues and inform the driver.
    
    >> If we need something more short-term, I think it would be better to wait 
for at least a couple of fetch failures, or a % of reducers failed, before 
invalidating all of it.
    
    I know it's not ideal, but how about making this behavior configurable? 
That is, unregister all outputs on a host only if the configuration is enabled; 
otherwise keep the existing behavior.
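    As a sketch of what I mean (the config key and function names below are purely illustrative, not an existing Spark API):

```scala
// Illustrative only: gate the aggressive invalidation behind a flag.
// The config key here is hypothetical.
def handleFetchFailure(
    conf: Map[String, String],
    host: String,
    execId: String,
    unregisterAllOutputsOnHost: String => Unit,
    unregisterOutputsOnExecutor: String => Unit): Unit = {
  val aggressive = conf
    .get("spark.shuffle.unregisterOutputOnHostOnFetchFailure") // hypothetical key
    .exists(_.toBoolean)
  if (aggressive) {
    // New, opt-in behavior: assume the whole host is bad and drop
    // every map output it served.
    unregisterAllOutputsOnHost(host)
  } else {
    // Existing behavior: only drop outputs of the failing executor.
    unregisterOutputsOnExecutor(execId)
  }
}
```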
    


