Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/17088
>> The ResultTasks (let's call them reducers), say in Stage 1.0, are running.
One of them gets a fetchFailure. This restarts the ShuffleMapTasks for that
executor in Stage 0.1. If other reducers fetch fail while those maps are
running, those failures get taken into account and will be rerun in the same
stage 0.1.
As per my understanding of the code, that is not the case. Currently, the
task scheduler does not allow multiple concurrent attempts of a particular
stage (see
https://github.com/sitalkedia/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L172).
So we wait until stage 0.1 finishes and rerun the failed maps in another
retry, stage 0.2. This adds significant latency to the job run.
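To illustrate the constraint, here is a minimal, self-contained model of that
guard (my own paraphrase for this discussion, not the actual Spark source): a
stage may have only one active, non-zombie attempt at a time, so a retry such
as stage 0.2 cannot be submitted while attempt 0.1 is still running.

```scala
import scala.collection.mutable

// Standalone model of the one-active-attempt-per-stage guard (illustrative only).
object StageAttemptGuard {
  final case class Attempt(stageId: Int, attemptId: Int, var isZombie: Boolean = false)

  // Active attempts keyed by stageId, then attemptId.
  private val attemptsByStage = mutable.HashMap.empty[Int, mutable.HashMap[Int, Attempt]]

  def submit(stageId: Int, attemptId: Int): Attempt = synchronized {
    val stageAttempts = attemptsByStage.getOrElseUpdate(stageId, mutable.HashMap.empty)
    val attempt = Attempt(stageId, attemptId)
    stageAttempts(attemptId) = attempt
    // Reject the submission if another attempt of the same stage is still live.
    val conflicting = stageAttempts.values.exists(a => (a ne attempt) && !a.isZombie)
    if (conflicting) {
      throw new IllegalStateException(s"more than one active attempt for stage $stageId")
    }
    attempt
  }

  def main(args: Array[String]): Unit = {
    val first = submit(stageId = 0, attemptId = 1) // stage 0.1 starts
    // submit(stageId = 0, attemptId = 2)          // would throw: 0.1 is still active
    first.isZombie = true                          // 0.1 finishes (or is marked zombie)
    submit(stageId = 0, attemptId = 2)             // only now can 0.2 be scheduled
    println("stage 0.2 submitted after 0.1 finished")
  }
}
```

Because of that guard, maps that fail after 0.1 has started have to wait for a
whole extra stage attempt instead of being folded into the running one.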
>> Generally we see more intermittent issues with the NodeManager rather than
it going down fully. If they are going down, is it due to them crashing, or is
the entire node going down? If the entire node is going down, are we handling
that wrong, or not as well as we could?
In our case, we see fetch failures for multiple reasons, such as node
reboots, disks going bad, or network issues. It's very difficult for the
cluster manager to detect these kinds of issues and inform the driver.
>> If we need something more short term, I think it would be better to wait
for at least a couple of fetch failures or a percentage of failed reducers
before invalidating all of it.
I know it's not ideal, but how about making this behavior configurable?
That is, only unregister all outputs on a host if the configuration is
enabled; otherwise, keep the existing behavior (a rough sketch is below).
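As a strawman, the gate could look like this (the config key and class names
here are placeholders I made up for illustration, not existing Spark
settings): on a FetchFailed from host H / executor E, drop all map outputs on
H when the flag is on, otherwise keep today's executor-only behavior.

```scala
// Illustrative sketch of the proposed opt-in behavior; not the actual DAGScheduler code.
final case class ShuffleOutput(host: String, executorId: String, mapId: Int)

class MapOutputRegistry(unregisterWholeHost: Boolean) {
  private var outputs = List.empty[ShuffleOutput]

  def register(out: ShuffleOutput): Unit = outputs ::= out

  // Called when a reducer reports a FetchFailed for (host, executorId).
  def handleFetchFailure(host: String, executorId: String): Unit = {
    outputs =
      if (unregisterWholeHost) outputs.filterNot(_.host == host)                // proposed: whole host
      else outputs.filterNot(o => o.host == host && o.executorId == executorId) // existing: one executor
  }

  def registered: List[ShuffleOutput] = outputs
}

object FetchFailureDemo {
  def main(args: Array[String]): Unit = {
    // In Spark this flag would come from the conf, e.g. something like
    // conf.getBoolean("spark.shuffle.unregisterOutputOnHostOnFetchFailure", false) -- placeholder name.
    val registry = new MapOutputRegistry(unregisterWholeHost = true)
    registry.register(ShuffleOutput("host1", "exec1", mapId = 0))
    registry.register(ShuffleOutput("host1", "exec2", mapId = 1))
    registry.register(ShuffleOutput("host2", "exec3", mapId = 2))

    registry.handleFetchFailure(host = "host1", executorId = "exec1")
    // Flag on: both host1 outputs are dropped; flag off: only exec1's output would be.
    println(registry.registered)
  }
}
```

Defaulting the flag to off keeps the current behavior for everyone else while
letting clusters that see whole-node failures opt in.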