Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17088
> (a) even the existing behavior will make you do unnecessary work for
> transient failures and (b) this just slightly increases the amount of work
> that has to be repeated for those transient failures.
Yes, a transient fetch failure always causes some extra work because you are going to re-run some map tasks, but the question comes down to how much extra work that is. For your (b), you can't really say it only "slightly increases" the work, because it's highly dependent on the timing of when things finish, how long the maps take, and how long the reducers take.
Please correct me if I've missed something in the Spark scheduler related to this.
- The ResultTasks (let's call them reducers), say in stage 1.0, are running. One of them gets a FetchFailure. This restarts the ShuffleMapTasks for that executor in stage 0.1. If other reducers fetch fail while those maps are running, those failures get taken into account and the corresponding maps are rerun in the same stage 0.1. If a reducer happens to finish successfully while that map stage 0.1 is running, it ends up failing with a commit denied (at least with the Hadoop writer; others might not). If, however, that reducer is longer lived and finishes successfully after map stage 0.1 has finished, it counts as a success.
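To make the timing dependence concrete, here is a rough toy model (plain Scala with made-up names and numbers; this is not the actual DAGScheduler or OutputCommitCoordinator logic) of what happens to a stage 1.0 reducer that was already running when the fetch failure hit:

```scala
// Toy model only -- not the real DAGScheduler / OutputCommitCoordinator code.
// It just encodes the timing dependence described above: a stage 1.0 reducer
// that was already running when the fetch failure hit gets a commit denied if
// it finishes while map stage 0.1 is still re-running, but counts as a
// success if it finishes after stage 0.1 completes.
object FetchFailureTimingSketch {
  sealed trait Outcome
  case object Success      extends Outcome
  case object CommitDenied extends Outcome

  // Both times are measured from the fetch failure; values are hypothetical.
  def survivingReducerOutcome(mapStage01FinishSec: Double,
                              reducerFinishSec: Double): Outcome =
    if (reducerFinishSec < mapStage01FinishSec) CommitDenied
    else Success

  def main(args: Array[String]): Unit = {
    // Reducer finishes while stage 0.1 is still running -> commit denied.
    println(survivingReducerOutcome(mapStage01FinishSec = 300, reducerFinishSec = 200))
    // Reducer outlives the map re-run -> success.
    println(survivingReducerOutcome(mapStage01FinishSec = 300, reducerFinishSec = 400))
  }
}
```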
This timing-dependent completion seems very unpredictable (and perhaps a bug in the commit logic, but I would have to look more). If your maps take a very long time and your reducers also take a long time, then you don't really want to re-run the maps. Of course, once stage 1.1 starts for the reducers that didn't originally fail, if the NodeManager really is down, the new reducer tasks could fail to fetch data that the old ones had already received, so those maps would need to be re-run anyway even if the original reducers were still running and would have succeeded.
Sorry if that doesn't make sense; it is much too complicated. There is also the case where one reducer fails and you are using a static number of executors, so you have one slot to rerun the maps. If you now fail 3 maps instead of 1, it's potentially going to take 3 times as long to rerun those maps (see the quick sketch below). My point here is that I think it's very dependent on the conditions: it could be faster, it could be slower. The question is which happens more often. Generally we see more intermittent issues with the NodeManager rather than it going down fully. If they are going down, is that due to the NodeManager crashing or the entire node going down? If the entire node is going down, are we handling that wrong, or not as well as we could? If it's the NodeManager crashing, I would say you need to look and see why, as that shouldn't happen very often.
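Going back to the static-executor point above, a quick back-of-the-envelope with completely made-up numbers (nothing taken from the scheduler itself) of why invalidating 3 map outputs instead of 1 can roughly triple the re-run time when only the one failed reducer's slot is free:

```scala
// Back-of-the-envelope only, with hypothetical numbers: a static executor
// count means the single failed reducer frees exactly one slot, so the
// re-run maps are serialized on it and the total re-run time grows with
// how many map outputs were invalidated.
object MapRerunCostSketch {
  def rerunTimeSec(failedMaps: Int, avgMapTimeSec: Double, freeSlots: Int): Double =
    math.ceil(failedMaps.toDouble / freeSlots) * avgMapTimeSec

  def main(args: Array[String]): Unit = {
    // Assume each map takes ~120s and only one slot is free.
    println(rerunTimeSec(failedMaps = 1, avgMapTimeSec = 120.0, freeSlots = 1)) // 120.0
    println(rerunTimeSec(failedMaps = 3, avgMapTimeSec = 120.0, freeSlots = 1)) // 360.0, ~3x longer
  }
}
```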
I'm not sure which one would be better. Due to the way Spark schedules, I'm OK with invalidating all of them as well, but I think it would be better for us to fix this right, which to me means not throwing away the work of the reducers already running in the first stage.