Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17088
In this particular case, are your map tasks fast or slow? If they are really
fast, rerunning everything now makes sense. If each of them took 1 hour+ to
run, failing all of them when they don't need to be failed just wastes time
and resources. Rolling upgrades can take longer than 15 seconds to restart
NMs, and you can have intermittent issues that last > 1 minute. If it took
1 hour to generate that output, I want it to retry really hard before failing
all of those. Users aren't going to tune each individual job unless they
really have to, and it might vary per stage. Really it should use cost-based
analysis on how long those tasks ran, but that gets more complicated.
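To make the cost-based idea concrete, here is a rough sketch of what such a
heuristic could look like. It is not Spark code and not a proposal for the
actual implementation: the function name, the `fraction` parameter, and the
clamping are all made up for illustration. The 15 second floor and the
10 minute ceiling are just the numbers mentioned in this thread (the restart
window under discussion and roughly how long YARN takes to report a lost node).

```python
def fetch_retry_budget_secs(map_task_secs, fraction=0.25,
                            floor_secs=15.0, ceiling_secs=600.0):
    """Hypothetical heuristic: how long to keep retrying a shuffle fetch
    before giving up and invalidating the map outputs on a host.

    map_task_secs: run times (in seconds) of the map tasks whose output
    lives on the host. All parameters are illustrative assumptions.
    """
    if not map_task_secs:
        return floor_secs
    # Map tasks rerun in parallel, so use the longest task as a rough
    # lower bound on the wall-clock cost of regenerating the output.
    rerun_cost = max(map_task_secs)
    # Wait for a fraction of that cost, clamped to [floor, ceiling].
    return min(ceiling_secs, max(floor_secs, fraction * rerun_cost))


# Fast maps: give up quickly and rerun.
print(fetch_retry_budget_secs([5, 8, 3]))
# Hour-long maps: retry hard, up to the ceiling.
print(fetch_retry_budget_secs([3600, 4000]))
```

With something like this, cheap stages would fail over quickly while
expensive stages would ride out a transient NM restart, without users having
to tune each job.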
It is also possible that some reduce tasks have already fetched the data
from those nodes and succeeded, so you wouldn't have to rerun all tasks on
that host. Due to the way Spark cancels stages on fetch failure, whether the
reduce tasks from the stage finish before the map can be rerun is very timing
dependent. You could end up doing a lot more work than necessary, so the
question is whether that is ok compared to the cost of not doing this and
allowing the application to take a bit longer. Note that if you rerun all the
maps and they didn't need to be rerun, you might cause the app to take longer
too.
How often do you see issues with node managers going down for other than
transient issues?
We very rarely see this. Most of the issues we see are transient, and we have
found with other application types (TEZ, MR) that rerunning all is worse
because it's normally a transient issue. Those application schedulers are a
bit different though, so not 100% comparable. If the whole node goes down then
YARN will inform you the node is lost. Yes, that does take like 10 minutes
though.
Are you seeing jobs fail due to this or just take longer?
I realize the job could fail if the node is really down and we get enough
failures across stages. It does seem like we should do better at this, but I'm
not sure invalidating all the outputs on a single fetch failure makes sense
either.