GitHub user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/17088
>> Rolling upgrades can take longer than 15 seconds to restart NMs. You can
have intermittent issues that last > 1 minute. If it took 1 hour to generate
that output I want it to retry really hard before failing all of those. Users
aren't going to tune each individual job unless they really have to and it
might vary per stage.
How often are you doing rolling upgrades? I really think that in those cases
we should be tuning the shuffle fetch configurations to allow for rolling
upgrades. Even without this change, the current model unregisters files on a
fetch failure, so you might already be losing a lot of work, and in the worst
case you can still lose all the files present on a host.
>> It is also possible that some reduce tasks have already fetched the data
from those nodes and succeeded and you wouldn't have to rerun all tasks on that
host.
I am not sure I get your point here, but this will not rerun reduce tasks
that have already fetched data from those nodes and succeeded.
We are seeing this issue very frequently, mainly because of node reboots. We
are trying to scale Spark to clusters with thousands of machines, and the
probability of seeing failures and reboots during long-running jobs is very
high.
>> Are you seeing jobs fail due to this or just take longer?
We are seeing both. As @squito mentioned, we only give 4 chances for a
stage to be retried, so even one node reboot can trigger 4 retries and cause
the job to fail. Even when the job gets lucky and succeeds, a fetch failure
makes it take significantly longer than expected because the stage is retried
multiple times. The way the scheduler handles retries is also not elegant
right now: it does not allow multiple attempts of a stage to run concurrently,
which is a separate issue I will address in another PR.