Github user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/17297
To recap the issue that Imran and I discussed here, I think it can be
summarized as follows:
- A Fetch Failure happens at some time t and indicates that the map output
on machine M has been lost
- Consider some running task that's read x map outputs and still needs to
process y map outputs
- Scenario A: (PRO of this PR) If the output from M was among the x outputs
that have already been read, we should keep the task running (as this PR does),
because the task already successfully fetched the output from the failed
machine. We don't do this currently, meaning we throw away work that was
already done.
- Scenario B: (CON of this PR) If the output from M was among the y outputs
that have not yet been read, then we should cancel the task, because the task
won't learn about the new location of the re-generated output from M (IIUC,
there's no functionality to do this now) and so is going to fail later on. The
current code re-runs the task, which is what we should do. This PR re-uses the
old task, which means the job will take longer to run, because the task will
fail later on and need to be restarted.
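To make the two scenarios concrete, here's a small sketch of the decision in plain Python (this is an illustrative model, not Spark's actual scheduler code; the function name and host sets are hypothetical):

```python
# Hypothetical model: a running task has fetched map outputs from some hosts
# and still has others pending when a FetchFailure reports that the map
# output on host M has been lost.

def should_keep_running(fetched_hosts, pending_hosts, failed_host):
    """Return True if the task can safely keep running (Scenario A),
    False if it should be cancelled and re-run (Scenario B)."""
    if failed_host in fetched_hosts:
        # Scenario A: the lost output was already fetched, so the
        # failure can't hurt this task -- keep it running.
        return True
    if failed_host in pending_hosts:
        # Scenario B: the task still holds a stale location for the
        # lost output and would fail later -- cancel and re-run it.
        return False
    # The failure doesn't involve any of this task's inputs.
    return True

# Example: the task has read outputs from hosts a and b, still needs c and d.
print(should_keep_running({"a", "b"}, {"c", "d"}, "a"))  # Scenario A -> True
print(should_keep_running({"a", "b"}, {"c", "d"}, "c"))  # Scenario B -> False
```

The point of the model: the PR's benefit hinges entirely on which branch the failed host falls into, which is unknown when the FetchFailure arrives.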
If my description above is correct, then this PR is assuming that scenario
A is more likely than scenario B, but it seems to me that these two scenarios
are equally likely (in which case this PR provides no net benefit).
@sitalkedia what are your thoughts here / did I miss something in my
description above?