Github user sitalkedia commented on the issue:
https://github.com/apache/spark/pull/17297
@kayousterhout - I understand your concern, and I agree that canceling the
running tasks is definitely a simpler approach, but it is very inefficient for
large jobs whose tasks can run for hours. In our environment, where fetch
failures are common, this change not only improves job performance in case of
fetch failure, it also helps reliability. If we cancel all running reducers, we
might end up in a state where jobs make no progress at all under frequent fetch
failures, because they just flip-flop between the two stages.
Compare this approach to how Hadoop handles fetch failures: it does not fail
any reducer when it detects a missing map output. The reducers just continue
processing output from the other mappers while the missing output is recomputed
concurrently. This gives Hadoop a big edge over Spark for long-running jobs
with multiple fetch failures. This change is one step towards making Spark
robust against fetch failures; eventually we would want the Hadoop model, where
no task is failed because of a missing map output. A rough sketch of that model
is below.
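For concreteness, here is a minimal, self-contained Scala sketch of the idea.
This is not Spark or Hadoop code; `MapOutput`, `fetchBlock`, and
`requestRecompute` are made-up names standing in for the shuffle fetch and the
scheduler callback. The point it illustrates is that a reducer keeps consuming
the map outputs that are available and only re-queues the missing ones, rather
than failing on the first missing output.

```scala
import scala.collection.mutable

// Hypothetical stand-in for one mapper's shuffle output.
case class MapOutput(mapId: Int, data: Seq[Int])

object FetchTolerantReducer {
  // Pretend fetch: mapIds in `lost` simulate a missing/unreachable map output.
  def fetchBlock(mapId: Int, lost: Set[Int]): Option[MapOutput] =
    if (lost.contains(mapId)) None else Some(MapOutput(mapId, Seq(mapId, mapId * 2)))

  // Placeholder for asking the scheduler to recompute a lost map output.
  def requestRecompute(mapId: Int): Unit =
    println(s"requesting recomputation of map output $mapId")

  def reduce(mapIds: Seq[Int], lost: Set[Int]): Int = {
    val pending = mutable.Queue(mapIds: _*)
    var acc = 0
    var recomputeDone = false
    while (pending.nonEmpty) {
      val mapId = pending.dequeue()
      fetchBlock(mapId, if (recomputeDone) Set.empty else lost) match {
        case Some(out) =>
          acc += out.data.sum       // keep making progress on the outputs we have
        case None =>
          requestRecompute(mapId)   // don't fail the reducer; retry this block later
          pending.enqueue(mapId)
          recomputeDone = true      // simulate the recomputation finishing
      }
    }
    acc
  }

  def main(args: Array[String]): Unit = {
    // Map output 3 is "lost"; the reducer still finishes without being failed.
    println(reduce(mapIds = 1 to 4, lost = Set(3)))
  }
}
```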
Regarding the approach, please let me know if you can think of some way to
reduce the complexity of this change.
cc @markhamstra, @rxin, @sameeragarwal