Github user mridulm commented on the pull request:
https://github.com/apache/spark/pull/10045#issuecomment-161137821
@kayousterhout I have a job with 200k tasks where some tasks fail as many as
22 times (no kidding - that is the actual number) and then succeed :-)
Specific examples (not sure if all of them are relevant):
a) In earlier work, GPU resource exhaustion would fail a task, but the task
would succeed when rescheduled.
b) Reducers failing due to fetch failures but succeeding on reschedule
(preferred locality is set, by the way, so the task gets rescheduled on the
same node).
c) JNI-related failures are fairly common in some of the libraries we use.
d) Direct buffer exhaustion, where the task succeeds on reschedule (a race
between tasks).
In some of these cases (a, b), moving the data to a different node is fairly
expensive.
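For context, a minimal sketch of the kind of configuration such a job leans
on, assuming `spark.task.maxFailures` is the retry knob being raised; the
object name and the value 25 are illustrative, not taken from the job above:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver showing a raised per-task retry budget so that
// transient failures (direct buffer exhaustion, JNI flakiness, GPU
// resource exhaustion) do not abort the whole job before a
// reschedule succeeds.
object RetryTolerantJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("retry-tolerant-job")
      // Default is 4; 25 is an assumed value that would cover tasks
      // failing up to ~22 times before succeeding. Master/deploy
      // settings are expected to come from spark-submit.
      .set("spark.task.maxFailures", "25")

    val sc = new SparkContext(conf)
    // Job body elided; any action run here inherits the retry budget.
    sc.stop()
  }
}
```

As far as I know, fetch failures (case b) are handled via stage resubmission
rather than this per-task counter, so this setting only covers the other
failure modes directly.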