GitHub user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/15249
@tgravescs no decision here yet.
@mridulm the main question for (2), though, is whether the consequences are a
deal-breaker. It doesn't seem disastrous if a task needs to run on a non-local
machine instead of getting re-tried on a machine where it already failed but
might succeed later on. Also, it seems likely that the task has a higher
probability of completing sooner if it runs on another machine compared to
re-running (after a delay) on a machine where it already failed. What are the
situations you're most concerned about with the new approach?
If we leave the existing mechanism in, one concern (besides the additional
complexity) is the interaction between the new host-level blacklisting and the
old executor-level blacklisting. Consider a scenario where the executor-level
timeout keeps tasks from being retried on the same executor for some period of
time. The tasks then run on other executors on the same host, fail there too,
and cause the host to be permanently blacklisted. At that point it is
irrelevant that the executor blacklist would eventually re-allow the task. I
think we'd need to change the old executor blacklist timeout to be a host
blacklist timeout for this to work.
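
To make the interaction concrete, here is a rough toy model of the scenario.
It is only a sketch: the class, the executor and host names, the timeout value,
and the "two failed executors blacklists the host" threshold are all
hypothetical and are not taken from Spark's code or from this PR.

```python
# Toy model only: names and thresholds are hypothetical, not Spark's API.
EXECUTOR_TIMEOUT_MS = 1000       # assumed old per-executor retry timeout
MAX_FAILED_EXECS_PER_HOST = 2    # assumed new host-level threshold

# Two executors that share one host.
HOST_OF = {"e1": "h1", "e2": "h1"}


class ToyBlacklist:
    def __init__(self):
        self.exec_failed_at = {}        # executor -> time of last failure (ms)
        self.failed_execs = {}          # host -> executors that saw a failure
        self.blacklisted_hosts = set()  # entries never expire in this model

    def record_failure(self, executor, now_ms):
        host = HOST_OF[executor]
        self.exec_failed_at[executor] = now_ms
        self.failed_execs.setdefault(host, set()).add(executor)
        if len(self.failed_execs[host]) >= MAX_FAILED_EXECS_PER_HOST:
            self.blacklisted_hosts.add(host)

    def executor_allowed(self, executor, now_ms):
        # Old mechanism: the executor is usable again once the timeout passes.
        failed_at = self.exec_failed_at.get(executor)
        return failed_at is None or now_ms - failed_at >= EXECUTOR_TIMEOUT_MS

    def can_run(self, executor, now_ms):
        # New mechanism takes priority: a blacklisted host blocks everything.
        if HOST_OF[executor] in self.blacklisted_hosts:
            return False
        return self.executor_allowed(executor, now_ms)


bl = ToyBlacklist()
bl.record_failure("e1", now_ms=0)

# During the timeout the task cannot go back to e1, so it lands on e2.
e1_allowed_during_timeout = bl.can_run("e1", now_ms=10)
e2_allowed_during_timeout = bl.can_run("e2", now_ms=10)
bl.record_failure("e2", now_ms=10)   # second failure on the same host

# After the executor timeout has expired for both executors:
later = 10 + EXECUTOR_TIMEOUT_MS
timeout_expired = bl.executor_allowed("e1", later)
still_schedulable = bl.can_run("e1", later)
```

In this model the executor-level check would re-allow e1 once the timeout
passes, but the task still cannot be scheduled there because h1 was
blacklisted in the meantime, which is the interaction described above.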