GitHub user kayousterhout commented on the pull request:
https://github.com/apache/spark/pull/8760#issuecomment-161070513
Before fixing style issues on this change, I think it's worth considering
whether this is the right approach for blacklisting. Based on discussion here
and in #3541 and #10045, it seems like there are a few high-level goals here:
(1) Minimize the required user configuration (ideally this should just work
without any configuration)
(2) If an executor/host isn't working well, stop using it for the
particular job
(3) If an executor/host is failing for multiple jobs, stop using it across
all jobs
(4) Eventually re-try the executor/host, since the failure may be transient
(5) Don't overcorrect for bad tasks / jobs (that fail regardless of where
they're run)
My concern with the approach in this PR is that it requires a lot of user
configuration, and as has been discussed in the various PRs, the appropriate
blacklist timeout is very workload-dependent and type-of-failure-dependent. I
wonder if it would make sense to do something like an exponentially increasing
timeout (where each consecutive failure triggers a longer timeout) to make this
have a lower configuration overhead. I pinged @mateiz to see if he has any
other ideas about how to do this gracefully.
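To make the exponentially increasing timeout idea concrete, here is a minimal sketch. It is not code from this PR or from Spark; the class name `BackoffBlacklist`, the doubling factor, the default base timeout, and the cap are all assumptions chosen for illustration.

```python
class BackoffBlacklist:
    """Hypothetical tracker: each consecutive failure doubles the timeout."""

    def __init__(self, base_timeout_s=10.0, max_timeout_s=3600.0):
        # Both defaults are illustrative assumptions, not proposed values.
        self.base_timeout_s = base_timeout_s
        self.max_timeout_s = max_timeout_s
        self._failures = {}       # executor -> consecutive failure count
        self._blocked_until = {}  # executor -> time at which it may be retried

    def record_failure(self, executor, now):
        """Register a failure and return the timeout applied for it."""
        n = self._failures.get(executor, 0) + 1
        self._failures[executor] = n
        # Timeout grows as base * 2^(n-1), capped so the executor is
        # eventually retried (goal 4).
        timeout = min(self.base_timeout_s * 2 ** (n - 1), self.max_timeout_s)
        self._blocked_until[executor] = now + timeout
        return timeout

    def record_success(self, executor):
        """A success ends the streak, so the next failure starts at the base."""
        self._failures.pop(executor, None)
        self._blocked_until.pop(executor, None)

    def is_blacklisted(self, executor, now):
        return now < self._blocked_until.get(executor, 0.0)
```

With something like this, a user would at most tune the base and the cap; a transient failure costs only a short timeout, while a persistently bad executor is backed off for longer and longer.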
The other issue is (5). One way to handle that is to only "count" a task
failure if the task fails on the executor *and* succeeds elsewhere.
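A rough sketch of that counting rule, under the assumption that the scheduler can report per-task failures and successes with the executor they ran on. The class `ConfirmedFailureCounter` and its method names are hypothetical, not existing Spark APIs.

```python
from collections import defaultdict


class ConfirmedFailureCounter:
    """Hypothetical counter: a failure counts against an executor only
    once the same task has succeeded on a different executor."""

    def __init__(self):
        self._pending = defaultdict(set)    # task_id -> executors it failed on
        self._confirmed = defaultdict(int)  # executor -> confirmed failures

    def task_failed(self, task_id, executor):
        # Hold the failure as unconfirmed; the task itself may be bad.
        self._pending[task_id].add(executor)

    def task_succeeded(self, task_id, executor):
        # The task can run somewhere, so earlier failures on *other*
        # executors are now attributed to those executors.
        for failed_on in self._pending.pop(task_id, set()):
            if failed_on != executor:
                self._confirmed[failed_on] += 1

    def confirmed_failures(self, executor):
        return self._confirmed.get(executor, 0)
```

A task that fails everywhere never succeeds, so its failures stay pending and never count against any executor, which is what goal (5) asks for. The trade-off is that an executor's count only updates after a retry succeeds elsewhere.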
It would be great if we could make blacklisting work well out-of-the-box,
so I think it's worth putting some thought into the right approach here. It
would be useful to get other folks' feedback about whether these are the right
goals and if there are better ideas for how to achieve them.