GitHub user mridulm commented on the issue:
https://github.com/apache/spark/pull/15249
Thinking more, and based on what @squito mentioned, I was considering the
following:
Since we are primarily dealing with executors or nodes that are 'bad' (as
opposed to recoverable failures due to resource contention, the degenerate
corner cases the existing blacklist guards against, etc.): can we assume that
a successful task execution on a node implies the node is healthy? And what
about at the executor level?
The proposal is to keep the PR as-is for the most part, but:
- Clear nodeToExecsWithFailures when a task on a node succeeds; same for
nodeToBlacklistedTaskIndexes (a sketch of this reset follows the list).
- I am not sure we want to reset execToFailures for an executor (not clearing
it would imply we are implicitly handling the resource-starvation case, IMO).
- If possible, allow speculative tasks to be scheduled on blacklisted
nodes/executors when countTowardsTaskFailures can be overridden to false in
those cases (if not, ignore this point, since their failures would add to the
number of failures per app).
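To make the first point concrete, here is a minimal sketch of the
success-driven reset, assuming a TaskSetBlacklist-like holder of the maps
named above (the class and method names are illustrative, not the PR's actual
API):

```scala
import scala.collection.mutable

// Illustrative stand-in for the per-taskset blacklist state.
class TaskSetBlacklistSketch {
  // node -> executors that have seen failures on that node
  val nodeToExecsWithFailures = mutable.Map[String, mutable.Set[String]]()
  // node -> task indexes blacklisted on that node
  val nodeToBlacklistedTaskIndexes = mutable.Map[String, mutable.Set[Int]]()
  // executor -> failure count; deliberately NOT cleared on success, so
  // repeated failures on one executor still accumulate (second bullet).
  val execToFailures = mutable.Map[String, Int]()

  // Hypothetical hook, called whenever a task succeeds on `node`/`exec`.
  def onTaskSuccess(node: String, exec: String): Unit = {
    // A success on the node suggests past failures there were transient,
    // so drop the node-level failure bookkeeping.
    nodeToExecsWithFailures.remove(node)
    nodeToBlacklistedTaskIndexes.remove(node)
    // execToFailures(exec) is intentionally left untouched.
  }
}
```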
The rationale is that successful tasks indicate past failures were not
symptomatic of bad nodes/executors but were transient. Speculative tasks also
work, in effect, as probe tasks to determine whether a node/executor has
recovered and is healthy.
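Concretely, that probe idea reduces to a scheduling check along these lines
(the flag names here are illustrative, not existing fields in the PR):

```scala
object SpeculativeProbe {
  // Allow a task onto a blacklisted executor only when it is a speculative
  // copy whose failure would not count toward the per-app failure limit;
  // a success then clears the node-level state via the reset sketched above.
  def canScheduleOn(execBlacklisted: Boolean,
                    isSpeculative: Boolean,
                    failureWontCount: Boolean): Boolean =
    !execBlacklisted || (isSpeculative && failureWontCount)
}
```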
I hope I am not missing anything. Any thoughts @squito, @kayousterhout,
@tgravescs?