Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/13482
We should probably decouple task scheduling from the executor lost
reason eventually, but that is a separate issue.
The only case where I could see removing the notifyAll being a problem is if
they increase the heartbeat timeout to a very large number, but it would have
to be close to the rpc timeout, which they just shouldn't do. Otherwise, a
couple of extra seconds to reschedule the tasks in this failure case, which is
not the norm, shouldn't be a problem, and as soon as one happens, it goes down
to the 200ms that this patch is suggesting anyway.
@rdblue does removing the notifyAll call solve your problem as well? That
seems like a much cleaner approach than notifying and then sleeping for some
time again.
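
To make the trade-off above concrete, here is a minimal sketch of a timed wait
on a monitor. It is not the actual ApplicationMaster code: the class name, the
method names, and the 3000 ms heartbeat value are assumptions for illustration,
and only the 200 ms figure comes from the discussion. Without a notifyAll, the
waiting thread simply wakes up when its interval expires, so the worst-case
extra delay is one interval.

```java
class ReporterWaitSketch {
    // 200 ms is the interval mentioned in the comment; 3000 ms is an assumed heartbeat.
    static final long INITIAL_INTERVAL_MS = 200L;
    static final long HEARTBEAT_INTERVAL_MS = 3000L;

    private final Object lock = new Object();

    // Hypothetical helper: use the short interval while work is pending,
    // otherwise fall back to the (longer) heartbeat interval.
    static long sleepIntervalMs(boolean workPending) {
        return workPending ? INITIAL_INTERVAL_MS : HEARTBEAT_INTERVAL_MS;
    }

    // One pass of the loop: block on the monitor for at most the chosen interval.
    void waitOnce(boolean workPending) throws InterruptedException {
        synchronized (lock) {
            // Returns early if another thread calls lock.notifyAll();
            // otherwise it returns once the timeout elapses.
            lock.wait(sleepIntervalMs(workPending));
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ReporterWaitSketch sketch = new ReporterWaitSketch();
        long start = System.nanoTime();
        sketch.waitOnce(true); // nobody notifies, so this waits about 200 ms
        long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
        System.out.println("waited roughly " + elapsedMs + " ms without a notifyAll");
    }
}
```

With no notifier, an idle thread in this sketch picks up new work after at most
HEARTBEAT_INTERVAL_MS, which matches the "couple of extra seconds" estimate
only under the assumed heartbeat value.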