GitHub user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/15326
@erenavsarogullari @markhamstra I just looked at this further and I
actually think this could be an issue:
(1) The first attempt for a stage fails with a fetch failure. The
associated TaskSetManager is marked as a zombie but some tasks are still
running, so removeSchedulable isn't called yet (it gets called in
TaskSchedulerImpl only after all running tasks in the stage have finished).
(2) The map stage re-runs and a new attempt for the stage begins. This
attempt has the same stage ID, so will have the same schedulable name.
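To make the sequence in (1) and (2) concrete, here is a toy model of the collision. It is only a sketch: `ToyPool` and `ToyTaskSetManager` are hypothetical stand-ins rather than Spark's real classes, and the `TaskSet_<stageId>` name format is an assumption used to illustrate a name derived from the stage ID alone.

```python
from dataclasses import dataclass


@dataclass
class ToyTaskSetManager:
    """Hypothetical stand-in for a TaskSetManager; not Spark's class."""
    stage_id: int
    attempt: int
    is_zombie: bool = False
    running_tasks: int = 0

    @property
    def name(self):
        # Assumption for illustration: the name depends only on the stage ID.
        return f"TaskSet_{self.stage_id}"


class ToyPool:
    """Hypothetical pool that indexes its schedulables by name."""

    def __init__(self):
        self.by_name = {}

    def add_schedulable(self, tsm):
        # A later entry with the same name replaces the earlier one.
        self.by_name[tsm.name] = tsm


pool = ToyPool()

# (1) First attempt hits a fetch failure: it becomes a zombie, but it still
# has running tasks, so it has not been removed from the pool.
first = ToyTaskSetManager(stage_id=3, attempt=0, running_tasks=2)
pool.add_schedulable(first)
first.is_zombie = True

# (2) The retry for the same stage registers under the same name.
retry = ToyTaskSetManager(stage_id=3, attempt=1)
pool.add_schedulable(retry)

collided = first.name == retry.name
```

In this toy, the name index ends up holding only the retry even though the zombie attempt still has running tasks, which is the ambiguity a by-name lookup would run into.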
I remember having a long discussion about (1) a while ago (and when we
should call removeSchedulable), and we decided it was most "fair" to call it
only after all running tasks complete, because running tasks should be counted
towards the pool's share even when they're for zombie task sets.
In this case, I think the last-schedulable-wins policy that currently
exists is better, although still wrong. I'd argue we should first fix this
bug (by giving each TaskSetManager a unique name) in a separate PR, and then
make the fix to this PR that you suggested, Mark. The other fix seems like it
should be relatively simple. @markhamstra thoughts? Does that seem reasonable?
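As a rough illustration of the unique-name idea, one option would be to fold the stage attempt number into the name. This is a sketch under assumptions: `schedulable_name` is a hypothetical helper, and the `TaskSet_<stageId>.<attempt>` format is just one possible scheme, not necessarily what the separate PR would use.

```python
def schedulable_name(stage_id, stage_attempt):
    """Hypothetical helper: build a name that is unique per stage attempt."""
    return f"TaskSet_{stage_id}.{stage_attempt}"


# Register two attempts of the same stage in a name-keyed index.
registry = {}
for attempt in (0, 1):
    name = schedulable_name(3, attempt)
    registry[name] = {"stage_id": 3, "attempt": attempt}

names = sorted(registry)
```

With names like these, the zombie attempt and the retry would occupy separate entries, so a lookup by name would no longer depend on which one was added last.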