GitHub user kayousterhout commented on the issue:

    https://github.com/apache/spark/pull/15326
  
    @erenavsarogullari @markhamstra I just looked at this further and I 
actually think this could be an issue:
    
    (1) The first attempt for a stage fails with a fetch failure.  The 
associated TaskSetManager is marked as a zombie but some tasks are still 
running, so removeSchedulable isn't called yet (it gets called in 
TaskSchedulerImpl only after all running tasks in the stage have finished).
    
    (2) The map stage re-runs and a new attempt for the stage begins. This 
attempt has the same stage ID, so it will have the same schedulable name as 
the zombie TaskSetManager from (1), which is still registered in the pool.
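
    As a rough illustration of (1) and (2), here is a toy model, not Spark's 
actual Pool or TaskSetManager code. ToyPool, ToyTaskSet, and the 
"TaskSet_<stageId>" name format are assumptions made for the sketch; the only 
thing taken from the scenario above is that both attempts share a name.

```python
from dataclasses import dataclass


@dataclass
class ToyTaskSet:
    """Hypothetical stand-in for a TaskSetManager."""
    stage_id: int
    attempt: int

    @property
    def name(self):
        # Assumption: the name depends on the stage ID only.
        return f"TaskSet_{self.stage_id}"


class ToyPool:
    """Hypothetical pool that tracks schedulables in a queue and by name."""

    def __init__(self):
        self.queue = []
        self.by_name = {}

    def add_schedulable(self, s):
        self.queue.append(s)
        # A later schedulable with the same name overwrites the earlier one.
        self.by_name[s.name] = s

    def remove_schedulable(self, s):
        self.queue.remove(s)
        # Removes whatever is stored under that name, not necessarily `s`.
        self.by_name.pop(s.name, None)


pool = ToyPool()
zombie = ToyTaskSet(stage_id=3, attempt=0)  # failed attempt, tasks still running
retry = ToyTaskSet(stage_id=3, attempt=1)   # new attempt, same stage ID

pool.add_schedulable(zombie)
pool.add_schedulable(retry)
lookup_after_retry = pool.by_name["TaskSet_3"].attempt

# The zombie's last running task finishes, so it is finally removed.
pool.remove_schedulable(zombie)
retry_still_queued = retry in pool.queue
retry_still_named = "TaskSet_3" in pool.by_name
```

    In this sketch the by-name lookup silently switches to the new attempt 
when it is added, and removing the zombie later also drops the name entry that 
the live attempt depends on, even though the live attempt is still queued.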
    
    I remember having a long discussion about (1) a while ago (and when we 
should call removeSchedulable) and decided it was most "fair" to call it only 
after all running tasks complete, because running tasks should be counted 
towards the pool's share even when they're for zombie task sets.
    
    In this case, I think the existing last-schedulable-wins policy is 
better, although still wrong.  I'd argue we should first fix this bug (by 
giving each TaskSetManager a unique name) in a separate PR, and then make the 
fix to this PR that you suggested, Mark.  The other fix seems like it should 
be relatively simple.  @markhamstra thoughts?  Does that seem reasonable?
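
    For illustration only, one way to make the names unique would be to 
include the stage attempt in the name. Both helper functions and the 
"TaskSet_<stageId>.<attempt>" format below are hypothetical; the proposal 
above only says the name should be unique.

```python
def stage_only_name(stage_id, attempt):
    # Name derived from the stage ID alone: attempts of one stage collide.
    return f"TaskSet_{stage_id}"


def unique_name(stage_id, attempt):
    # Hypothetical format: stage ID plus attempt number.
    return f"TaskSet_{stage_id}.{attempt}"


# Two attempts of the same stage, as in the fetch-failure scenario.
attempts = [(3, 0), (3, 1)]

stage_only_names = {stage_only_name(s, a) for s, a in attempts}
unique_names = {unique_name(s, a) for s, a in attempts}

collides = len(stage_only_names) < len(attempts)
distinct = len(unique_names) == len(attempts)
```

    With distinct names, a zombie attempt and its retry can both stay 
registered in the pool without one shadowing the other.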

