Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/9154#issuecomment-149289335
  
    To give a specific example, suppose task t1 has preferred locations on 
executor e1 (on host h1), e2 (also on host h1) and e3 (on host h2).
    
    The data structures will look like:
    
    pendingTasksForExecutor: {e1: t1, e2: t1, e3: t1}
    pendingTasksForHost: {h1: t1, h2: t1}
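
    To make the bookkeeping concrete, here is a toy Python model of how an
    addPendingTask-style call would populate these two maps (hypothetical names
    and structure -- Spark's actual TaskSetManager is Scala and indexes task
    indices into HashMaps of ArrayBuffers; this sketch just mirrors the example
    above):

```python
from collections import defaultdict

# Toy model of the two pending-task indexes from the example above.
pendingTasksForExecutor = defaultdict(list)
pendingTasksForHost = defaultdict(list)

# t1's preferred locations as (executor, host) pairs:
# e1 and e2 on host h1, e3 on host h2.
preferred_locations = [("e1", "h1"), ("e2", "h1"), ("e3", "h2")]

def add_pending_task(task, locations):
    """Index the task under each preferred executor and host.

    The membership checks are part of this toy model only; they make the
    "already added, so no effect" behavior discussed below explicit.
    """
    for executor, host in locations:
        if task not in pendingTasksForExecutor[executor]:
            pendingTasksForExecutor[executor].append(task)
        if task not in pendingTasksForHost[host]:
            pendingTasksForHost[host].append(task)

add_pending_task("t1", preferred_locations)
# pendingTasksForExecutor: {e1: [t1], e2: [t1], e3: [t1]}
# pendingTasksForHost:     {h1: [t1], h2: [t1]}
```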
    
    We've agreed that the "addPendingTask" call is irrelevant for tasks that 
are currently running (handleFailedTask already calls addPendingTask for any 
running task), so let's say t1 hasn't been run yet.
    
    Now suppose executor e2 dies.  We never remove any entries from 
pendingTasksForExecutor or pendingTasksForHost (not in addPendingTask, nor 
anywhere else, as far as I can tell; we still won't schedule anything on the 
dead executor, because the TaskSetManager will never get a resource offer for 
it).  addPendingTask will "re-add" entries for each of t1's preferred locations 
(including the lost executor -- we don't check whether the executor is alive 
when updating the map entries).  However, all of these locations were already 
added above, so this call has no effect.
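
    Continuing the toy Python model from above (hypothetical names, not 
    Spark's actual code), re-running addPendingTask after e2 dies leaves both 
    maps exactly as they were -- including the stale entry for the dead 
    executor, since nothing consults executor liveness when updating the maps:

```python
import copy

# State after the initial add_pending_task("t1", ...) call.
pendingTasksForExecutor = {"e1": ["t1"], "e2": ["t1"], "e3": ["t1"]}
pendingTasksForHost = {"h1": ["t1"], "h2": ["t1"]}

def add_pending_task(task, locations):
    # No liveness check: the dead executor e2 is indexed like any other,
    # mirroring the observation that addPendingTask never inspects
    # executor state when updating the map entries.
    for executor, host in locations:
        if task not in pendingTasksForExecutor.setdefault(executor, []):
            pendingTasksForExecutor[executor].append(task)
        if task not in pendingTasksForHost.setdefault(host, []):
            pendingTasksForHost[host].append(task)

# Deep-copy snapshots, so shared inner lists can't mask a change.
before_exec = copy.deepcopy(pendingTasksForExecutor)
before_host = copy.deepcopy(pendingTasksForHost)

# e2 (on h1) has died; addPendingTask runs again for t1's locations.
add_pending_task("t1", [("e1", "h1"), ("e2", "h1"), ("e3", "h2")])

assert pendingTasksForExecutor == before_exec  # the "re-add" is a no-op
assert pendingTasksForHost == before_host
assert "e2" in pendingTasksForExecutor  # stale entry is never removed
```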
    
    Which part of this reasoning do you think is incorrect?


