Github user markgrover commented on the pull request:

    https://github.com/apache/spark/pull/8007#issuecomment-133196307
  
    > Your assumption probably holds for the preemption case, since it's YARN 
killing the container. But I can imagine that if the container exits by itself, 
it might be possible for the disconnect to reach the driver endpoint and the 
GetExecutorLossReason message to reach the AM before the NM has had a chance to 
process the container exit and communicate that to the RM.
    
    There is still something I don't quite understand here, so I'm posting 
this for my own education.
    
    1) In my case, it was still YARN killing the container, albeit not for 
preemption but for other reasons, and I would have hoped/assumed that the AM's 
state would have been updated by the RM by the time we request the loss reason. 
It seems that may not always be the case, which I wasn't expecting.
    
    2) I was getting two GetExecutorLossReason messages in the AM, but the only 
place that can send them is the scheduler backend when it gets the disconnected 
event. That, in my opinion, is orthogonal to figuring out the answer to (1) 
above. So, for shits and giggles, I am going to try keeping the entries around 
in `pendingDisconnectedExecutors` to see if that prevents the two calls, and 
will post back.
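
    A minimal sketch of the experiment in (2), under stated assumptions: 
`LossReasonTracker`, `onDisconnected`, `onLossReasonResolved`, and the 
`requestLossReason` callback are hypothetical names used only for illustration 
and are not the actual scheduler backend code. Only the 
`pendingDisconnectedExecutors` name comes from the discussion above. The idea 
is that if an executor's entry stays in the set until its loss reason has been 
handled, a second disconnect event for the same executor should not trigger a 
second GetExecutorLossReason request.

```scala
import scala.collection.mutable

// Hypothetical stand-in for the scheduler backend; NOT Spark's actual class.
// requestLossReason is an assumed callback that would send the
// GetExecutorLossReason message to the AM.
class LossReasonTracker(requestLossReason: String => Unit) {

  // Executors whose loss reason has been requested but not yet resolved.
  private val pendingDisconnectedExecutors = mutable.HashSet.empty[String]

  /** Handles a disconnect event; returns true if a request was sent. */
  def onDisconnected(executorId: String): Boolean = synchronized {
    // HashSet.add returns false when the id is already present, so a
    // repeated disconnect for the same executor sends nothing.
    if (pendingDisconnectedExecutors.add(executorId)) {
      requestLossReason(executorId)
      true
    } else {
      false
    }
  }

  /** Releases the entry once the loss reason has been handled. */
  def onLossReasonResolved(executorId: String): Unit = synchronized {
    pendingDisconnectedExecutors.remove(executorId)
  }
}
```

    With this shape, calling `onDisconnected("exec-1")` twice in a row invokes 
the callback only once; the entry is released only by 
`onLossReasonResolved("exec-1")`. Whether the real duplicate comes from a 
repeated disconnect event is exactly what the experiment is meant to find out.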


