GitHub user jerryshao commented on the pull request:

    https://github.com/apache/spark/pull/10794#issuecomment-186495337
  
    Hi @andrewor14, in our current implementation, when the AM fails, all of 
its related executors exit automatically, and the driver is notified through 
disconnection events and removes the related state. Then, when the AM 
restarts, new executors register with the driver. 
    
    Here we assume that all the executors have exited before the AM restarts. 
I'm afraid the AM may be restarted before all of them have exited. To 
address this, in #9963 I clear `executorDataMap` when reset is 
invoked, but only when dynamic allocation is enabled. As 
@lianhuiwang mentioned, we should also clear this state when dynamic 
allocation is disabled.
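    
    As a rough illustration of the point above, here is a toy Python model 
(not Spark code). The class `DriverState` and the methods `reset_guarded` and 
`reset_always` are hypothetical names made up for this sketch; only 
`executorDataMap` (modeled as `executor_data_map`) comes from the discussion. 
It shows that a reset which only clears state under dynamic allocation leaves 
stale entries behind when dynamic allocation is disabled.

```python
class DriverState:
    """Toy stand-in for the driver-side executor bookkeeping (hypothetical)."""

    def __init__(self, dynamic_allocation_enabled):
        self.dynamic_allocation_enabled = dynamic_allocation_enabled
        # Models executorDataMap: executor id -> host of that executor.
        self.executor_data_map = {}

    def register(self, executor_id, host):
        # Record an executor that has registered with the driver.
        self.executor_data_map[executor_id] = host

    def reset_guarded(self):
        # Assumed shape of the current behavior: clear only when
        # dynamic allocation is enabled.
        if self.dynamic_allocation_enabled:
            self.executor_data_map.clear()

    def reset_always(self):
        # Suggested behavior: clear regardless of dynamic allocation.
        self.executor_data_map.clear()


state = DriverState(dynamic_allocation_enabled=False)
state.register("1", "host-a")   # executor from the first AM attempt
state.reset_guarded()           # AM restarts, but nothing is cleared
stale_after_guarded = dict(state.executor_data_map)
state.reset_always()            # unconditional reset drops the stale entry
stale_after_always = dict(state.executor_data_map)
```

    With dynamic allocation disabled, `stale_after_guarded` still holds the 
old executor, while `stale_after_always` is empty.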
    
    Besides, I think there might be an executor id conflict: executor ids 
are recalculated when the AM restarts, so the new ids may collide with the 
old ones. The issue may affect not only the driver side but also the 
external shuffle service (since the shuffle service now requires the 
executor id to do some recovery work), but I haven't run into such an issue 
so far.
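
    To make the suspected id conflict concrete, here is a small Python 
sketch (again a toy model, not Spark code). It assumes, purely for 
illustration, that each AM attempt numbers its executors from 1 again; the 
helpers `new_id_allocator` and `colliding_ids` are hypothetical.

```python
import itertools


def new_id_allocator():
    """Hypothetical allocator: every AM attempt starts counting from 1."""
    counter = itertools.count(1)
    return lambda: str(next(counter))


def colliding_ids(surviving_ids, new_executor_count):
    """Ids handed out by a restarted AM that old executors still hold."""
    next_id = new_id_allocator()
    fresh = [next_id() for _ in range(new_executor_count)]
    return [eid for eid in fresh if eid in surviving_ids]


# Executors "2" and "3" from the previous attempt have not exited yet
# when the restarted AM launches three new executors.
conflicts = colliding_ids({"2", "3"}, 3)
```

    Here `conflicts` is `["2", "3"]`. If every old executor had already 
exited (an empty set), the list would be empty, which may be why the problem 
has not shown up in practice.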


