wangyang0918 edited a comment on pull request #15524:
URL: https://github.com/apache/flink/pull/15524#issuecomment-846785372


   @xintongsong Thanks for the thorough analysis and solution. It makes sense 
to me to move the code changes related to multiple leader sessions out of the 
scope of this PR. After this PR, we can guarantee that only the active 
JobManager can start/stop the TaskManager pods/containers and deregister the 
application from the cluster management system. This is what we originally 
wanted to achieve in FLINK-21667.
   
   To support multiple leader sessions ("grant leadership" -> "revoke" -> 
"grant again") without crashing the JobManager process, I am not sure whether 
the duplicate registration with Yarn is the only problem. We could open a 
follow-up ticket for further investigation.
   
   > How to handle resource changes between two leader sessions?
   
   I lean towards the second solution: use `YarnClient#getContainers` to get 
the full list of running containers, and the `AMRMClientAsync` heartbeat to get 
the incremental changes. It feels just like the K8s "list and watch" mechanism.
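   
   A minimal sketch of the "list and watch" idea, to make the proposal concrete. 
Everything here is hypothetical: `ContainerView` and its method names are not 
Flink or Yarn classes, and container IDs are plain strings. The assumption is 
that `resync` would be fed from `YarnClient#getContainers` when leadership is 
granted, and that `onAllocated`/`onCompleted` would be driven by the 
`AMRMClientAsync` callback handler.

```java
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch only; not an actual Flink or Yarn class. */
public class ContainerView {
    // IDs of containers currently believed to be running.
    private final Set<String> running = new TreeSet<>();

    /** "List": replace the whole view with a full snapshot (start of a leader session). */
    public synchronized void resync(Collection<String> fullSnapshot) {
        running.clear();
        running.addAll(fullSnapshot);
    }

    /** "Watch": apply an incremental change for newly allocated containers. */
    public synchronized void onAllocated(List<String> containerIds) {
        running.addAll(containerIds);
    }

    /** "Watch": apply an incremental change for completed containers. */
    public synchronized void onCompleted(List<String> containerIds) {
        running.removeAll(containerIds);
    }

    /** Returns a defensive copy of the current view. */
    public synchronized Set<String> snapshot() {
        return new TreeSet<>(running);
    }

    public static void main(String[] args) {
        ContainerView view = new ContainerView();
        // Leadership granted: take the full list first.
        view.resync(List.of("container_1", "container_2"));
        // Afterwards only incremental changes arrive.
        view.onAllocated(List.of("container_3"));
        view.onCompleted(List.of("container_1"));
        System.out.println(view.snapshot()); // prints [container_2, container_3]
    }
}
```

   Calling `resync` again on every "grant leadership" would discard whatever 
changed while the JobManager was not the leader, which is the gap between two 
leader sessions raised in the question above.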


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

