wangyang0918 commented on pull request #15524:
URL: https://github.com/apache/flink/pull/15524#issuecomment-846785372
@xintongsong Thanks for the thorough analysis and solution. It makes sense
to me to scope out the code changes related to multiple leader sessions. After
this PR, we can guarantee that only the active JobManager can start/stop
the TaskManager pods/containers and deregister the application from the
cluster manager. This is what we originally wanted to achieve in FLINK-21667.
To support multiple leader sessions ("grant leadership" -> "revoke" -> "grant
again") without crashing the JobManager process, I am not sure whether
the duplicated Yarn registration is the only problem. We could open a
follow-up ticket for more investigation.
> How to handle resource changes between two leader sessions?
I lean toward the second solution: use `YarnClient#getContainers` to get the
full list of running containers, and the `AMRMClientAsync` heartbeat to get
the incremental changes. It feels like the K8s "list and watch" mechanism.
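
A minimal sketch of the "list and watch" idea, to make the proposal concrete. It is not Flink or YARN code: `ContainerTracker` and its method names are hypothetical, and plain strings stand in for container IDs. The assumption is that `resync` would be fed from `YarnClient#getContainers` when leadership is granted, and that the two incremental methods would be driven by the `AMRMClientAsync` callbacks (`onContainersAllocated` / `onContainersCompleted`).

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of "list and watch" container tracking; not Flink or YARN code. */
class ContainerTracker {

    /** IDs of the containers currently believed to be running. */
    private final Set<String> running = new HashSet<>();

    /** "List": discard local state and start over from a full snapshot. */
    public synchronized void resync(List<String> snapshot) {
        running.clear();
        running.addAll(snapshot);
    }

    /** "Watch": a container was reported as newly allocated. */
    public synchronized void onAllocated(String containerId) {
        running.add(containerId);
    }

    /** "Watch": a container was reported as completed. */
    public synchronized void onCompleted(String containerId) {
        running.remove(containerId);
    }

    /** Returns a read-only copy of the current view. */
    public synchronized Set<String> runningContainers() {
        return Collections.unmodifiableSet(new HashSet<>(running));
    }
}
```

On each "grant leadership", the new session would call `resync` once and then rely on the incremental callbacks until leadership is revoked, so changes that happened between two leader sessions are picked up by the next full list rather than replayed.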