Github user tillrohrmann commented on the issue:
https://github.com/apache/flink/pull/2257
Hi @mxm, I've changed the implementation so that we no longer need the
`containersLaunched` map in the `YarnFlinkResourceManager`. Instead, the
`registeredWorkers` map in the `FlinkResourceManager` is now kept (not cleared)
when the `JobManager` loses leadership. Thus, the `registeredWorkers` field
denotes the successfully started task managers (and the containers they are
running in).
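
To illustrate the idea, here is a minimal sketch of the bookkeeping, not the actual `FlinkResourceManager` code. The class name `ResourceManagerSketch`, its method names, and the use of plain strings for resource and container IDs are all hypothetical and only serve the illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for the resource manager's bookkeeping.
class ResourceManagerSketch {
    // Successfully registered workers, keyed by resource ID (value: container ID).
    private final Map<String, String> registeredWorkers = new HashMap<>();
    // Address of the current leading JobManager, or null if there is none.
    private String leaderJobManager;

    void jobManagerGainedLeadership(String jobManagerAddress) {
        leaderJobManager = jobManagerAddress;
    }

    void workerRegistered(String resourceId, String containerId) {
        registeredWorkers.put(resourceId, containerId);
    }

    void jobManagerLostLeadership() {
        leaderJobManager = null;
        // Deliberately no registeredWorkers.clear() here: the started task
        // managers and their containers are still known after the leader is gone.
    }

    int numberOfRegisteredWorkers() {
        return registeredWorkers.size();
    }

    boolean hasLeader() {
        return leaderJobManager != null;
    }
}
```

With this, a worker registered before the leadership loss is still tracked afterwards, so no separate map of launched containers is needed to remember it.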
Additionally, I reintroduced the reconnect resource manager functionality in
the job manager. This should make sure that the resource manager is eventually
notified about newly registered resources. In the current implementation,
however, the resource manager always accepts the register resource messages.
So the reconnect resource manager message is only sent if a register resource
message gets lost and thus triggers a timeout exception.
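
Roughly, the timeout path behaves like the following sketch. This is an illustration under assumptions, not the code in the PR: the class `RegistrationSketch`, its method, and the counter are hypothetical, and the real implementation is message-based rather than blocking on a future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for the job manager side of the registration.
class RegistrationSketch {
    // Counts how often a reconnect to the resource manager was triggered.
    int reconnectsTriggered = 0;

    // Waits for the resource manager's acknowledgement of a register resource
    // message. Returns true if the acknowledgement arrived in time. If it did
    // not (e.g. the message got lost), a reconnect is triggered instead.
    boolean registerResource(CompletableFuture<Boolean> acknowledgement, long timeoutMillis) {
        try {
            return acknowledgement.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Only a timeout leads to a reconnect, since the resource manager
            // never rejects a registration in this sketch.
            reconnectsTriggered++;
            return false;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Registration failed unexpectedly", e);
        }
    }
}
```

An acknowledgement that completes in time leaves `reconnectsTriggered` unchanged, while one that never completes increments it once the timeout expires.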
Would be great if you could take another look at the changes.