GitHub user sihuazhou commented on the issue:
https://github.com/apache/flink/pull/5931
Hi @GJL, could the cause be the same as in the previous PR for this ticket?
That is, the container is set up successfully and connects to the
ResourceManager successfully, but the TM is killed before it connects to the
JobManager. In this case, even though there are enough TMs, the JobManager
won't fire any new request, and the ResourceManager doesn't know that the
container it assigned to the JobManager has been killed either, so both the
JobManager and the ResourceManager do nothing but wait for the timeout. What
do you think?
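
To make the scenario concrete, here is a rough toy model of the sequence I
have in mind. It is not Flink code: the `ResourceManager` and `JobManager`
classes, the `simulate` function and the `container-1` id are all hypothetical
stand-ins, and the model simply assumes that nobody is notified when the TM
dies.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    # Hypothetical stand-in: containers the RM believes it has handed out.
    assigned: set = field(default_factory=set)

    def assign(self, container_id: str) -> None:
        self.assigned.add(container_id)


@dataclass
class JobManager:
    # Hypothetical stand-in: slot requests still waiting to be fulfilled.
    pending_slots: int
    registered_tms: set = field(default_factory=set)

    def register(self, container_id: str) -> None:
        self.registered_tms.add(container_id)
        self.pending_slots -= 1


def simulate(kill_before_jm_registration: bool) -> bool:
    """Return True if both sides end up waiting with nothing left to do."""
    rm = ResourceManager()
    jm = JobManager(pending_slots=1)

    # The container starts and the TM registers with the RM.
    rm.assign("container-1")

    if not kill_before_jm_registration:
        # Normal path: the TM also reaches the JM and the request is served.
        jm.register("container-1")
    # Otherwise the TM dies here. In this model neither side is told.

    # Stuck: the JM still waits for a slot, while the RM still counts the
    # dead container as assigned and therefore sees no reason to act.
    return jm.pending_slots > 0 and "container-1" in rm.assigned
```

In this sketch `simulate(True)` ends in the stuck state and `simulate(False)`
does not, which is the difference I suspect we are seeing.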
---