LiuPeien commented on pull request #15996: URL: https://github.com/apache/flink/pull/15996#issuecomment-847682716
If a NodeManager crashes while a Flink job is running and we try to cancel the job at that point, the job's containers cannot be released immediately. The root cause is that NMClient tries to stop the containers on the dead NodeManager, and the call gets stuck because it cannot connect to that NodeManager. Because the clean-up process is serial and synchronous, once it gets stuck, the containers on the healthy NodeManagers cannot be stopped either.
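
To illustrate the failure mode described above, here is a minimal sketch of one way to keep a single unreachable node from blocking the rest of the clean-up: each stop request runs on its own thread with a bounded wait. This is not necessarily the change made in this pull request, and it does not call the real YARN NMClient. `ParallelContainerCleanup`, `ContainerStopper`, `stopAll`, and the container ids are hypothetical names used only for illustration. It assumes Java 9 or later for `CompletableFuture.orTimeout`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class ParallelContainerCleanup {

    /** Hypothetical stand-in for a blocking "stop container" call; not a YARN API. */
    interface ContainerStopper {
        void stop(String containerId) throws Exception;
    }

    /**
     * Issues all stop requests concurrently and waits at most timeoutMillis for each.
     * Returns, per container id, whether the stop call finished in time without error.
     */
    static Map<String, Boolean> stopAll(
            List<String> containerIds, ContainerStopper stopper, long timeoutMillis) {
        // Daemon threads, so a call that never returns cannot keep the JVM alive.
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "container-stopper");
            t.setDaemon(true);
            return t;
        });
        try {
            Map<String, CompletableFuture<Boolean>> pending = new LinkedHashMap<>();
            for (String id : containerIds) {
                CompletableFuture<Boolean> done = CompletableFuture
                        .runAsync(() -> {
                            try {
                                stopper.stop(id);
                            } catch (Exception e) {
                                throw new CompletionException(e);
                            }
                        }, pool)
                        // Give up waiting on this container after the timeout.
                        .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS)
                        // Map success to true, timeout or failure to false.
                        .handle((ignored, error) -> error == null);
                pending.put(id, done);
            }
            Map<String, Boolean> results = new LinkedHashMap<>();
            pending.forEach((id, done) -> results.put(id, done.join()));
            return results;
        } finally {
            // Interrupt any stop calls that are still stuck.
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Simulated behaviour: ids starting with "dead" hang, as if their node were unreachable.
        ContainerStopper stopper = id -> {
            if (id.startsWith("dead")) {
                Thread.sleep(60_000);
            }
        };
        System.out.println(
                stopAll(List.of("container_1", "dead_container_2", "container_3"), stopper, 500));
    }
}
```

With a serial loop, the hanging call for `dead_container_2` would delay `container_3` indefinitely. Here every request is started up front, so the total wait is bounded by roughly one timeout rather than by the slowest node.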
