LiuPeien commented on pull request #15996:
URL: https://github.com/apache/flink/pull/15996#issuecomment-847682716


   If a NodeManager crashes while a Flink job is running and we then try to 
cancel the job, the job's containers are not released immediately. The root 
cause is that NMClient tries to stop the containers on the dead NodeManager 
and gets stuck because it cannot connect to it. Because the clean-up process 
is serial and synchronous, once it gets stuck, the containers on the healthy 
NodeManagers cannot be stopped either.
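
To make the failure mode concrete, here is a minimal, self-contained sketch of one way to keep a hung stop call from blocking the others: issue the stop requests concurrently and bound the total wait. This is an illustration only, not the change in this pull request. It does not call YARN's NMClient; `ContainerStopper`, `stopAll`, the container names, and the timeout value are all hypothetical, and the dead NodeManager is simulated with a long sleep.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class ContainerCleanupSketch {

    /** Hypothetical stand-in for a blocking per-container stop call; not a YARN API. */
    interface ContainerStopper {
        void stop(String containerId) throws Exception;
    }

    /**
     * Submits every stop request at once and waits at most timeoutMs in total,
     * so one hung request cannot delay the rest. Returns the ids whose stop
     * call finished in time, in the order they were given.
     */
    static List<String> stopAll(List<String> containerIds, ContainerStopper stopper, long timeoutMs) {
        ExecutorService pool = Executors.newCachedThreadPool();
        List<Future<Object>> pending = new ArrayList<>();
        for (String id : containerIds) {
            Callable<Object> task = () -> {
                stopper.stop(id);
                return null;
            };
            pending.add(pool.submit(task));
        }

        List<String> stopped = new ArrayList<>();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        try {
            for (int i = 0; i < pending.size(); i++) {
                long remaining = Math.max(0L, deadline - System.nanoTime());
                try {
                    pending.get(i).get(remaining, TimeUnit.NANOSECONDS);
                    stopped.add(containerIds.get(i));
                } catch (TimeoutException | ExecutionException e) {
                    // The stop call hung or failed; skip it and keep going.
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            // Interrupts calls that are still hanging. The simulated sleep reacts
            // to interruption; a real blocking network call might not.
            pool.shutdownNow();
        }
        return stopped;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("container_on_dead_nm", "container_a", "container_b");
        // Simulation: the container on the "dead" node never answers.
        ContainerStopper stopper = id -> {
            if (id.equals("container_on_dead_nm")) {
                Thread.sleep(60_000);
            }
        };
        System.out.println(stopAll(ids, stopper, 500)); // prints [container_a, container_b]
    }
}
```

With a serial loop, the first call would block and `container_a` and `container_b` would never be reached. Here they are stopped even though the first request never returns, at the cost of one thread per in-flight request in this simplified version.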


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

