[GitHub] [flink] LiuPeien commented on pull request #15996: FLINK-22663 Stop YARN NMClient get stuck if it try to stop the containers on dead NodeManagers

GitBox Tue, 25 May 2021 01:51:57 -0700


LiuPeien commented on pull request #15996:
URL: https://github.com/apache/flink/pull/15996#issuecomment-847682716



   If a NodeManager crashes while a Flink job is running and we try to cancel 
the job at that point, the job's containers cannot be released immediately. 
The root cause is that NMClient tries to stop the containers on the dead 
NodeManager and gets stuck because it cannot connect to it. Because the 
clean-up process is serial and synchronous, once it gets stuck, the containers 
on the healthy NodeManagers cannot be stopped either.
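
   To illustrate the failure mode described above, here is a minimal sketch in 
plain Java with no Hadoop dependency. It shows one possible way to keep a hung 
stop call from blocking the others: run the stop calls concurrently and bound 
the total wait with a timeout. `ContainerStopper`, `hangingOn` and `stopAll` 
are hypothetical names made up for this example; they are not the NMClient 
API, and this is not necessarily how the PR itself addresses the issue.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class ContainerCleanupSketch {

    /** Hypothetical stand-in for a blocking "stop container" call (not the Hadoop API). */
    interface ContainerStopper {
        void stop(String containerId) throws Exception;
    }

    /** Simulates a dead NodeManager: stop calls for matching container ids never return. */
    static ContainerStopper hangingOn(String deadNodeMarker) {
        return containerId -> {
            if (containerId.contains(deadNodeMarker)) {
                new CountDownLatch(1).await(); // blocks until the thread is interrupted
            }
        };
    }

    /**
     * Runs every stop call in its own task and waits at most `timeout` in total.
     * Returns the ids whose stop call finished in time, in submission order.
     */
    static List<String> stopAll(List<String> containerIds, ContainerStopper stopper, Duration timeout) {
        ExecutorService pool = Executors.newCachedThreadPool();
        List<String> stopped = new ArrayList<>();
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String id : containerIds) {
                tasks.add(() -> {
                    stopper.stop(id);
                    return id;
                });
            }
            // invokeAll cancels (interrupts) every task that has not finished by the deadline.
            for (Future<String> result : pool.invokeAll(tasks, timeout.toMillis(), TimeUnit.MILLISECONDS)) {
                if (result.isCancelled()) {
                    continue; // timed out, e.g. the NodeManager is unreachable
                }
                try {
                    stopped.add(result.get());
                } catch (ExecutionException e) {
                    // the stop call itself failed; skipped in this sketch
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
        } finally {
            pool.shutdownNow();
        }
        return stopped;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("c1@nm-ok", "c2@nm-dead", "c3@nm-ok");
        System.out.println(stopAll(ids, hangingOn("nm-dead"), Duration.ofMillis(500)));
    }
}
```

   With a serial loop, the call for `c2@nm-dead` would block and `c3@nm-ok` 
would never be reached. In the sketch, the two containers on the healthy node 
are reported as stopped and the hung call is abandoned after the timeout. Note 
the assumption that the blocked call reacts to thread interruption; a real RPC 
client may not, in which case the hung thread would linger.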


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
