KarmaGYZ commented on a change in pull request #11248: [FLINK-16299] Release
containers recovered from previous attempt in w…
URL: https://github.com/apache/flink/pull/11248#discussion_r386859531
##########
File path:
flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java
##########
@@ -464,7 +472,15 @@ public void onContainerStarted(ContainerId containerId,
Map<String, ByteBuffer>
@Override
public void onContainerStatusReceived(ContainerId containerId,
ContainerStatus containerStatus) {
- // We are not interested in getting container status
+ // We fetch the status of the container from the previous
attempts.
+ if (containerStatus.getState() == ContainerState.NEW) {
Review comment:
I just have an offline discussion with Xintong and Taoyang.
The key point of this issue is: we need to release the container which is
not started in the previous attempt or is not useable anymore. When we try to
get the status of those containers:
- If it goes into `onContainerStatusReceived`: It means that the container
has been started in the previous attempt. No matter what is the state, these
containers will be released eventually by the existing `onStartContainerError`
or `onContainersCompleted`. There is no need to handle them.
- If it goes into `onGetContainerStatusError`: It means that the container
is not started in the previous attempt or other causes like NM lost. In such
cases, the container is not useable and we need to release it and remove it
from the `workerNodeMap`.
So, I will move the release logic here to onGetContainerStatusError.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services