Github user tillrohrmann commented on a diff in the pull request:
https://github.com/apache/flink/pull/6192#discussion_r199166104
--- Diff: flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java ---
@@ -334,8 +335,11 @@ public void onContainersCompleted(final List<ContainerStatus> list) {
 	if (yarnWorkerNode != null) {
 		// Container completed unexpectedly ~> start a new one
 		final Container container = yarnWorkerNode.getContainer();
-		requestYarnContainer(container.getResource(), yarnWorkerNode.getContainer().getPriority());
-		closeTaskManagerConnection(resourceId, new Exception(containerStatus.getDiagnostics()));
+		// check WorkerRegistration status to avoid requesting containers more than required
+		if (checkWorkerRegistrationWithResourceId(resourceId)) {
--- End diff --
Yes, I think it is not possible, without some kind of state, to distinguish
between a container which was released but had not completed before a
recovery and a container which failed just after the recovery (in both cases
we retrieve the containers from the previous attempt and get an
`onContainersCompleted` signal for the container).
My question would be how often we actually run into this situation.
Releasing a container and failing immediately afterwards should happen
fairly rarely, I would assume. Moreover, at some point an idle
`TaskManager` should be released and, thus, also the underlying container.
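
To illustrate what "some kind of state" could mean here, a minimal sketch
follows. It is not code from this PR or from Flink: the class
`ReleasedContainerTracker`, its methods, and the `Action` enum are
hypothetical names, and container IDs are simplified to plain strings. The
sketch keeps the set only in memory, so to help in the recovery case it would
additionally have to be persisted across ResourceManager attempts, which is
not shown.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not part of Flink): remembers deliberately released containers. */
class ReleasedContainerTracker {

    /** Decision taken when a completion signal arrives for a container. */
    enum Action { IGNORE, REQUEST_REPLACEMENT }

    // IDs of containers that were released on purpose and whose completion
    // signal has not been seen yet. Assumption: IDs are plain strings here.
    private final Set<String> releasedContainerIds = new HashSet<>();

    /** Record that a container was released intentionally. */
    void markReleased(String containerId) {
        releasedContainerIds.add(containerId);
    }

    /** Decide how to react to a completion signal for the given container. */
    Action onContainerCompleted(String containerId) {
        // Set.remove returns true only if the ID was recorded as released,
        // so each release is matched with at most one completion signal.
        return releasedContainerIds.remove(containerId)
            ? Action.IGNORE
            : Action.REQUEST_REPLACEMENT;
    }
}
```

With such a record, a completion signal for a released container would be
ignored, while one for an unknown container would be treated as a failure
and trigger a replacement request. Whether the extra bookkeeping is worth it
depends on how often the situation described above occurs.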
---