xintongsong commented on a change in pull request #17362:
URL: https://github.com/apache/flink/pull/17362#discussion_r717176438



##########
File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/ResourceManager.java
##########
@@ -492,7 +492,8 @@ protected void onWorkerRegistered(WorkerType worker) {
 
     @Override
     public void disconnectTaskManager(final ResourceID resourceId, final 
Exception cause) {
-        closeTaskManagerConnection(resourceId, cause);
+        closeTaskManagerConnection(resourceId, 
cause).ifPresent(ResourceManager.this::stopWorker);

Review comment:
       For the record, there are several cases where a TaskManager may try to 
actively close the connection to ResourceManager:
   1. Registration failure. Most likely `closeTaskManagerConnection` should 
return `empty`, so that `stopWorker` will not be called. There could be rare 
cases where RM sees TM is registered successfully while TM does not see it 
(e.g., due to ask timeout). In such cases, the TM will be stopped and replaced.
   2. Heartbeat timeout. If RM is still functioning, the TM will be removed 
anyway. Otherwise, a new RM is expected and TM can reconnect.
   3. RM leader change. The non-leader RM will be closed and should not be able 
to remove pod / container.
   4. TM is terminating. TM termination can also hang, leading to the same 
resource leak problem if they are not stopped externally.
   
   Among the cases, trying to stop the worker after disconnecting only affects 
the rare conditions in (1) where RM and TM are out-of-sync regarding whether a 
registration is success or not. On the other hand, if we do not call 
`stopWorker` here, we may run into similar container / pod leaks if the TM 
termination stuck (4).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to