azagrebin commented on a change in pull request #10682: [FLINK-15247][Runtime] 
Wait for all slots to be free before task executor services shutdown upon 
stopping
URL: https://github.com/apache/flink/pull/10682#discussion_r365830481
 
 

 ##########
 File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java
 ##########
 @@ -376,7 +372,7 @@ private void 
handleStartTaskExecutorServicesException(Exception e) throws Except
 
                return FutureUtils
                        .runAfterwards(
-                               
taskCompletionTracker.failIncompleteTasksAndGetTerminationFuture(),
+                               taskSlotTable.freeAllSlots(new 
FlinkException("The task executor is shutting down.")),
 
 Review comment:
   Indeed, it is complicated. I tried to avoid changing too much the process of 
freeing slots. At the moment it involves quite some steps:
   
   - JM calls freeSlot and puts the slot into the releasing state
   - task update state final (TE)
   - remove task (table)
   - if slot is in releasing state and it has no running tasks
     - the async table.freeSlot and
     - notifies RM/close JM connection (slot actions impl in TE)
   
   It would be nice that `table.freeSlot` would not have to go through TE but I 
guess somehow `table.freeSlot and notifies RM/close JM connection` has to be 
scheduled async into the main thread atomically.
   
   True, it would be nicer if task management was part of the slot as task 
belongs to it then it could be tested only there and slot freeing/closing would 
guarantee interruption, now it is mixed into the table class.
   
   The first approach in this PR also kept slot/table not dealing with 
threading and futures leaving TM taking care of it. I agree that the suggested 
approach separates better responsibilities w/o changing the existing process of 
freeing slot too much and introducing async behaviour to slot/table should not 
be a big problem now.
   
   I implemented the suggested approach with `closeAsync`. Some final cleanup 
of slot/table (timeService.stop and  memoryManager.shutdown) happens in sync 
callbacks to the termination of tasks w/o scheduling it to the main thread of 
TM. Also, the final cleanup of table internal data structures after slot 
freeing potentially happens after the closing future but better to be scheduled 
before. Although this does not look implicitly a problem now, it is better to 
rethink more the design and responsibilities by the next opportunity and keep 
all actions thread-confined to avoid accidental putting of critical stuff into 
the callbacks w/o proper synchronisation. As discussed, one solution could be 
that slot/table get main thread executor in future and become more independent 
from TM.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

Reply via email to