azagrebin commented on a change in pull request #10682: [FLINK-15247][Runtime]
Wait for all slots to be free before task executor services shutdown upon
stopping
URL: https://github.com/apache/flink/pull/10682#discussion_r365830481
##########
File path:
flink-runtime/src/main/java/org/apache/flink/runtime/taskexecutor/TaskExecutor.java
##########
@@ -376,7 +372,7 @@ private void
handleStartTaskExecutorServicesException(Exception e) throws Except
return FutureUtils
.runAfterwards(
-
taskCompletionTracker.failIncompleteTasksAndGetTerminationFuture(),
+ taskSlotTable.freeAllSlots(new
FlinkException("The task executor is shutting down.")),
Review comment:
Indeed, it is complicated. I tried to avoid changing too much the process of
freeing slots. At the moment it involves quite some steps:
- JM calls freeSlot and puts the slot into the releasing state
- task update state final (TE)
- remove task (table)
- if slot is in releasing state and it has no running tasks
- the async table.freeSlot and
- notifies RM/close JM connection (slot actions impl in TE)
It would be nice that `table.freeSlot` would not have to go through TE but I
guess somehow `table.freeSlot and notifies RM/close JM connection` has to be
scheduled async into the main thread atomically.
True, it would be nicer if task management was part of the slot as task
belongs to it then it could be tested only there and slot freeing/closing would
guarantee interruption, now it is mixed into the table class.
The first approach in this PR also kept slot/table not dealing with
threading and futures leaving TM taking care of it. I agree that the suggested
approach separates better responsibilities w/o changing the existing process of
freeing slot too much and introducing async behaviour to slot/table should not
be a big problem now.
I implemented the suggested approach with `closeAsync`. Some final cleanup
of slot/table (timeService.stop and memoryManager.shutdown) happens in sync
callbacks to the termination of tasks w/o scheduling it to the main thread of
TM. Also, the final cleanup of table internal data structures after slot
freeing potentially happens after the closing future but better to be scheduled
before. Although this does not look implicitly a problem now, it is better to
rethink more the design and responsibilities by the next opportunity and keep
all actions thread-confined to avoid accidental putting of critical stuff into
the callbacks w/o proper synchronisation. As discussed, one solution could be
that slot/table get main thread executor in future and become more independent
from TM.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services