[
https://issues.apache.org/jira/browse/FLINK-18646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161825#comment-17161825
]
Andrey Zagrebin commented on FLINK-18646:
-----------------------------------------
I do not think that {{System.gc()}} is actually doing something all the time,
it is just a hint to JVM that it is allowed to spent more time for GC. We can
try to remove it to see whether it helps.
If {{tryRunPendingCleaners == true}} there will be no {{sleeps}} and the outer
loop is basically {{while (tryRunPendingCleaners)}} plus {{System.gc()}},
which is almost noop hint according to logs. Also generally, it is not optimal
for each reserve operation to wait for all cleaners by {{while
(tryRunPendingCleaners)}} without trying to reserve because the reserve
operation might want to reserve just a bit and needs to wait for a couple of
cleaners.
Did you try it with Java 11? I took the code from Java 11
{{nio.Bits#reserveMemory}}.
The cleaners are already supposed to clean in some GC background threads. By
calling {{tryRunPendingCleaners}} , we just try to speed it up giving it
another thread and more CPU if the available memory is exhausted. Therefore, it
makes sense to allocate a thread for the check because otherwise it waits for
the full GC of the slot managed memory in RPC thread. I would try this in
{{TaskSlot}}:
CompletableFuture<Void> closeAsync(Throwable cause) {
.....
ExecutorService executor = Executors.newSingleThreadExecutor(); //
<------------
final CompletableFuture<Void> cleanupFuture = FutureUtils
.waitForAll(tasks.values().stream().map(TaskSlotPayload::getTerminationFuture).collect(Collectors.toList()))
.thenRunAsync(this::verifyMemoryFreed, executor) //
<------------
.thenRunAsync(() -> {
this.memoryManager.shutdown();
executor.shutdown(); // <------------
});
....
}
> Managed memory released check can block RPC thread
> --------------------------------------------------
>
> Key: FLINK-18646
> URL: https://issues.apache.org/jira/browse/FLINK-18646
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.11.0
> Reporter: Andrey Zagrebin
> Priority: Major
>
> UnsafeMemoryBudget#verifyEmpty, called on slot freeing, needs time to wait on
> GC of all allocated/released managed memory. If there are a lot of segments
> to GC then it can take time to finish the check. If slot freeing happens in
> RPC thread, the GC waiting can block it and TM risks to miss its heartbeat.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)