azagrebin opened a new pull request #12980: URL: https://github.com/apache/flink/pull/12980
`UnsafeMemoryBudget#verifyEmpty`, called on slot freeing, has to wait for GC of all allocated/released managed memory. If there are many segments to GC, the check can take a long time to finish. If slot freeing happens in the RPC thread, the GC wait can block it and the TM risks missing its heartbeat.

Another problem is that after `UnsafeMemoryBudget#RETRIGGER_GC_AFTER_SLEEPS`, `System.gc()` is called on each attempt to run a cleaner, even if cleaners to run have already been detected. This triggers a lot of unnecessary GCs in the background.

The PR offloads the verification into a separate thread and calls `System.gc()` only if memory cannot be reserved and there are still no cleaners to run after a long wait. The timeout for normal memory reservation is increased to 2 seconds. The full reservation, used for verification, gets a 2-minute timeout.
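The two ideas in the description can be illustrated with a minimal, hedged sketch. It is not the code of this PR or of Flink: the class `AsyncVerificationSketch` and the methods `shouldTriggerGc`, `waitUntil`, and `verifyEmptyAsync` are hypothetical names, and the polling loop only stands in for the real wait on GC of released segments. It shows (1) running a blocking verification on a separate executor so the calling thread stays free, and (2) a guard that allows an explicit GC only when the reservation failed, no cleaner is pending, and enough sleeps have passed.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; none of these names are Flink APIs. */
public class AsyncVerificationSketch {

    /**
     * Assumed guard: request an explicit GC only if the reservation failed,
     * no cleaner is waiting to run, and the wait has already been long.
     */
    public static boolean shouldTriggerGc(
            boolean reservationFailed,
            boolean cleanersPending,
            int sleepsSoFar,
            int retriggerAfterSleeps) {
        return reservationFailed && !cleanersPending && sleepsSoFar >= retriggerAfterSleeps;
    }

    /**
     * Polls the condition until it holds or the timeout expires.
     * Stands in for the blocking wait on released memory.
     */
    public static boolean waitUntil(
            BooleanSupplier allMemoryReleased, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!allMemoryReleased.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out before the condition held
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    /** Runs the blocking check on the given executor instead of the caller's thread. */
    public static CompletableFuture<Boolean> verifyEmptyAsync(
            BooleanSupplier allMemoryReleased, long timeoutMillis, ExecutorService executor) {
        return CompletableFuture.supplyAsync(
                () -> waitUntil(allMemoryReleased, timeoutMillis, 10L), executor);
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Simulate memory that becomes fully released about 100 ms from now.
            long releasedAt = System.currentTimeMillis() + 100;
            CompletableFuture<Boolean> result = verifyEmptyAsync(
                    () -> System.currentTimeMillis() >= releasedAt, 2_000L, executor);
            System.out.println("caller thread continues while verification runs");
            System.out.println("verified empty: " + result.join());
        } finally {
            executor.shutdown();
        }
    }
}
```

In this sketch the caller gets a `CompletableFuture` back immediately, so a thread that also has to answer heartbeats is never parked in the polling loop; the long timeout only affects the worker thread.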