[ https://issues.apache.org/jira/browse/FLINK-15758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17026626#comment-17026626 ]
Andrey Zagrebin edited comment on FLINK-15758 at 1/30/20 12:05 PM:
-------------------------------------------------------------------
After looking more into this topic, the problem could be resolved with the
following steps:
* Simplify MemoryManager by removing KeyedBudgetManager, as the managed memory
is only off-heap now
* Implement custom unsafe memory control, similar to what the JVM does to
enforce the direct memory limit:
** Use an AtomicLong to track allocated memory, with optimistic allocation
retries (like in nio.Bits#tryReserveMemory)
** Try to speed up running the GC phantom reference cleaners with
SharedSecrets.getJavaLangRefAccess, and fall back to a full GC if allocation
still fails, before throwing OutOfMemoryError (like in nio.Bits#reserveMemory)
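A minimal sketch of the AtomicLong-based reservation with optimistic retry and a full-GC fallback could look as follows. The class name UnsafeMemoryBudget, its methods and the retry count are hypothetical and only illustrate the idea; this is not the actual MemoryManager code, and the cleaner speed-up via JavaLangRefAccess is left out here.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical budget tracker for unsafe memory; names are illustrative only. */
final class UnsafeMemoryBudget {
    // Assumption: number of back-off retries after the explicit GC request.
    private static final int MAX_SLEEPS = 5;

    private final AtomicLong available;

    UnsafeMemoryBudget(long totalSize) {
        this.available = new AtomicLong(totalSize);
    }

    long getAvailable() {
        return available.get();
    }

    /** Optimistic reservation: retry the CAS while enough budget is left. */
    boolean tryReserve(long size) {
        long current;
        while (size <= (current = available.get())) {
            if (available.compareAndSet(current, current - size)) {
                return true;
            }
        }
        return false;
    }

    /** Returns budget, e.g. from a cleaner once the owner of the memory is GC'ed. */
    void release(long size) {
        available.addAndGet(size);
    }

    /** Reserves budget, requesting a full GC and backing off before giving up. */
    void reserve(long size) {
        if (tryReserve(size)) {
            return;
        }
        // Ask for a full GC so that cleaners of unreachable segments can run.
        System.gc();
        boolean interrupted = false;
        try {
            long sleepMillis = 1;
            for (int i = 0; i < MAX_SLEEPS; i++) {
                if (tryReserve(size)) {
                    return;
                }
                try {
                    Thread.sleep(sleepMillis);
                    sleepMillis <<= 1; // exponential back-off
                } catch (InterruptedException e) {
                    interrupted = true;
                }
            }
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt();
            }
        }
        throw new OutOfMemoryError(
                "Could not reserve " + size + " bytes, available: " + available.get());
    }
}
```

The release method is meant to be called by whatever frees the unsafe memory, so that a blocked reserve call can succeed on one of its retries.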
The last step will require wrapping the JavaLangRefAccess logic in reflection
calls, as the class was relocated in Java 9 and its API has changed.
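The reflection wrapping could be sketched roughly like this. The helper name ReferenceProcessingHelper is hypothetical. The class and method names it probes (sun.misc on Java 8, jdk.internal.misc and later jdk.internal.access on newer versions) are my understanding of where SharedSecrets lives in the different JDKs and should be verified against the targeted versions. On JDKs that encapsulate these internal packages, the call simply falls through and returns false unless the package is made accessible on the command line.

```java
import java.lang.reflect.Method;

/** Hypothetical helper; probes the known locations of JavaLangRefAccess. */
final class ReferenceProcessingHelper {
    // Each entry: SharedSecrets class, JavaLangRefAccess interface, method to call.
    private static final String[][] CANDIDATES = {
        {"sun.misc.SharedSecrets", "sun.misc.JavaLangRefAccess",
            "tryHandlePendingReference"},
        {"jdk.internal.misc.SharedSecrets", "jdk.internal.misc.JavaLangRefAccess",
            "waitForReferenceProcessing"},
        {"jdk.internal.access.SharedSecrets", "jdk.internal.access.JavaLangRefAccess",
            "waitForReferenceProcessing"},
    };

    private ReferenceProcessingHelper() {}

    /**
     * Tries to let the JVM process pending references, which runs cleaners earlier.
     *
     * @return true if the JVM reported progress, false if nothing was pending or
     *     the internal API could not be reached
     */
    static boolean tryRunPendingCleaners() {
        for (String[] candidate : CANDIDATES) {
            try {
                Class<?> secrets = Class.forName(candidate[0]);
                Object access = secrets.getMethod("getJavaLangRefAccess").invoke(null);
                // Look the method up on the public interface, not on the
                // non-public implementation class of the returned object.
                Method method = Class.forName(candidate[1]).getMethod(candidate[2]);
                return (Boolean) method.invoke(access);
            } catch (ReflectiveOperationException | RuntimeException | LinkageError e) {
                // Not available or not accessible in this JVM: try the next candidate.
            }
        }
        return false;
    }
}
```

A caller would invoke tryRunPendingCleaners in a loop while it keeps returning true, retrying the reservation after each round, and only then fall back to a full GC.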
was (Author: azagrebin):
After looking more into this topic, the problem could be resolved with the
following steps:
* Simplify MemoryManager by removing KeyedBudgetManager, as the managed memory
is only off-heap now
* Implement custom unsafe memory control, similar to what the JVM does to
enforce the direct memory limit:
** Use an AtomicLong to track allocated memory, with optimistic allocation
retries (like in nio.Bits#tryReserveMemory)
** Try to speed up running the GC phantom reference cleaners with
SharedSecrets.getJavaLangRefAccess, and fall back to a full GC if allocation
fails (like in nio.Bits#reserveMemory)
> Investigate potential out-of-memory problems due to managed unsafe memory
> allocation
> ------------------------------------------------------------------------------------
>
> Key: FLINK-15758
> URL: https://issues.apache.org/jira/browse/FLINK-15758
> Project: Flink
> Issue Type: Task
> Components: API / DataSet, Runtime / Task
> Reporter: Andrey Zagrebin
> Assignee: Andrey Zagrebin
> Priority: Critical
> Fix For: 1.11.0
>
>
> After FLINK-13985, managed memory is allocated from UNSAFE, not as direct nio
> buffers as before 1.10.
> In FLINK-14894, there was an attempt to release this memory only when all
> Java handles of the unsafe memory are about to be GC'ed. This is similar to
> how it worked with direct nio buffers before 1.10, but the unsafe memory is
> not tracked by the direct memory limit (-XX:MaxDirectMemorySize). The problem
> is that over-allocating unsafe memory will not hit the direct limit and will
> therefore not immediately trigger GC, which is the only way to release it. In
> this case, allocation fails with out-of-memory errors without triggering the
> GC that could release a lot of potentially already unused memory.
> We have to investigate further optimisations, like:
> * directly monitoring the phantom reference queue of the cleaner (if the JVM
> detects quickly that there are no more references to the memory) and
> explicitly releasing memory that is ready for GC asap, e.g. after Task exit
> * monitoring the allocated memory amount and blocking allocation until GC
> releases occupied memory, instead of failing with out-of-memory immediately
> cc [~sewen] [~trohrmann]
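For the first optimisation in the description (directly monitoring the phantom reference queue), a rough sketch could look like this. The class PhantomCleanerQueue and its methods are hypothetical and not existing Flink code. It assumes the release action does not itself reference the owner object, otherwise the owner would never become phantom reachable.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical cleaner queue that releases memory of GC'ed owners on demand. */
final class PhantomCleanerQueue {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    // Keeps the phantom references themselves strongly reachable until handled.
    private final Set<CleanupRef> pending = ConcurrentHashMap.newKeySet();

    private static final class CleanupRef extends PhantomReference<Object> {
        private final Runnable releaseAction;

        CleanupRef(Object owner, ReferenceQueue<Object> queue, Runnable releaseAction) {
            super(owner, queue);
            this.releaseAction = releaseAction;
        }
    }

    /** Registers an action to run once the owner is no longer reachable. */
    Reference<Object> register(Object owner, Runnable releaseAction) {
        CleanupRef ref = new CleanupRef(owner, queue, releaseAction);
        pending.add(ref);
        return ref;
    }

    /**
     * Runs the release actions of all references the GC has enqueued so far,
     * without blocking. Could be called e.g. after Task exit.
     *
     * @return the number of release actions that were run
     */
    int drain() {
        int released = 0;
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            if (pending.remove(ref)) {
                ((CleanupRef) ref).releaseAction.run();
                released++;
            }
        }
        return released;
    }
}
```

Calling drain explicitly releases the memory of already collected owners immediately, instead of waiting for a background cleaner thread to get to it.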
--
This message was sent by Atlassian Jira
(v8.3.4#803005)