[jira] [Updated] (FLINK-15758) Investigate potential out-of-memory problems due to managed unsafe memory allocation

Andrey Zagrebin (Jira) Tue, 28 Jan 2020 07:22:29 -0800


     [ https://issues.apache.org/jira/browse/FLINK-15758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Andrey Zagrebin updated FLINK-15758:
------------------------------------
    Description: 
After FLINK-13985, managed memory is allocated via UNSAFE rather than as direct 
NIO buffers, as it was before 1.10.

FLINK-14894 attempted to release this memory only when all Java handles to the 
unsafe memory are about to be GC'ed. This is similar to how direct NIO buffers 
behaved before 1.10, except that unsafe memory is not tracked by the direct 
memory limit (-XX:MaxDirectMemorySize). The problem is that over-allocating 
unsafe memory does not hit the direct memory limit and therefore does not 
trigger a GC immediately, although GC is the only way to release the memory. In 
this case, allocation fails with out-of-memory errors without triggering the GC 
that could release a lot of potentially already unused memory.
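
The GC-bound release described above can be illustrated with a minimal sketch. 
It is not Flink's implementation: it assumes Java 9+ and uses 
java.lang.ref.Cleaner as a stand-in for the cleaner mechanism, and the names 
GcBoundMemory, Handle, OUTSTANDING and releaseNow are hypothetical. A counter 
stands in for the unsafe memory, so nothing is actually allocated off-heap.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical illustration only; not part of Flink. */
final class GcBoundMemory {
    // Stand-in for natively allocated bytes. No JVM limit such as
    // -XX:MaxDirectMemorySize observes this number, so growing it
    // never triggers a GC by itself.
    static final AtomicLong OUTSTANDING = new AtomicLong();

    private static final Cleaner CLEANER = Cleaner.create();

    /** Java-side handle to a block of "unsafe" memory. */
    static final class Handle {
        private final Cleaner.Cleanable cleanable;

        Handle(long bytes) {
            OUTSTANDING.addAndGet(bytes);
            // The cleanup action must not capture `this`; otherwise the
            // handle would stay reachable and would never be cleaned.
            this.cleanable = CLEANER.register(this, () -> OUTSTANDING.addAndGet(-bytes));
        }

        /** Explicit release; without it the bytes stay counted until GC finds the handle unreachable. */
        void releaseNow() {
            cleanable.clean(); // runs the cleanup action at most once
        }
    }
}
```

If releaseNow() is never called, OUTSTANDING only shrinks after a GC cycle has 
discovered that the handle is unreachable, which is the delay this issue is 
concerned with.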

We should investigate further optimisations, such as:
 * directly monitoring the phantom reference queue of the cleaner (if the JVM 
detects quickly that there are no more references to the memory) and explicitly 
releasing memory that is ready for GC as soon as possible, e.g. after Task exit
 * monitoring the amount of allocated memory and blocking allocation until GC 
releases the occupied memory, instead of failing with out-of-memory immediately
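
The second option could look roughly like the following sketch. It is an 
assumption about one possible design, not existing Flink code: the class 
BlockingMemoryBudget and its methods are hypothetical, the budget is plain 
bookkeeping rather than real unsafe memory, and System.gc() is only a hint that 
the JVM may ignore.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical budget for unsafe allocations; not part of Flink. */
final class BlockingMemoryBudget {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition released = lock.newCondition();
    private final long capacity;
    private long reserved;

    BlockingMemoryBudget(long capacity) {
        this.capacity = capacity;
    }

    /** Waits up to timeoutMillis for free budget; returns false on timeout or interrupt. */
    boolean reserve(long bytes, long timeoutMillis) {
        if (bytes < 0 || bytes > capacity) {
            return false; // can never be satisfied
        }
        long nanosLeft = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            while (capacity - reserved < bytes) {
                if (nanosLeft <= 0) {
                    return false; // give up instead of waiting forever
                }
                System.gc(); // hint only: may let cleaners call release()
                nanosLeft = released.awaitNanos(nanosLeft);
            }
            reserved += bytes;
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        } finally {
            lock.unlock();
        }
    }

    /** Called by an explicit release or by a cleaner action once GC frees a handle. */
    void release(long bytes) {
        lock.lock();
        try {
            reserved = Math.max(0L, reserved - bytes);
            released.signalAll(); // wake up blocked reserve() calls
        } finally {
            lock.unlock();
        }
    }

    long available() {
        lock.lock();
        try {
            return capacity - reserved;
        } finally {
            lock.unlock();
        }
    }
}
```

A caller would invoke reserve() before allocating and treat a false result as 
the out-of-memory case, so the failure only happens after GC has had a chance 
to return memory within the timeout.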

cc [~sewen] [~trohrmann]

  was:
After FLINK-13985, managed memory is allocated via UNSAFE rather than as direct 
NIO buffers, as it was before 1.10.

After FLINK-14894, this memory is released only when all Java handles to the 
unsafe memory are about to be GC'ed. This is similar to how direct NIO buffers 
behaved before 1.10, except that unsafe memory is not tracked by the direct 
memory limit (-XX:MaxDirectMemorySize). The potential downside is that 
over-allocating unsafe memory does not hit the direct memory limit and does not 
trigger a GC immediately, although GC is the only way to release the memory. In 
this case, it can cause out-of-memory failures without triggering the GC that 
could release a lot of potentially already unused memory.

If we verify that the delayed release is a problem, we can investigate further 
optimisations, such as:
 * directly monitoring the phantom reference queue of the cleaner (if the JVM 
detects quickly that there are no more references to the memory) and explicitly 
releasing memory that is ready for GC as soon as possible, e.g. after Task exit
 * monitoring the amount of allocated memory and blocking allocation until GC 
releases the occupied memory, instead of failing with out-of-memory immediately

cc [~sewen] [~trohrmann]


> Investigate potential out-of-memory problems due to managed unsafe memory 
> allocation
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-15758
>                 URL: https://issues.apache.org/jira/browse/FLINK-15758
>             Project: Flink
>          Issue Type: Task
>          Components: API / DataSet, Runtime / Task
>            Reporter: Andrey Zagrebin
>            Priority: Critical
>             Fix For: 1.11.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
