[jira] [Commented] (FLINK-13477) Containerized TaskManager killed because of lack of memory overhead

Xintong Song (JIRA) Fri, 02 Aug 2019 12:32:36 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-13477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899158#comment-16899158
 ]


Xintong Song commented on FLINK-13477:
--------------------------------------

Hi [~b.hanotte],

I'm not against making changes before the FLIP is implemented. I'm just trying 
to provide some related information.

The FLIP I mentioned is actually planned for release 1.10. Since release 1.9 is 
already frozen, the earliest release that could include the changes from this 
issue is also 1.10. So maybe it makes sense to wait a bit for the FLIP doc and 
see how it works with this issue. The situation I'm trying to avoid is that we 
make these changes now and then soon have to rework them for the FLIP, before 
they have even shipped in any release.

It is also possible that, after a full discussion and vote in the community, 
we decide not to accept the FLIP or postpone it to later releases. In that 
case, this issue should still be a good alternative solution for the next 
version.

> Containerized TaskManager killed because of lack of memory overhead
> -------------------------------------------------------------------
>
>                 Key: FLINK-13477
>                 URL: https://issues.apache.org/jira/browse/FLINK-13477
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / Mesos, Deployment / YARN
>    Affects Versions: 1.9.0
>            Reporter: Benoit Hanotte
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, the `-XX:MaxDirectMemorySize` parameter is set as:
> `MaxDirectMemorySize = containerMemoryMB - heapSizeMB`
> (see 
> [https://github.com/apache/flink/blob/7fec4392b21b07c69ba15ea554731886f181609e/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ContaineredTaskManagerParameters.java#L162])
> However, as explained at
>  https://docs.oracle.com/javase/8/docs/technotes/tools/unix/java.html,
> `MaxDirectMemorySize` only sets the maximum amount of memory that can be
> used for direct buffers, thus the amount of off-heap memory used can be
> greater than that value, leading to the container being killed by Mesos
> or Yarn as it exceeds the allocated memory.
> In addition, users might want to allocate off-heap memory through native
> code, in which case they will want to keep some of the container memory
> free and unallocated by Flink.
> To work around this issue, we currently set the following parameter:
> {code:java}
> -Dcontainerized.taskmanager.env.FLINK_ENV_JAVA_OPTS='-XX:MaxDirectMemorySize=600m'
> {code}
> which overrides the value that Flink picks (744M in this case) with a lower 
> one to keep some overhead memory in the TaskManager containers. However, this 
> is an "ugly" hack, as it circumvents the clever memory allocation that Flink 
> performs and makes it possible to bypass the sanity checks done in 
> `ContaineredTaskManagerParameters`.
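
As a rough illustration of the arithmetic in the quoted description, the sketch below applies the formula `MaxDirectMemorySize = containerMemoryMB - heapSizeMB` and shows how much container memory stays unclaimed once the limit is lowered. The class and method names are hypothetical and are not part of Flink. The container size (2048 MB) and heap size (1304 MB) are assumed values, chosen only so that the default comes out at the 744 MB mentioned in the description. Only the 744 MB and 600 MB figures come from the issue itself.

```java
public class DirectMemoryHeadroom {

    // Formula quoted in the issue: everything outside the heap is granted to direct buffers.
    static long defaultMaxDirectMemoryMB(long containerMemoryMB, long heapSizeMB) {
        return containerMemoryMB - heapSizeMB;
    }

    // Memory left in the container once heap and the direct-buffer limit are accounted for.
    static long headroomMB(long containerMemoryMB, long heapSizeMB, long maxDirectMemoryMB) {
        return containerMemoryMB - heapSizeMB - maxDirectMemoryMB;
    }

    public static void main(String[] args) {
        long containerMemoryMB = 2048; // assumed container size, not taken from the issue
        long heapSizeMB = 1304;        // assumed heap size, picked so the default limit is 744 MB

        long defaultLimit = defaultMaxDirectMemoryMB(containerMemoryMB, heapSizeMB);
        long overriddenLimit = 600;    // value passed via -XX:MaxDirectMemorySize=600m in the issue

        System.out.println("default MaxDirectMemorySize (MB): " + defaultLimit);
        System.out.println("headroom with default limit (MB): "
                + headroomMB(containerMemoryMB, heapSizeMB, defaultLimit));
        System.out.println("headroom with 600m override (MB): "
                + headroomMB(containerMemoryMB, heapSizeMB, overriddenLimit));
    }
}
```

With the default limit the headroom is zero by construction, which is why any other off-heap usage (native allocations, JVM internals) can push the container past its allocation. Lowering the limit is what leaves the remainder free.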



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
