[
https://issues.apache.org/jira/browse/FLINK-18353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17140682#comment-17140682
]
Andrey Zagrebin edited comment on FLINK-18353 at 6/19/20, 5:12 PM:
-------------------------------------------------------------------
I agree we should add this to the migration guide if we decide to keep the
limit. The release notes for the JM already contain this; we can also add it
to the TM release notes.
Regarding whether to limit direct memory, we could set the limit only if the
user explicitly configures the off-heap memory. The user could then set the
option to investigate OOMs. By default, we could account for the off-heap
value in the total memory but not set the JVM limit. This is not aligned with
the TM, though. I am not sure how big a problem that is: expecting the limit
to be set by default while investigating OOMs? Maybe. We could just align
them, but the risk for user code is bigger in the TM, imo.
The safest option is to keep the limit, as already mentioned. On the other
hand, I am fine with removing the limit if we agree that the probability of a
leak is low. Flink does not use direct memory explicitly in the JM, only
Akka/Netty do, and the probability of a leak there is low. User code will
probably not use direct memory heavily in JM hooks, but we cannot control
leaks there.
The same could hold for Metaspace, but there Flink code is probably the main
consumer, with no leaks, so the Metaspace size should be almost the same for
all use cases. Therefore, we can keep that limit for safety.
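As a hedged illustration of the kind of OOM investigation discussed above
(this is plain JDK code, not Flink code), the direct-memory consumption that
`-XX:MaxDirectMemorySize` caps can be inspected at runtime via the standard
`BufferPoolMXBean` for the "direct" pool:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.List;

public class DirectMemoryProbe {
    public static void main(String[] args) {
        // Allocate some direct memory, as Netty/Akka would inside the JM.
        ByteBuffer buf = ByteBuffer.allocateDirect(16 * 1024 * 1024);

        // The "direct" buffer pool MXBean reports the usage that is bounded
        // by -XX:MaxDirectMemorySize (when that flag is set).
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            if ("direct".equals(pool.getName())) {
                System.out.println("direct buffers: " + pool.getCount());
                System.out.println("bytes used:     " + pool.getMemoryUsed());
            }
        }
    }
}
```

Watching this number grow without bound is one way to tell a genuine leak
apart from a limit that is simply configured too low.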
> [1.11.0] maybe document jobmanager behavior change regarding
> -XX:MaxDirectMemorySize
> ------------------------------------------------------------------------------------
>
> Key: FLINK-18353
> URL: https://issues.apache.org/jira/browse/FLINK-18353
> Project: Flink
> Issue Type: Improvement
> Components: Documentation, Runtime / Configuration
> Affects Versions: 1.11.0
> Reporter: Steven Zhen Wu
> Priority: Major
>
> I know FLIP-116 (Unified Memory Configuration for Job Managers) was introduced
> in 1.11. That causes a small behavior change regarding
> `-XX:MaxDirectMemorySize`. Previously, the jobmanager did not set the JVM arg
> `-XX:MaxDirectMemorySize`, which meant the JVM could use up to the `-Xmx` size
> for direct memory. Now `-XX:MaxDirectMemorySize` is always set to the
> [jobmanager.memory.off-heap.size|https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#jobmanager-memory-off-heap-size]
> config (default 128 mb).
>
> It is possible for the jobmanager to get "java.lang.OutOfMemoryError: Direct
> buffer memory" without tuning
> [jobmanager.memory.off-heap.size|https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#jobmanager-memory-off-heap-size],
> especially for high-parallelism jobs. Previously, no tuning was needed.
>
> Maybe we should point out the behavior change in the migration guide?
> [https://ci.apache.org/projects/flink/flink-docs-master/ops/memory/mem_migration.html#migrate-job-manager-memory-configuration]
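The workaround the issue describes is raising the off-heap option, which in
1.11+ is passed to the JM JVM as `-XX:MaxDirectMemorySize`. A hedged sketch
of a `flink-conf.yaml` fragment (the value is illustrative, not a
recommendation):

```yaml
# flink-conf.yaml -- illustrative value only; size it from observed usage.
# In Flink 1.11+, this value becomes the JM's -XX:MaxDirectMemorySize.
jobmanager.memory.off-heap.size: 256m
```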
--
This message was sent by Atlassian Jira
(v8.3.4#803005)