[
https://issues.apache.org/jira/browse/FLINK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17242974#comment-17242974
]
Xintong Song commented on FLINK-15906:
--------------------------------------
The exception suggests that the task manager is consuming more memory than
expected.
A Java program may consume several types of memory: heap, direct, native, and
metaspace. For all of these types except native memory, Flink sets explicit
upper limits via JVM parameters, so that an `OutOfMemoryError` is thrown if
the process tries to use more memory than the limit. Since no OOM was thrown,
the only possibility is that Flink is using more native memory than planned.
By increasing the JVM overhead, Flink will reserve more native memory in the
container. The extra memory may not actually be used by the JVM as overhead,
but it should help with your problem.
BTW, did it solve your problem?
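For example, the JVM overhead can be raised in `flink-conf.yaml` via the following options (the values below are only illustrative; pick sizes that fit your container budget):
{code:yaml}
# Reserve a larger slice of the container for native memory that Flink
# does not account for elsewhere (glibc arenas, JNI libraries, etc.).
taskmanager.memory.jvm-overhead.min: 1g
taskmanager.memory.jvm-overhead.max: 2g
# Alternatively, raise the fraction of total process memory used as overhead.
taskmanager.memory.jvm-overhead.fraction: 0.2
{code}
Note that the effective overhead is the fraction of the total process size, clamped between the min and max values.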
> physical memory exceeded causing being killed by yarn
> -----------------------------------------------------
>
> Key: FLINK-15906
> URL: https://issues.apache.org/jira/browse/FLINK-15906
> Project: Flink
> Issue Type: Bug
> Components: Deployment / YARN
> Reporter: liupengcheng
> Priority: Major
>
> Recently, we encountered this issue when testing a TPC-DS query with 100 GB of data.
> I first met this issue when I only set
> `taskmanager.memory.total-process.size` to `4g` with the `-tm` option. Then I tried
> to increase the JVM overhead size with the following configuration, but it still failed.
> {code:java}
> taskmanager.memory.jvm-overhead.min: 640m
> taskmanager.memory.jvm-metaspace: 128m
> taskmanager.memory.task.heap.size: 1408m
> taskmanager.memory.framework.heap.size: 128m
> taskmanager.memory.framework.off-heap.size: 128m
> taskmanager.memory.managed.size: 1408m
> taskmanager.memory.shuffle.max: 256m
> {code}
> {code:java}
> java.lang.Exception: [2020-02-05 11:31:32.345]Container
> [pid=101677,containerID=container_e08_1578903621081_4785_01_000051] is
> running 46342144B beyond the 'PHYSICAL' memory limit. Current usage: 4.04 GB
> of 4 GB physical memory used; 17.68 GB of 40 GB virtual memory used. Killing
> container.
> Dump of the process-tree for
> container_e08_1578903621081_4785_01_000051 : |- PID PPID PGRPID SESSID
> CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES)
> RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 101938 101677 101677 101677 (java) 25762
> 3571 18867417088 1059157 /opt/soft/openjdk1.8.0/bin/java
> -Dhadoop.root.logfile=syslog -Xmx1610612736 -Xms1610612736
> -XX:MaxDirectMemorySize=402653184 -XX:MaxMetaspaceSize=134217728
> -Dlog.file=/home/work/hdd5/yarn/zjyprc-analysis/nodemanager/application_1578903621081_4785/container_e08_1578903621081_4785_01_000051/taskmanager.log
> -Dlog4j.configuration=file:./log4j.properties
> org.apache.flink.yarn.YarnTaskExecutorRunner -D
> taskmanager.memory.shuffle.max=268435456b -D
> taskmanager.memory.framework.off-heap.size=134217728b -D
> taskmanager.memory.framework.heap.size=134217728b -D
> taskmanager.memory.managed.size=1476395008b -D taskmanager.cpu.cores=1.0 -D
> taskmanager.memory.task.heap.size=1476395008b -D
> taskmanager.memory.task.off-heap.size=0b -D
> taskmanager.memory.shuffle.min=268435456b --configDir .
> -Djobmanager.rpc.address=zjy-hadoop-prc-st2805.bj -Dweb.port=0
> -Dweb.tmpdir=/tmp/flink-web-4bf6cd3a-a6e1-4b46-b140-b8ac7bdffbeb
> -Djobmanager.rpc.port=36769 -Dtaskmanager.memory.managed.size=1476395008b
> -Drest.address=zjy-hadoop-prc-st2805.bj |- 101677 101671 101677 101677 (bash)
> 1 1 118030336 733 /bin/bash -c /opt/soft/openjdk1.8.0/bin/java
> -Dhadoop.root.logfile=syslog -Xmx1610612736 -Xms1610612736
> -XX:MaxDirectMemorySize=402653184 -XX:MaxMetaspaceSize=134217728
> -Dlog.file=/home/work/hdd5/yarn/zjyprc-analysis/nodemanager/application_1578903621081_4785/container_e08_1578903621081_4785_01_000051/taskmanager.log
> -Dlog4j.configuration=file:./log4j.properties
> org.apache.flink.yarn.YarnTaskExecutorRunner -D
> taskmanager.memory.shuffle.max=268435456b -D
> taskmanager.memory.framework.off-heap.size=134217728b -D
> taskmanager.memory.framework.heap.size=134217728b -D
> taskmanager.memory.managed.size=1476395008b -D taskmanager.cpu.cores=1.0 -D
> taskmanager.memory.task.heap.size=1476395008b -D
> taskmanager.memory.task.off-heap.size=0b -D
> taskmanager.memory.shuffle.min=268435456b --configDir .
> -Djobmanager.rpc.address=zjy-hadoop-prc-st2805.bj -Dweb.port=0
> -Dweb.tmpdir=/tmp/flink-web-4bf6cd3a-a6e1-4b46-b140-b8ac7bdffbeb
> -Djobmanager.rpc.port=36769 -Dtaskmanager.memory.managed.size=1476395008b
> -Drest.address=zjy-hadoop-prc-st2805.bj 1>
> /home/work/hdd5/yarn/zjyprc-analysis/nodemanager/application_1578903621081_4785/container_e08_1578903621081_4785_01_000051/taskmanager.out
> 2>
> /home/work/hdd5/yarn/zjyprc-analysis/nodemanager/application_1578903621081_4785/container_e08_1578903621081_4785_01_000051/taskmanager.err
> {code}
> I suspect there are some leaks or unexpected off-heap memory usage.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)