[ 
https://issues.apache.org/jira/browse/FLINK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031569#comment-17031569
 ] 

Yang Wang commented on FLINK-15906:
-----------------------------------

Even if the TaskManager is killed by Yarn for exceeding the memory limit, that 
does not necessarily mean there is a memory leak. Heap memory and direct memory 
are bounded by the JVM, so the more likely cause is that not enough memory was 
reserved for native memory. First, I suggest increasing the jvm-overhead enough 
that the TaskManager is no longer killed. Then use native memory tracking [1] 
to debug the memory usage. I think it will show you where the memory is going.

 

[1] https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr007.html
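
To make the point about native memory concrete, here is a rough sketch that adds up the memory components visible in the quoted container command line and compares the total with the 4 GB container limit. The byte values are copied from the quoted dump. Treating managed memory as a separate budget outside both the heap and the direct memory limit is an assumption of this sketch, and the variable names are illustrative only.

```python
# Rough budget check: how much of the 4 GiB container is left for
# everything the JVM flags do not bound (thread stacks, code cache,
# GC structures, native libraries, ...).
MIB = 1024 * 1024

# Container limit reported by Yarn ("4 GB physical memory").
container_limit = 4 * 1024 * MIB

# Values in bytes, copied from the quoted java command line.
# Assumption: managed memory is counted separately from heap and direct memory.
budget = {
    "heap (-Xmx)": 1610612736,
    "direct (-XX:MaxDirectMemorySize)": 402653184,
    "metaspace (-XX:MaxMetaspaceSize)": 134217728,
    "managed (taskmanager.memory.managed.size)": 1476395008,
}

accounted = sum(budget.values())
headroom = container_limit - accounted

for name, size in budget.items():
    print(f"{name:45s} {size // MIB:5d} MiB")
print(f"{'accounted':45s} {accounted // MIB:5d} MiB")
print(f"{'left for native memory / jvm-overhead':45s} {headroom // MIB:5d} MiB")
```

Under these assumptions the bounded components add up to 3456 MiB, which leaves 640 MiB for all remaining native memory. That matches the configured `taskmanager.memory.jvm-overhead.min: 640m`, so any native usage above that amount pushes the process over the container limit.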

> physical memory exceeded causing being killed by yarn
> -----------------------------------------------------
>
>                 Key: FLINK-15906
>                 URL: https://issues.apache.org/jira/browse/FLINK-15906
>             Project: Flink
>          Issue Type: Bug
>            Reporter: liupengcheng
>            Priority: Major
>
> Recently, we encountered this issue when testing TPC-DS queries with 100g data. 
> I first hit this issue when I only set 
> `taskmanager.memory.total-process.size` to `4g` with the `-tm` option. Then I tried 
> to increase the jvmOverhead size with the following arguments, but it still failed.
> {code:java}
> taskmanager.memory.jvm-overhead.min: 640m
> taskmanager.memory.jvm-metaspace: 128m
> taskmanager.memory.task.heap.size: 1408m
> taskmanager.memory.framework.heap.size: 128m
> taskmanager.memory.framework.off-heap.size: 128m
> taskmanager.memory.managed.size: 1408m
> taskmanager.memory.shuffle.max: 256m
> {code}
> {code:java}
> java.lang.Exception: [2020-02-05 11:31:32.345]Container 
> [pid=101677,containerID=container_e08_1578903621081_4785_01_000051] is 
> running 46342144B beyond the 'PHYSICAL' memory limit. Current usage: 4.04 GB 
> of 4 GB physical memory used; 17.68 GB of 40 GB virtual memory used. Killing 
> container.java.lang.Exception: [2020-02-05 11:31:32.345]Container 
> [pid=101677,containerID=container_e08_1578903621081_4785_01_000051] is 
> running 46342144B beyond the 'PHYSICAL' memory limit. Current usage: 4.04 GB 
> of 4 GB physical memory used; 17.68 GB of 40 GB virtual memory used. Killing 
> container.Dump of the process-tree for 
> container_e08_1578903621081_4785_01_000051 : |- PID PPID PGRPID SESSID 
> CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) 
> RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 101938 101677 101677 101677 (java) 25762 
> 3571 18867417088 1059157 /opt/soft/openjdk1.8.0/bin/java 
> -Dhadoop.root.logfile=syslog -Xmx1610612736 -Xms1610612736 
> -XX:MaxDirectMemorySize=402653184 -XX:MaxMetaspaceSize=134217728 
> -Dlog.file=/home/work/hdd5/yarn/zjyprc-analysis/nodemanager/application_1578903621081_4785/container_e08_1578903621081_4785_01_000051/taskmanager.log
>  -Dlog4j.configuration=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskExecutorRunner -D 
> taskmanager.memory.shuffle.max=268435456b -D 
> taskmanager.memory.framework.off-heap.size=134217728b -D 
> taskmanager.memory.framework.heap.size=134217728b -D 
> taskmanager.memory.managed.size=1476395008b -D taskmanager.cpu.cores=1.0 -D 
> taskmanager.memory.task.heap.size=1476395008b -D 
> taskmanager.memory.task.off-heap.size=0b -D 
> taskmanager.memory.shuffle.min=268435456b --configDir . 
> -Djobmanager.rpc.address=zjy-hadoop-prc-st2805.bj -Dweb.port=0 
> -Dweb.tmpdir=/tmp/flink-web-4bf6cd3a-a6e1-4b46-b140-b8ac7bdffbeb 
> -Djobmanager.rpc.port=36769 -Dtaskmanager.memory.managed.size=1476395008b 
> -Drest.address=zjy-hadoop-prc-st2805.bj |- 101677 101671 101677 101677 (bash) 
> 1 1 118030336 733 /bin/bash -c /opt/soft/openjdk1.8.0/bin/java 
> -Dhadoop.root.logfile=syslog -Xmx1610612736 -Xms1610612736 
> -XX:MaxDirectMemorySize=402653184 -XX:MaxMetaspaceSize=134217728 
> -Dlog.file=/home/work/hdd5/yarn/zjyprc-analysis/nodemanager/application_1578903621081_4785/container_e08_1578903621081_4785_01_000051/taskmanager.log
>  -Dlog4j.configuration=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskExecutorRunner -D 
> taskmanager.memory.shuffle.max=268435456b -D 
> taskmanager.memory.framework.off-heap.size=134217728b -D 
> taskmanager.memory.framework.heap.size=134217728b -D 
> taskmanager.memory.managed.size=1476395008b -D taskmanager.cpu.cores=1.0 -D 
> taskmanager.memory.task.heap.size=1476395008b -D 
> taskmanager.memory.task.off-heap.size=0b -D 
> taskmanager.memory.shuffle.min=268435456b --configDir . 
> -Djobmanager.rpc.address=zjy-hadoop-prc-st2805.bj -Dweb.port=0 
> -Dweb.tmpdir=/tmp/flink-web-4bf6cd3a-a6e1-4b46-b140-b8ac7bdffbeb 
> -Djobmanager.rpc.port=36769 -Dtaskmanager.memory.managed.size=1476395008b 
> -Drest.address=zjy-hadoop-prc-st2805.bj 1> 
> /home/work/hdd5/yarn/zjyprc-analysis/nodemanager/application_1578903621081_4785/container_e08_1578903621081_4785_01_000051/taskmanager.out
>  2> 
> /home/work/hdd5/yarn/zjyprc-analysis/nodemanager/application_1578903621081_4785/container_e08_1578903621081_4785_01_000051/taskmanager.err
> {code}
> I suspect there is a leak or some unexpected off-heap memory usage.
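
The overrun that Yarn reports can be reproduced from the process-tree dump above. The sketch below sums the RSSMEM_USAGE(PAGES) column for the two listed processes and converts it to bytes. The 4 KiB page size is an assumption (the dump does not state it), and the names are illustrative only.

```python
# Reproduce Yarn's "running ... beyond the 'PHYSICAL' memory limit" figure
# from the RSS page counts in the process-tree dump.
PAGE_SIZE = 4096  # assumption: 4 KiB pages on the NodeManager host

# RSSMEM_USAGE(PAGES) values copied from the dump.
rss_pages = {
    "java (pid 101938)": 1059157,
    "bash (pid 101677)": 733,
}

limit_bytes = 4 * 1024 ** 3  # 4 GB container limit
usage_bytes = sum(rss_pages.values()) * PAGE_SIZE
overrun = usage_bytes - limit_bytes

print(f"usage:   {usage_bytes / 1024 ** 3:.2f} GB")
print(f"overrun: {overrun} B ({overrun / 1024 ** 2:.1f} MiB)")
```

With a 4 KiB page size the overrun comes out at 46342144 B, the same figure as in the Yarn message. The limit applies to the resident set size of the whole process tree, not to any single JVM memory pool.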



