[jira] [Commented] (FLINK-15906) physical memory exceeded causing being killed by yarn

Xintong Song (Jira) Wed, 05 Feb 2020 19:51:41 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17031251#comment-17031251
 ]


Xintong Song commented on FLINK-15906:
--------------------------------------

The launching command shows that the calculated memory sizes and JVM arguments 
aligns with the configuration you provided.

[~lzljs3620320], could you share some experience here? Have you ever run into 
such problems when running TPC-DS? And what are your memory configurations when 
running TPC-DS?

[~liupengcheng], I have a few further questions.
- How often do you run into this issue? Always, frequently, occasionally, or 
just once? 
- When exactly does this happen? Is it right after the cluster is started, 
right after starting to process TPC-DS queries, running the queries for a 
while, or always happen when executing some specific the queries?

> physical memory exceeded causing being killed by yarn
> -----------------------------------------------------
>
>                 Key: FLINK-15906
>                 URL: https://issues.apache.org/jira/browse/FLINK-15906
>             Project: Flink
>          Issue Type: Bug
>            Reporter: liupengcheng
>            Priority: Major
>
> Recently, we encoutered this issue when testing TPCDS query with 100g data. 
> I first meet this issue when I only set the 
> `taskmanager.memory.total-process.size` to `4g` with `-tm` option. Then I try 
> to increase the jvmOverhead size with following arguments, but still failed.
> {code:java}
> taskmanager.memory.jvm-overhead.min: 640m
> taskmanager.memory.jvm-metaspace: 128m
> taskmanager.memory.task.heap.size: 1408m
> taskmanager.memory.framework.heap.size: 128m
> taskmanager.memory.framework.off-heap.size: 128m
> taskmanager.memory.managed.size: 1408m
> taskmanager.memory.shuffle.max: 256m
> {code}
> {code:java}
> java.lang.Exception: [2020-02-05 11:31:32.345]Container 
> [pid=101677,containerID=container_e08_1578903621081_4785_01_000051] is 
> running 46342144B beyond the 'PHYSICAL' memory limit. Current usage: 4.04 GB 
> of 4 GB physical memory used; 17.68 GB of 40 GB virtual memory used. Killing 
> container.java.lang.Exception: [2020-02-05 11:31:32.345]Container 
> [pid=101677,containerID=container_e08_1578903621081_4785_01_000051] is 
> running 46342144B beyond the 'PHYSICAL' memory limit. Current usage: 4.04 GB 
> of 4 GB physical memory used; 17.68 GB of 40 GB virtual memory used. Killing 
> container.Dump of the process-tree for 
> container_e08_1578903621081_4785_01_000051 : |- PID PPID PGRPID SESSID 
> CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) 
> RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 101938 101677 101677 101677 (java) 25762 
> 3571 18867417088 1059157 /opt/soft/openjdk1.8.0/bin/java 
> -Dhadoop.root.logfile=syslog -Xmx1610612736 -Xms1610612736 
> -XX:MaxDirectMemorySize=402653184 -XX:MaxMetaspaceSize=134217728 
> -Dlog.file=/home/work/hdd5/yarn/zjyprc-analysis/nodemanager/application_1578903621081_4785/container_e08_1578903621081_4785_01_000051/taskmanager.log
>  -Dlog4j.configuration=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskExecutorRunner -D 
> taskmanager.memory.shuffle.max=268435456b -D 
> taskmanager.memory.framework.off-heap.size=134217728b -D 
> taskmanager.memory.framework.heap.size=134217728b -D 
> taskmanager.memory.managed.size=1476395008b -D taskmanager.cpu.cores=1.0 -D 
> taskmanager.memory.task.heap.size=1476395008b -D 
> taskmanager.memory.task.off-heap.size=0b -D 
> taskmanager.memory.shuffle.min=268435456b --configDir . 
> -Djobmanager.rpc.address=zjy-hadoop-prc-st2805.bj -Dweb.port=0 
> -Dweb.tmpdir=/tmp/flink-web-4bf6cd3a-a6e1-4b46-b140-b8ac7bdffbeb 
> -Djobmanager.rpc.port=36769 -Dtaskmanager.memory.managed.size=1476395008b 
> -Drest.address=zjy-hadoop-prc-st2805.bj |- 101677 101671 101677 101677 (bash) 
> 1 1 118030336 733 /bin/bash -c /opt/soft/openjdk1.8.0/bin/java 
> -Dhadoop.root.logfile=syslog -Xmx1610612736 -Xms1610612736 
> -XX:MaxDirectMemorySize=402653184 -XX:MaxMetaspaceSize=134217728 
> -Dlog.file=/home/work/hdd5/yarn/zjyprc-analysis/nodemanager/application_1578903621081_4785/container_e08_1578903621081_4785_01_000051/taskmanager.log
>  -Dlog4j.configuration=file:./log4j.properties 
> org.apache.flink.yarn.YarnTaskExecutorRunner -D 
> taskmanager.memory.shuffle.max=268435456b -D 
> taskmanager.memory.framework.off-heap.size=134217728b -D 
> taskmanager.memory.framework.heap.size=134217728b -D 
> taskmanager.memory.managed.size=1476395008b -D taskmanager.cpu.cores=1.0 -D 
> taskmanager.memory.task.heap.size=1476395008b -D 
> taskmanager.memory.task.off-heap.size=0b -D 
> taskmanager.memory.shuffle.min=268435456b --configDir . 
> -Djobmanager.rpc.address=zjy-hadoop-prc-st2805.bj -Dweb.port=0 
> -Dweb.tmpdir=/tmp/flink-web-4bf6cd3a-a6e1-4b46-b140-b8ac7bdffbeb 
> -Djobmanager.rpc.port=36769 -Dtaskmanager.memory.managed.size=1476395008b 
> -Drest.address=zjy-hadoop-prc-st2805.bj 1> 
> /home/work/hdd5/yarn/zjyprc-analysis/nodemanager/application_1578903621081_4785/container_e08_1578903621081_4785_01_000051/taskmanager.out
>  2> 
> /home/work/hdd5/yarn/zjyprc-analysis/nodemanager/application_1578903621081_4785/container_e08_1578903621081_4785_01_000051/taskmanager.err
> {code}
> I suspect there are some leaks or unexpected offheap memory usage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (FLINK-15906) physical memory exceeded causing being killed by yarn

Reply via email to