[ https://issues.apache.org/jira/browse/FLINK-10928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707317#comment-16707317 ]
Dawid Wysakowicz edited comment on FLINK-10928 at 12/3/18 2:57 PM:
-------------------------------------------------------------------

Hi [~djharper]

Were you able to figure out the issue? I would say there are two separate problems:

1. Ever-growing metaspace size resulting in YARN containers being killed - could you provide us with a heap dump of your job, so that we could try to analyze why the classes are not being GCed?
2. Connection problem that results in job restarts - I would tackle this problem after resolving the first one.
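(Not part of the original comment - a minimal, hypothetical sketch to illustrate the kind of evidence that would help with problem 1. The class name MetaspaceProbe is made up, and the pool names assume a HotSpot JVM; it only uses the standard JMX beans to log Metaspace / Compressed Class Space usage and class load/unload counts. A steadily rising loaded-class count with almost no unloads would support the theory that classes are not being GCed, before taking a full heap dump.)

{code:java}
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

// Hypothetical helper: logs class-metadata memory pools and class counts
// so that ever-growing metaspace can be confirmed from inside the job.
public class MetaspaceProbe {

    public static void logClassMetadata() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // HotSpot pool names; other JVMs may use different names.
            if ("Metaspace".equals(name) || "Compressed Class Space".equals(name)) {
                System.out.printf("%s: used=%,d bytes, committed=%,d bytes%n",
                        name, pool.getUsage().getUsed(), pool.getUsage().getCommitted());
            }
        }
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        System.out.printf("classes: totalLoaded=%d, unloaded=%d, currentlyLoaded=%d%n",
                cl.getTotalLoadedClassCount(), cl.getUnloadedClassCount(), cl.getLoadedClassCount());
    }

    public static void main(String[] args) {
        logClassMetadata();
    }
}
{code}

Once the growth is confirmed, a heap dump taken with standard JDK tooling would show which classloaders keep the classes reachable.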
> Job unable to stabilise after restart
> --------------------------------------
>
>                 Key: FLINK-10928
>                 URL: https://issues.apache.org/jira/browse/FLINK-10928
>             Project: Flink
>          Issue Type: Bug
>       Environment: AWS EMR 5.17.0
> FLINK 1.5.2
> BEAM 2.7.0
>            Reporter: Daniel Harper
>            Priority: Major
>       Attachments: Screen Shot 2018-11-16 at 15.49.03.png, Screen Shot 2018-11-16 at 15.49.15.png, ants-CopyofThe'death'spiralincident-191118-1231-1332.pdf
>
> We've seen a few instances of this occurring in production now (it's difficult to reproduce).
> I've attached a timeline of events as a PDF here [^ants-CopyofThe'death'spiralincident-191118-1231-1332.pdf], but essentially it boils down to:
> 1. Job restarts due to an exception
> 2. Job restores from a checkpoint but we get the exception
> {code}
> Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
> {code}
> 3. Job restarts
> 4. Job restores from a checkpoint but we get the same exception
> .... repeat a few times within 2-3 minutes ....
> 5. YARN kills containers with out of memory
> {code}
> 2018-11-14 00:16:04,430 INFO  org.apache.flink.yarn.YarnResourceManager - Closing TaskExecutor connection container_1541433014652_0001_01_000716 because: Container [pid=7725,containerID=container_1541433014652_0001_01_000716] is running beyond physical memory limits. Current usage: 6.4 GB of 6.4 GB physical memory used; 8.4 GB of 31.9 GB virtual memory used. Killing container.
> Dump of the process-tree for container_1541433014652_0001_01_000716 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 7725 7723 7725 7725 (bash) 0 0 115863552 696 /bin/bash -c /usr/lib/jvm/java-openjdk/bin/java -Xms4995m -Xmx4995m -XX:MaxDirectMemorySize=1533m -Xloggc:/var/log/hadoop-yarn/flink_gc_container_1541433014652_0001_%p.log -XX:GCLogFileSize=200M -XX:NumberOfGCLogFiles=10 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCCause -XX:+PrintGCDateStamps -XX:+UseG1GC -Dlog.file=/var/log/hadoop-yarn/containers/application_1541433014652_0001/container_1541433014652_0001_01_000716/taskmanager.log -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> /var/log/hadoop-yarn/containers/application_1541433014652_0001/container_1541433014652_0001_01_000716/taskmanager.out 2> /var/log/hadoop-yarn/containers/application_1541433014652_0001/container_1541433014652_0001_01_000716/taskmanager.err
> |- 7738 7725 7725 7725 (java) 6959576 976377 8904458240 1671684 /usr/lib/jvm/java-openjdk/bin/java -Xms4995m -Xmx4995m -XX:MaxDirectMemorySize=1533m -Xloggc:/var/log/hadoop-yarn/flink_gc_container_1541433014652_0001_%p.log -XX:GCLogFileSize=200M -XX:NumberOfGCLogFiles=10 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCCause -XX:+PrintGCDateStamps -XX:+UseG1GC -Dlog.file=/var/log/hadoop-yarn/containers/application_1541433014652_0001/container_1541433014652_0001_01_000716/taskmanager.log -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner --configDir .
>
> Container killed on request. Exit code is 143
> Container exited with a non-zero exit code 143
> {code}
> 6. YARN allocates new containers but the job is never able to get back into a stable state, with constant restarts until eventually the job is cancelled.
>
> We've seen something similar to FLINK-10848 happening too, with some task managers allocated but sitting in an 'idle' state.
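A side note on point 5 of the description: the launch command in the kill log caps the heap at 4995 MB and direct memory at 1533 MB, which together already account for almost all of the ~6.4 GB container limit, so metaspace and other native memory have almost no headroom. A rough back-of-the-envelope sketch (numbers copied from the log above; YARN rounding and JVM overhead are assumptions):

{code:java}
// Illustrative arithmetic only - not taken from the ticket.
public class ContainerBudget {
    public static void main(String[] args) {
        long heapMb = 4995;                      // -Xmx4995m from the launch command
        long directMb = 1533;                    // -XX:MaxDirectMemorySize=1533m
        long containerMb = (long) (6.4 * 1024);  // ~6.4 GB YARN physical memory limit
        long headroomMb = containerMb - heapMb - directMb;
        // Whatever is left must hold metaspace, thread stacks and other native memory.
        System.out.printf("Headroom outside heap + direct memory: ~%d MB%n", headroomMb);
    }
}
{code}

With only a few tens of MB left outside heap and direct memory, even modest metaspace growth is enough to trigger the "running beyond physical memory limits" kills seen above.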
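On the "Timeout waiting for connection from pool" exception (problem 2 in the comment): that message is raised when the AWS SDK's HTTP connection pool is exhausted. The sketch below is purely illustrative of where that pool is sized on an AWS SDK v1 client - in this deployment the S3 client is created by the filesystem layer (EMRFS or Flink's S3 filesystem), not by user code, so the class and the values here are hypothetical, not a suggested fix:

{code:java}
import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// Hypothetical example of sizing the SDK v1 HTTP connection pool.
public class PooledS3Client {
    public static AmazonS3 build() {
        ClientConfiguration config = new ClientConfiguration()
                .withMaxConnections(200)          // default pool size is 50 connections
                .withConnectionTimeout(10_000);   // milliseconds
        return AmazonS3ClientBuilder.standard()
                .withClientConfiguration(config)
                .build();
    }
}
{code}

A burst of concurrent checkpoint restore reads after repeated restarts can exhaust a small pool, which would match the exception appearing only during the restart loop.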