[
https://issues.apache.org/jira/browse/FLINK-10928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16721569#comment-16721569
]
Daniel Harper commented on FLINK-10928:
---------------------------------------
Hi [~dawidwys]
It's going to be tricky to provide a heap dump due to sensitive data, unfortunately.
We've resolved the connection timeout issue by increasing the connection pool
size from 15 to 30; after running with this for 7 days or so we have seen 0
'Timeout waiting for connection from pool' errors when the job restarts and
restores from a checkpoint.
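For reference, a minimal sketch of the kind of change we made, assuming checkpoints are written to S3 through the Hadoop/S3A filesystem (the exact key name and where it goes, flink-conf.yaml vs. core-site.xml, depends on which S3 filesystem implementation is in use, so treat this as illustrative):
{code}
# Assumed key name for the S3A connection pool; S3A's default of 15
# lines up with the errors we were seeing, and doubling it to 30
# removed the 'Timeout waiting for connection from pool' errors for us.
fs.s3a.connection.maximum: 30
{code}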
One of the causes of the job restarting in the first place is FLINK-10844,
which causes the checkpoint to fail (note this is intermittent; we see it once
or twice a day), which in turn causes the job to restart.
We are looking at setting {{failOnCheckpointingErrors}} to false to mitigate
this in the meantime, although we understand the risk of doing so.
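For anyone following along, a minimal sketch of what that setting looks like against the Flink API directly (we actually submit through the Beam Flink runner, so we would use whatever equivalent pipeline option it exposes; the snippet below is illustrative only):
{code}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TolerateCheckpointFailures {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Interval is illustrative; keep whatever the job already uses.
        env.enableCheckpointing(60_000L);
        // Don't fail (and therefore restart) the whole job when an individual
        // checkpoint fails, e.g. due to FLINK-10844. The trade-off: a persistent
        // checkpointing problem can go unnoticed until a restore is needed.
        env.getCheckpointConfig().setFailOnCheckpointingErrors(false);
        // ... build the rest of the pipeline and call env.execute() as usual.
    }
}
{code}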
This 'death spiral'/instability has happened 3 or so times in the past 6 weeks,
and we see the job restarting once or twice a day in between these massive
failures. The only thing I can think of is a memory leak building up over time
and eventually triggering YARN to kill the containers.
I did a heap dump on one of the taskmanagers this morning and it looks to me
like there are multiple copies of 'user' classes (i.e. Beam code and our own
code), most of which have 0 instances, which looks like a classloader leak to
me. This snapshot was taken after the job had restarted about 10 times.
!Screen Shot 2018-12-10 at 14.13.52.png!
> Job unable to stabilise after restart
> --------------------------------------
>
> Key: FLINK-10928
> URL: https://issues.apache.org/jira/browse/FLINK-10928
> Project: Flink
> Issue Type: Bug
> Environment: AWS EMR 5.17.0
> FLINK 1.5.2
> BEAM 2.7.0
> Reporter: Daniel Harper
> Priority: Major
> Attachments: Screen Shot 2018-11-16 at 15.49.03.png, Screen Shot
> 2018-11-16 at 15.49.15.png,
> ants-CopyofThe'death'spiralincident-191118-1231-1332.pdf
>
>
> We've seen a few instances of this occurring in production now (it's
> difficult to reproduce).
> I've attached a timeline of events as a PDF here
> [^ants-CopyofThe'death'spiralincident-191118-1231-1332.pdf], but essentially
> it boils down to:
> 1. Job restarts due to exception
> 2. Job restores from a checkpoint but we get the exception
> {code}
> Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request:
> Timeout waiting for connection from pool
> {code}
> 3. Job restarts
> 4. Job restores from a checkpoint but we get the same exception
> .... repeat a few times within 2-3 minutes....
> 5. YARN kills containers with out of memory
> {code}
> 2018-11-14 00:16:04,430 INFO org.apache.flink.yarn.YarnResourceManager - Closing TaskExecutor connection container_1541433014652_0001_01_000716 because: Container [pid=7725,containerID=container_1541433014652_0001_01_000716] is running beyond physical memory limits. Current usage: 6.4 GB of 6.4 GB physical memory used; 8.4 GB of 31.9 GB virtual memory used. Killing container.
> Dump of the process-tree for container_1541433014652_0001_01_000716 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 7725 7723 7725 7725 (bash) 0 0 115863552 696 /bin/bash -c /usr/lib/jvm/java-openjdk/bin/java -Xms4995m -Xmx4995m -XX:MaxDirectMemorySize=1533m -Xloggc:/var/log/hadoop-yarn/flink_gc_container_1541433014652_0001_%p.log -XX:GCLogFileSize=200M -XX:NumberOfGCLogFiles=10 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCCause -XX:+PrintGCDateStamps -XX:+UseG1GC -Dlog.file=/var/log/hadoop-yarn/containers/application_1541433014652_0001/container_1541433014652_0001_01_000716/taskmanager.log -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1> /var/log/hadoop-yarn/containers/application_1541433014652_0001/container_1541433014652_0001_01_000716/taskmanager.out 2> /var/log/hadoop-yarn/containers/application_1541433014652_0001/container_1541433014652_0001_01_000716/taskmanager.err
> |- 7738 7725 7725 7725 (java) 6959576 976377 8904458240 1671684 /usr/lib/jvm/java-openjdk/bin/java -Xms4995m -Xmx4995m -XX:MaxDirectMemorySize=1533m -Xloggc:/var/log/hadoop-yarn/flink_gc_container_1541433014652_0001_%p.log -XX:GCLogFileSize=200M -XX:NumberOfGCLogFiles=10 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCCause -XX:+PrintGCDateStamps -XX:+UseG1GC -Dlog.file=/var/log/hadoop-yarn/containers/application_1541433014652_0001/container_1541433014652_0001_01_000716/taskmanager.log -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner --configDir .
>
> Container killed on request. Exit code is 143
> Container exited with a non-zero exit code 143
> {code}
> 6. YARN allocates new containers but the job is never able to get back into a
> stable state, with constant restarts until eventually the job is cancelled.
> We've seen something similar to FLINK-10848 happening too, with some task
> managers allocated but sitting in an 'idle' state.