[
https://issues.apache.org/jira/browse/FLINK-10928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692618#comment-16692618
]
Biao Liu edited comment on FLINK-10928 at 11/20/18 4:08 AM:
------------------------------------------------------------
Hi [~djharper]
1. "Why does YARN kill the containers with out of memory?"
The reason is described clearly in the exception:
{code:java}
Container [pid=7725,containerID=container_1541433014652_0001_01_000716] is
running beyond physical memory limits. Current usage: 6.4 GB of 6.4 GB physical
memory used; 8.4 GB of 31.9 GB virtual memory used. Killing container.{code}
Your container went beyond its physical memory limit. This is not an OOM: an
OutOfMemoryError can make the job fail, but it does not get the container
killed by YARN.
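For reference, the JVM flags in your log already account for essentially the
whole container (assuming the 6.4 GB reported by YARN is the container limit):
{code}
heap   (-Xmx)                     4995 MB
direct (-XX:MaxDirectMemorySize)  1533 MB
                                 --------
                                  6528 MB ≈ 6.4 GB
{code}
Any additional native memory, for example metaspace, thread stacks, or native
allocations by the state backend, pushes the process over the limit and gets
the container killed.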
2. "Is it possible for the task manager to allocate memory outside of the 'off
heap' allocation, which would cause YARN to kill the container?"
Yes, it is possible. The JVM itself, the state backend, and Netty may all
allocate off-heap or native memory.
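As a small, hypothetical illustration (not your job's code): memory like the
following never shows up under -Xmx, yet YARN counts it against the container's
physical memory.
{code:java}
import java.nio.ByteBuffer;

public class OffHeapIllustration {
    public static void main(String[] args) {
        // Direct buffers live outside the JVM heap; they are capped by
        // -XX:MaxDirectMemorySize (1533m in your logs), not by -Xmx.
        ByteBuffer direct = ByteBuffer.allocateDirect(256 * 1024 * 1024);

        // Native memory allocated outside the JVM, e.g. via JNI by a native
        // state backend, is not capped by either flag, but it is still part of
        // the container's physical memory usage as seen by YARN.
        System.out.println("Allocated " + direct.capacity() + " bytes off heap");
    }
}
{code}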
3. "Why do we get timeout waiting for connection from pool from the AWS SDK?"
I'm not sure, because I can't see the whole picture of your job. However, there
is a "FileNotFoundException" thrown by user code; I think that's not caused by
Flink, right?
{code:java}
Caused by: org.apache.beam.sdk.util.UserCodeException:
java.io.FileNotFoundException: Reopen at position 0 on
s3a://.../beam/.temp-beam-2018-11-05_15-54-26-0/bc47b14b-1679-45ce-81b7-a4d19e036cb5:
com.amazonaws.services.s3.model.AmazonS3Exception: The specified key does not
exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey;
Request ID: 0D67ACD1037E5B52; S3 Extended Request ID:
BVgqzksS75Dv1EkZyUgkVMl8brE1PznBM1RsN9uXp2cnn8Rf+r+b9D09TWZQtpW8aSbQi7R9 RW8=),
S3 Extended Request ID:
BVgqzksS75Dv1EkZyUgkVMl8brE1PznBM1RsN9uXp2cnn8Rf+r+b9D09TWZQtpW8aSbQi7R9 RW8=
{code}
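If you want to rule out pool exhaustion on the S3 side, one thing you could
check (assuming your job goes through the Hadoop s3a filesystem, as the s3a://
paths in the stack trace suggest) is the size of the SDK connection pool, which
is bounded by the Hadoop property fs.s3a.connection.maximum. A minimal sketch,
not something I have verified against your setup:
{code:java}
import org.apache.hadoop.conf.Configuration;

public class S3aPoolSizeSketch {
    public static void main(String[] args) {
        // fs.s3a.connection.maximum bounds the AWS SDK HTTP connection pool
        // behind the s3a filesystem; 100 is only an example value. The same
        // key is normally set in core-site.xml on the cluster.
        Configuration hadoopConf = new Configuration();
        hadoopConf.setInt("fs.s3a.connection.maximum", 100);
        System.out.println(hadoopConf.get("fs.s3a.connection.maximum"));
    }
}
{code}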
There are quite a few problems in your description, and most of them do not
seem to be related to the Flink framework itself.
Could you fix the memory issue and the FileNotFoundException first?
Also, I think this kind of question is better asked on the Flink user mailing
list than here.
> Job unable to stabilise after restart
> --------------------------------------
>
> Key: FLINK-10928
> URL: https://issues.apache.org/jira/browse/FLINK-10928
> Project: Flink
> Issue Type: Bug
> Environment: AWS EMR 5.17.0
> FLINK 1.5.2
> BEAM 2.7.0
> Reporter: Daniel Harper
> Priority: Major
> Attachments: Screen Shot 2018-11-16 at 15.49.03.png, Screen Shot
> 2018-11-16 at 15.49.15.png,
> ants-CopyofThe'death'spiralincident-191118-1231-1332.pdf
>
>
> We've seen a few instances of this occurring in production now (it's
> difficult to reproduce).
> I've attached a timeline of events as a PDF here
> [^ants-CopyofThe'death'spiralincident-191118-1231-1332.pdf], but essentially
> it boils down to:
> 1. Job restarts due to exception
> 2. Job restores from a checkpoint but we get the exception
> {code}
> Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request:
> Timeout waiting for connection from pool
> {code}
> 3. Job restarts
> 4. Job restores from a checkpoint but we get the same exception
> .... repeat a few times within 2-3 minutes....
> 5. YARN kills containers with out of memory
> {code}
> 2018-11-14 00:16:04,430 INFO org.apache.flink.yarn.YarnResourceManager
> - Closing TaskExecutor connection container_1541433014652_0001_01_000716
> because: Container [pid=7725,containerID=container_1541433014652_0001_01_000716]
> is running beyond physical memory limits. Current usage: 6.4 GB of 6.4 GB
> physical memory used; 8.4 GB of 31.9 GB virtual memory used. Killing container.
> Dump of the process-tree for container_1541433014652_0001_01_000716 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS)
> VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 7725 7723 7725 7725 (bash) 0 0 115863552 696 /bin/bash -c
> /usr/lib/jvm/java-openjdk/bin/java -Xms4995m -Xmx4995m
> -XX:MaxDirectMemorySize=1533m
> -Xloggc:/var/log/hadoop-yarn/flink_gc_container_1541433014652_0001_%p.log
> -XX:GCLogFileSize=200M -XX:NumberOfGCLogFiles=10 -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCCause
> -XX:+PrintGCDateStamps -XX:+UseG1GC
> -Dlog.file=/var/log/hadoop-yarn/containers/application_1541433014652_0001/container_1541433014652_0001_01_000716/taskmanager.log
> -Dlog4j.configuration=file:./log4j.properties
> org.apache.flink.yarn.YarnTaskExecutorRunner --configDir . 1>
> /var/log/hadoop-yarn/containers/application_1541433014652_0001/container_1541433014652_0001_01_000716/taskmanager.out 2>
> /var/log/hadoop-yarn/containers/application_1541433014652_0001/container_1541433014652_0001_01_000716/taskmanager.err
> |- 7738 7725 7725 7725 (java) 6959576 976377 8904458240 1671684
> /usr/lib/jvm/java-openjdk/bin/java -Xms4995m -Xmx4995m
> -XX:MaxDirectMemorySize=1533m
> -Xloggc:/var/log/hadoop-yarn/flink_gc_container_1541433014652_0001_%p.log
> -XX:GCLogFileSize=200M -XX:NumberOfGCLogFiles=10 -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution -XX:+PrintGCCause
> -XX:+PrintGCDateStamps -XX:+UseG1GC
> -Dlog.file=/var/log/hadoop-yarn/containers/application_1541433014652_0001/container_1541433014652_0001_01_000716/taskmanager.log
> -Dlog4j.configuration=file:./log4j.properties
> org.apache.flink.yarn.YarnTaskExecutorRunner --configDir .
>
> Container killed on request. Exit code is 143
> Container exited with a non-zero exit code 143
> {code}
> 6. YARN allocates new containers, but the job is never able to get back into a
> stable state, with constant restarts until eventually the job is cancelled.
> We've seen something similar to FLINK-10848 happening too, with some task
> managers allocated but sitting in an 'idle' state.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)