[jira] [Resolved] (FLINK-9132) Cluster runs out of task slots when a job falls into restart loop

Till Rohrmann (JIRA) Fri, 29 Mar 2019 04:46:40 -0700


     [ https://issues.apache.org/jira/browse/FLINK-9132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Till Rohrmann resolved FLINK-9132.
----------------------------------
    Resolution: Won't Fix

This problem should be fixed in newer Flink versions (>= 1.5). Please try 
such a version and report back if it is not working.

The community unfortunately no longer supports Flink 1.4.2.

> Cluster runs out of task slots when a job falls into restart loop
> -----------------------------------------------------------------
>
>                 Key: FLINK-9132
>                 URL: https://issues.apache.org/jira/browse/FLINK-9132
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.4.2
>         Environment: env.java.opts in flink-conf.yaml file:
>  
> env.java.opts: -Xloggc:/home/user/flink/log/flinkServer-gc.log  -verbose:gc 
> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps 
> -XX:+UseG1GC -XX:MaxGCPauseMillis=150 -XX:InitiatingHeapOccupancyPercent=55 
> -XX:+ParallelRefProcEnabled -XX:ParallelGCThreads=2 -XX:-ResizePLAB 
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=100M
>            Reporter: Alex Smirnov
>            Priority: Critical
>         Attachments: FailedJob.java, jconsole-classes.png
>
>
> If a job is restarting in a loop, the Task Manager hosting it goes down 
> after some time. The Job Manager automatically assigns the job to another 
> Task Manager, and the new Task Manager goes down as well. After some time, 
> all Task Managers are gone and the cluster becomes paralyzed.
> I attached jconsole to the TaskManager's Java process and noticed that the 
> number of loaded classes increases dramatically when a job is in a restart 
> loop and restores from a checkpoint.
> See the attachment for the graph with G1GC enabled for the node. The 
> standard GC performs even worse: the Task Manager shuts down within 20 
> minutes of the restart loop starting.
> I've also attached a minimal program to reproduce the problem.
>  
> Please let me know if additional information is required from me.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
