[ https://issues.apache.org/jira/browse/FLINK-9132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Rohrmann resolved FLINK-9132.
----------------------------------
Resolution: Won't Fix
This problem should be fixed in newer Flink versions (>= 1.5). Please try
one of these versions and report back if it is not working.
The community unfortunately no longer supports Flink 1.4.2.
> Cluster runs out of task slots when a job falls into restart loop
> -----------------------------------------------------------------
>
> Key: FLINK-9132
> URL: https://issues.apache.org/jira/browse/FLINK-9132
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.4.2
> Environment: env.java.opts in flink-conf.yaml file:
>
> env.java.opts: -Xloggc:/home/user/flink/log/flinkServer-gc.log -verbose:gc
> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
> -XX:+UseG1GC -XX:MaxGCPauseMillis=150 -XX:InitiatingHeapOccupancyPercent=55
> -XX:+ParallelRefProcEnabled -XX:ParallelGCThreads=2 -XX:-ResizePLAB
> -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=100M
> Reporter: Alex Smirnov
> Priority: Critical
> Attachments: FailedJob.java, jconsole-classes.png
>
>
> If a job is restarting in a loop, the Task Manager hosting it goes down
> after some time. The Job Manager automatically assigns the job to another
> Task Manager, and the new Task Manager goes down as well. After some time,
> all Task Managers are gone and the cluster becomes paralyzed.
> I attached jconsole to the Task Manager's Java process and noticed that
> the number of loaded classes increases dramatically when a job is in a
> restart loop and restores from a checkpoint.
> See the attachment for the graph with G1GC enabled for the node. The
> standard GC performs even worse: the Task Manager shuts down within 20
> minutes of the restart loop starting.
> I've also attached a minimal program to reproduce the problem.
>
> Please let me know if additional information is required from me.
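
The loaded-class growth described above was observed with jconsole. As a rough illustration only, the same counters can be read programmatically through the JDK's standard ClassLoadingMXBean. The class name LoadedClassSampler and the sample() helper are hypothetical and not part of Flink or of the attached FailedJob.java. The sketch reads the counters of the JVM it runs in; observing a Task Manager would require a remote JMX connection, as jconsole uses.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Flink: samples the class-loading
// counters of the JVM it runs in.
public class LoadedClassSampler {

    /** Returns {currently loaded, total loaded since JVM start, unloaded}. */
    static long[] sample() {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        return new long[] {
            bean.getLoadedClassCount(),
            bean.getTotalLoadedClassCount(),
            bean.getUnloadedClassCount()
        };
    }

    public static void main(String[] args) throws InterruptedException {
        // Print a few samples one second apart. A loaded count that keeps
        // rising while the unloaded count stays flat would be consistent
        // with the behaviour reported in this issue.
        for (int i = 0; i < 3; i++) {
            long[] s = sample();
            System.out.printf("loaded=%d totalLoaded=%d unloaded=%d%n", s[0], s[1], s[2]);
            Thread.sleep(1000);
        }
    }
}
```

Comparing the loaded and unloaded counts over several restarts gives a quick indication of whether classes from earlier job attempts are being released.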