[ 
https://issues.apache.org/jira/browse/FLINK-16510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153813#comment-17153813
 ] 

Thomas Weise commented on FLINK-16510:
--------------------------------------

We are not able to reliably run our applications on k8s when pods get stuck 
during termination on a fatal task manager error. When pods don't exit, our 
infrastructure cannot replace the task manager and applications cannot recover. 
We have seen this issue many times and were able to reproduce it with 
benchmarks that produce intermittent OOMs. Based on the analysis from [~mxm], 
we have applied this change to our fork:

[https://github.com/lyft/flink/commit/4787e4d638c5b299164b85e7e492967bf573c400]

We would like to address this issue upstream, though. When a fatal error 
occurs, the process should safely terminate. Triggering shutdown hooks is 
unlikely to succeed. It is important that we get a fresh TM deployed to allow 
for job recovery and forward progress, avoiding extended downtime and the need 
for manual intervention.

Do you see any downside to using the hard stop instead of System.exit?

Currently, there are multiple occurrences of System.exit. It would be nice to 
centralize everything that aims to "exitOnFatalError".
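
As a rough sketch of what such centralization could look like: the class 
FatalExit and its methods below are hypothetical names, not existing Flink 
code, and the sketch assumes that the hard stop means Runtime.halt(), which 
terminates the JVM without running shutdown hooks. The terminating action is 
injected so that the helper can be exercised without killing the process.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntConsumer;

/** Hypothetical helper (not part of Flink): a single place that ends the process on a fatal error. */
final class FatalExit {
    // The action that actually ends the process; injectable so it can be replaced in tests.
    private final IntConsumer terminator;
    // Ensures that only the first caller triggers termination.
    private final AtomicBoolean triggered = new AtomicBoolean(false);

    FatalExit(IntConsumer terminator) {
        this.terminator = terminator;
    }

    /** Assumed production variant: Runtime.halt() does not start shutdown hooks. */
    static FatalExit hardStop() {
        return new FatalExit(code -> Runtime.getRuntime().halt(code));
    }

    /** Returns true if this call triggered termination, false if an earlier call already did. */
    boolean exitOnFatalError(int exitCode) {
        if (!triggered.compareAndSet(false, true)) {
            return false;
        }
        terminator.accept(exitCode);
        return true;
    }
}
```

Call sites that currently invoke System.exit directly on a fatal error would 
then go through one shared instance, for example 
FatalExit.hardStop().exitOnFatalError(1), so the exit strategy can be changed 
in one place.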

> Task manager safeguard shutdown may not be reliable
> ---------------------------------------------------
>
>                 Key: FLINK-16510
>                 URL: https://issues.apache.org/jira/browse/FLINK-16510
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>            Reporter: Maximilian Michels
>            Assignee: Maximilian Michels
>            Priority: Major
>
> The {{JvmShutdownSafeguard}} does not always succeed but can hang when 
> multiple threads attempt to shut down the JVM. Apparently, mixing 
> {{System.exit()}} with ShutdownHooks and forcefully terminating the JVM via 
> {{Runtime.halt()}} does not play together well:
> {noformat}
> "Jvm Terminator" #22 daemon prio=5 os_prio=0 tid=0x00007fb8e82f2800 nid=0x5a96 runnable [0x00007fb35cffb000]
>    java.lang.Thread.State: RUNNABLE
>       at java.lang.Shutdown.$$YJP$$halt0(Native Method)
>       at java.lang.Shutdown.halt0(Shutdown.java)
>       at java.lang.Shutdown.halt(Shutdown.java:139)
>       - locked <0x000000047ed67638> (a java.lang.Shutdown$Lock)
>       at java.lang.Runtime.halt(Runtime.java:276)
>       at org.apache.flink.runtime.util.JvmShutdownSafeguard$DelayedTerminator.run(JvmShutdownSafeguard.java:86)
>       at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
>       - None
> "FlinkCompletableFutureDelayScheduler-thread-1" #18154 daemon prio=5 os_prio=0 tid=0x00007fb708a7d000 nid=0x5a8a waiting for monitor entry [0x00007fb289d49000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>       at java.lang.Shutdown.halt(Shutdown.java:139)
>       - waiting to lock <0x000000047ed67638> (a java.lang.Shutdown$Lock)
>       at java.lang.Shutdown.exit(Shutdown.java:213)
>       - locked <0x000000047edb7348> (a java.lang.Class for java.lang.Shutdown)
>       at java.lang.Runtime.exit(Runtime.java:110)
>       at java.lang.System.exit(System.java:973)
>       at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.terminateJVM(TaskManagerRunner.java:266)
>       at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$onFatalError$1(TaskManagerRunner.java:260)
>       at org.apache.flink.runtime.taskexecutor.TaskManagerRunner$$Lambda$27464/1464672548.accept(Unknown Source)
>       at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>       at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>       at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>       at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>       at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:943)
>       at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211)
>       at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$11(FutureUtils.java:361)
>       at org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$27435/159015392.run(Unknown Source)
>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
>       - <0x00000006d5e56bd0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> {noformat}
> Note that under this condition the JVM should terminate but it still hangs. 
> Sometimes it quits after several minutes.


