[ 
https://issues.apache.org/jira/browse/FLINK-16510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17168099#comment-17168099
 ] 

Maximilian Michels commented on FLINK-16510:
--------------------------------------------

I ran another test with the existing {{taskmanager.jvm-exit-on-oom}} option 
enabled, which halts the JVM from the task thread on OOM errors. So far, the 
pods appear to get killed as expected. Apparently, the odds are low that an 
OOM error occurs outside the task thread. Still, it is possible that we get 
the error elsewhere in the TaskManager. I wonder whether we should deprecate 
this option in favor of a more general {{taskmanager.jvm-exit-on-fatal-error}}, 
which would include OOM errors. The alternative would be to add it as an 
additional option.
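
To make the two variants concrete, here is a minimal sketch of the decision 
logic. It is not Flink's actual code: the class name, the exit code, and the 
definition of "fatal" (any VirtualMachineError in the cause chain) are 
assumptions for illustration. The halt action is passed in so that the logic 
can be exercised without killing the JVM.

```java
import java.util.function.IntConsumer;

/**
 * Hypothetical sketch, not Flink code: decides whether a Throwable should
 * terminate the process, given two boolean settings.
 */
final class FatalErrorExitPolicy {

    /** Arbitrary exit code chosen for this sketch. */
    static final int EXIT_CODE = 1;

    private final boolean exitOnOom;        // stands in for taskmanager.jvm-exit-on-oom
    private final boolean exitOnFatalError; // stands in for the proposed, more general option
    private final IntConsumer terminator;

    FatalErrorExitPolicy(boolean exitOnOom, boolean exitOnFatalError, IntConsumer terminator) {
        this.exitOnOom = exitOnOom;
        this.exitOnFatalError = exitOnFatalError;
        this.terminator = terminator;
    }

    /** Wires the policy to Runtime.halt(), which does not run shutdown hooks. */
    static FatalErrorExitPolicy haltingPolicy(boolean exitOnOom, boolean exitOnFatalError) {
        return new FatalErrorExitPolicy(exitOnOom, exitOnFatalError, Runtime.getRuntime()::halt);
    }

    /** Returns true if the terminator was invoked for this error. */
    boolean handle(Throwable error) {
        boolean oom = hasCause(error, OutOfMemoryError.class);
        // Assumption: "fatal" means any VirtualMachineError, which includes OOM.
        boolean fatal = hasCause(error, VirtualMachineError.class);
        if ((exitOnOom && oom) || (exitOnFatalError && fatal)) {
            terminator.accept(EXIT_CODE);
            return true;
        }
        return false;
    }

    /** Walks the cause chain looking for an instance of the given type. */
    private static boolean hasCause(Throwable error, Class<? extends Throwable> type) {
        for (Throwable t = error; t != null; t = t.getCause()) {
            if (type.isInstance(t)) {
                return true;
            }
        }
        return false;
    }
}
```

With this shape, a single general flag subsumes the OOM flag, which is the 
argument for deprecating the existing option rather than adding a second one.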

> Task manager safeguard shutdown may not be reliable
> ---------------------------------------------------
>
>                 Key: FLINK-16510
>                 URL: https://issues.apache.org/jira/browse/FLINK-16510
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>            Reporter: Maximilian Michels
>            Assignee: Maximilian Michels
>            Priority: Major
>         Attachments: command.txt, stack2-1.txt, stack3-mixed.txt, stack3.txt
>
>
> The {{JvmShutdownSafeguard}} does not always succeed; it can hang when 
> multiple threads attempt to shut down the JVM. Apparently, mixing 
> {{System.exit()}} and shutdown hooks with forcefully terminating the JVM via 
> {{Runtime.halt()}} does not work well:
> {noformat}
> "Jvm Terminator" #22 daemon prio=5 os_prio=0 tid=0x00007fb8e82f2800 nid=0x5a96 runnable [0x00007fb35cffb000]
>    java.lang.Thread.State: RUNNABLE
>       at java.lang.Shutdown.$$YJP$$halt0(Native Method)
>       at java.lang.Shutdown.halt0(Shutdown.java)
>       at java.lang.Shutdown.halt(Shutdown.java:139)
>       - locked <0x000000047ed67638> (a java.lang.Shutdown$Lock)
>       at java.lang.Runtime.halt(Runtime.java:276)
>       at org.apache.flink.runtime.util.JvmShutdownSafeguard$DelayedTerminator.run(JvmShutdownSafeguard.java:86)
>       at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
>       - None
> "FlinkCompletableFutureDelayScheduler-thread-1" #18154 daemon prio=5 os_prio=0 tid=0x00007fb708a7d000 nid=0x5a8a waiting for monitor entry [0x00007fb289d49000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>       at java.lang.Shutdown.halt(Shutdown.java:139)
>       - waiting to lock <0x000000047ed67638> (a java.lang.Shutdown$Lock)
>       at java.lang.Shutdown.exit(Shutdown.java:213)
>       - locked <0x000000047edb7348> (a java.lang.Class for java.lang.Shutdown)
>       at java.lang.Runtime.exit(Runtime.java:110)
>       at java.lang.System.exit(System.java:973)
>       at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.terminateJVM(TaskManagerRunner.java:266)
>       at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$onFatalError$1(TaskManagerRunner.java:260)
>       at org.apache.flink.runtime.taskexecutor.TaskManagerRunner$$Lambda$27464/1464672548.accept(Unknown Source)
>       at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
>       at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
>       at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>       at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>       at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:943)
>       at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211)
>       at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$11(FutureUtils.java:361)
>       at org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$27435/159015392.run(Unknown Source)
>       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
>       - <0x00000006d5e56bd0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> {noformat}
> Note that under this condition the JVM should terminate but it still hangs. 
> Sometimes it quits after several minutes.
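
Since the description points at several threads initiating termination at 
once, one way to narrow that window is to let only the first caller proceed. 
The following is a rough sketch under that assumption; the class name is 
hypothetical and this is not what Flink does. Whether it would resolve the 
hang shown in the dump is not established by the trace, because the 
terminator thread there already holds the halt lock and is still stuck.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntConsumer;

/**
 * Hypothetical sketch, not Flink code: ensures that only the first thread
 * asking for termination actually runs the termination action.
 */
final class OnceOnlyTerminator {

    private final AtomicBoolean started = new AtomicBoolean(false);
    private final IntConsumer action;

    /** The action would typically be Runtime.getRuntime()::halt. */
    OnceOnlyTerminator(IntConsumer action) {
        this.action = action;
    }

    /** Returns true for the one caller that ran the action, false for all others. */
    boolean terminate(int exitCode) {
        // compareAndSet succeeds for exactly one thread, even under contention.
        if (!started.compareAndSet(false, true)) {
            return false;
        }
        action.accept(exitCode);
        return true;
    }
}
```

Callers that lose the race return immediately instead of queuing up behind 
the JVM's shutdown locks.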



--
This message was sent by Atlassian Jira
(v8.3.4#803005)