[ https://issues.apache.org/jira/browse/FLINK-16510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167743#comment-17167743 ]
Maximilian Michels commented on FLINK-16510:
--------------------------------------------
[~trohrmann] I think it is safe to say that the problem is not reproducible
with G1 disabled. I've run the test for more than 20 hours. Usually the
problem would occur within 30 minutes, or even faster with more memory
pressure.
Global state can be dangerous if it is mutable (which the config shouldn't be).
I wasn't aware of properties that modify the config, but since those are also
global, I don't see a reason why we wouldn't be able to resolve them in
GlobalConfiguration.
Preferably, we should resolve the config only once per task manager and
initialize the exception handler at the same place, or pass through the config
value. I'll have a look.
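
A minimal sketch of the "resolve once, pass the value through" idea, not
actual Flink code. The class names, the config key
"example.halt-on-fatal-error", and the plannedAction() helper are all
hypothetical and only illustrate that the handler sees an immutable value
captured at startup instead of re-reading global state:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these types or keys are Flink APIs.
final class ResolveOnceSketch {

    /** Immutable snapshot of the one setting the handler needs. */
    static final class FatalErrorSettings {
        final boolean haltOnFatalError;

        FatalErrorSettings(boolean haltOnFatalError) {
            this.haltOnFatalError = haltOnFatalError;
        }
    }

    /** Resolves the raw (possibly mutable) config exactly once, at startup. */
    static FatalErrorSettings resolve(Map<String, String> rawConfig) {
        // "example.halt-on-fatal-error" is an assumed key for illustration.
        String value = rawConfig.getOrDefault("example.halt-on-fatal-error", "false");
        return new FatalErrorSettings(Boolean.parseBoolean(value));
    }

    /** Handler that receives the resolved value instead of reading global state. */
    static final class FatalExitHandler implements Thread.UncaughtExceptionHandler {
        private final FatalErrorSettings settings;

        FatalExitHandler(FatalErrorSettings settings) {
            this.settings = settings;
        }

        /** Reports which termination path would be taken; "halt" or "exit". */
        String plannedAction() {
            return settings.haltOnFatalError ? "halt" : "exit";
        }

        @Override
        public void uncaughtException(Thread t, Throwable e) {
            // A real handler would terminate the process here; the sketch only logs.
            System.err.println("Fatal error in " + t.getName() + ", would " + plannedAction());
        }
    }

    public static void main(String[] args) {
        Map<String, String> raw = new HashMap<>();
        raw.put("example.halt-on-fatal-error", "true");

        // Resolve once and initialize the handler at the same place.
        FatalExitHandler handler = new FatalExitHandler(resolve(raw));
        Thread.setDefaultUncaughtExceptionHandler(handler);

        // Later mutation of the raw config does not affect the handler.
        raw.put("example.halt-on-fatal-error", "false");
        System.out.println(handler.plannedAction());
    }
}
```

Because the handler holds only the resolved snapshot, later changes to the
raw config cannot alter its behavior.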
> Task manager safeguard shutdown may not be reliable
> ---------------------------------------------------
>
> Key: FLINK-16510
> URL: https://issues.apache.org/jira/browse/FLINK-16510
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Reporter: Maximilian Michels
> Assignee: Maximilian Michels
> Priority: Major
> Attachments: command.txt, stack2-1.txt, stack3-mixed.txt, stack3.txt
>
>
> The {{JvmShutdownSafeguard}} does not always succeed but can hang when
> multiple threads attempt to shut down the JVM. Apparently mixing
> {{System.exit()}} with ShutdownHooks and forcefully terminating the JVM via
> {{Runtime.halt()}} does not play together well:
> {noformat}
> "Jvm Terminator" #22 daemon prio=5 os_prio=0 tid=0x00007fb8e82f2800 nid=0x5a96 runnable [0x00007fb35cffb000]
> java.lang.Thread.State: RUNNABLE
> at java.lang.Shutdown.$$YJP$$halt0(Native Method)
> at java.lang.Shutdown.halt0(Shutdown.java)
> at java.lang.Shutdown.halt(Shutdown.java:139)
> - locked <0x000000047ed67638> (a java.lang.Shutdown$Lock)
> at java.lang.Runtime.halt(Runtime.java:276)
> at org.apache.flink.runtime.util.JvmShutdownSafeguard$DelayedTerminator.run(JvmShutdownSafeguard.java:86)
> at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
> - None
> "FlinkCompletableFutureDelayScheduler-thread-1" #18154 daemon prio=5 os_prio=0 tid=0x00007fb708a7d000 nid=0x5a8a waiting for monitor entry [0x00007fb289d49000]
> java.lang.Thread.State: BLOCKED (on object monitor)
> at java.lang.Shutdown.halt(Shutdown.java:139)
> - waiting to lock <0x000000047ed67638> (a java.lang.Shutdown$Lock)
> at java.lang.Shutdown.exit(Shutdown.java:213)
> - locked <0x000000047edb7348> (a java.lang.Class for java.lang.Shutdown)
> at java.lang.Runtime.exit(Runtime.java:110)
> at java.lang.System.exit(System.java:973)
> at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.terminateJVM(TaskManagerRunner.java:266)
> at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$onFatalError$1(TaskManagerRunner.java:260)
> at org.apache.flink.runtime.taskexecutor.TaskManagerRunner$$Lambda$27464/1464672548.accept(Unknown Source)
> at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:943)
> at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211)
> at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$11(FutureUtils.java:361)
> at org.apache.flink.runtime.concurrent.FutureUtils$$Lambda$27435/159015392.run(Unknown Source)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Locked ownable synchronizers:
> - <0x00000006d5e56bd0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> {noformat}
> Note that under this condition the JVM should terminate but it still hangs.
> Sometimes it quits after several minutes.
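
For readers unfamiliar with the pattern named in the description, the
following is a rough sketch of a shutdown safeguard: a shutdown hook that
starts a daemon thread which force-halts the JVM after a delay. It is an
illustrative reconstruction, not the source of JvmShutdownSafeguard; the
class name DelayedHaltSketch, the injected halter, and the delay and exit
code values are assumptions. It does not attempt to reproduce the hang.

```java
import java.util.function.IntConsumer;

// Hypothetical sketch of a "halt after a grace period" safeguard.
final class DelayedHaltSketch {

    /** Waits for the grace period, then invokes the halt action. */
    static final class DelayedTerminator implements Runnable {
        private final long delayMillis;
        private final int exitCode;
        private final IntConsumer halter; // injected so tests need not kill the JVM

        DelayedTerminator(long delayMillis, int exitCode, IntConsumer halter) {
            this.delayMillis = delayMillis;
            this.exitCode = exitCode;
            this.halter = halter;
        }

        @Override
        public void run() {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                // Restore the flag but still halt: this thread is the last resort.
                Thread.currentThread().interrupt();
            }
            halter.accept(exitCode);
        }
    }

    /**
     * Registers a shutdown hook. Once shutdown begins (for example via
     * System.exit()), the hook starts a daemon thread that calls
     * Runtime.halt() if the JVM is still alive after delayMillis.
     */
    static Thread installSafeguard(long delayMillis, int exitCode) {
        Thread hook = new Thread(() -> {
            Thread terminator = new Thread(
                    new DelayedTerminator(delayMillis, exitCode, Runtime.getRuntime()::halt),
                    "Jvm Terminator");
            terminator.setDaemon(true);
            terminator.start();
        }, "shutdown-safeguard-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }
}
```

In this sketch the terminator thread calls Runtime.halt() while another
thread may still be inside System.exit(), which is the combination of
calls visible in the two stack traces above.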