[
https://issues.apache.org/jira/browse/FLINK-17470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142822#comment-17142822
]
Stephan Ewen commented on FLINK-17470:
--------------------------------------
I think one thing you can do is change the {{"flink-daemon.sh stop"}} command
to use SIGKILL instead of SIGTERM.
That should work because SIGKILL bypasses all types of shutdown hooks. It will
not be as graceful an exit, though. For example, Flink won't clean up its temp
directory itself, so you will need to clean that up eventually (unless your
system environment takes care of it).
As a more advanced version, you could try to extend that script to send a
SIGTERM and then a SIGKILL some X seconds later, if the process still exists.
That's what Flink does internally, but it gets hung up due to the JVM/kernel
issue. I don't know off the top of my head how best to do that in bash, though.
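One possible way to do the SIGTERM-then-SIGKILL escalation in bash is sketched below. This is only an illustration, not what flink-daemon.sh does: the function name {{stop_with_timeout}}, its arguments, and the 10-second default are made up for this example, and the script would still need to read the PID from its PID file as it does today.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of flink-daemon.sh.
# Sends SIGTERM to a PID, waits up to a timeout for the process to go away,
# and sends SIGKILL if it is still there.
stop_with_timeout() {
    local pid="$1"
    local timeout="${2:-10}"   # grace period in seconds (assumed default)

    # Ask for a graceful shutdown; nothing more to do if the PID is already gone.
    kill -TERM "$pid" 2>/dev/null || return 0

    local waited=0
    # "kill -0" delivers no signal; it only checks that the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            # Grace period is over and the process is still there: force it.
            kill -KILL "$pid" 2>/dev/null
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

It would be called as something like {{stop_with_timeout "$pid" 30}}. Polling once per second keeps the sketch simple; the graceful path still runs the shutdown hooks, and only a process that outlives the grace period is killed without cleanup.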
> Flink task executor process permanently hangs on `flink-daemon.sh stop`,
> deletes PID file
> -----------------------------------------------------------------------------------------
>
> Key: FLINK-17470
> URL: https://issues.apache.org/jira/browse/FLINK-17470
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.10.0
> Environment:
> {code:java}
> $ uname -a
> Linux hostname.local 3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
> $ lsb_release -a
> LSB Version: :core-4.1-amd64:core-4.1-noarch
> Distributor ID: CentOS
> Description: CentOS Linux release 7.7.1908 (Core)
> Release: 7.7.1908
> Codename: Core
> {code}
> Flink version 1.10
>
> Reporter: Hunter Herman
> Priority: Major
> Attachments: flink_jstack.log, flink_mixed_jstack.log
>
>
> Hi Flink team!
> We've attempted to upgrade our Flink 1.9 cluster to 1.10, but are
> experiencing reproducible instability on shutdown. Specifically, it appears
> that the `kill` issued in the `stop` case of flink-daemon.sh is causing the
> task executor process to hang permanently. The process seems to be hanging in
> `org.apache.flink.runtime.util.JvmShutdownSafeguard$DelayedTerminator.run` in
> a `Thread.sleep()` call, which seems like bizarre behavior. Also note that
> every thread in the process is BLOCKED on a `pthread_cond_wait` call. Is
> this an OS-level issue? Banging my head on a wall here. See attached stack
> traces for details.