[
https://issues.apache.org/jira/browse/FLINK-17470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119316#comment-17119316
]
Robert Metzger commented on FLINK-17470:
----------------------------------------
Just a side note: Flink's end-to-end tests (which run Flink using the scripts)
are also running on a JVM from Azul, specifically: {{OpenJDK 64-Bit Server VM -
Azul Systems, Inc. - 1.8/25.252-b14}}. I have never observed shutdown
instabilities there.
From the blog post
(https://blogs.oracle.com/poonam/hung-jvm-due-to-the-threads-stuck-in-pthreadcondtimedwait)
you've mentioned on the mailing list, it seems that the fix in the Linux kernel
(https://github.com/torvalds/linux/commit/76835b0ebf8a7fe85beb03c75121419a7dec52f0)
has been available since 3.18. You are on Linux 3.10. Would you be able to
validate whether this issue still persists on Linux 3.18+?
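For a quick check of whether a given host is on 3.18 or newer, a minimal
sketch along the following lines might help. The function name
kernel_at_least is made up for illustration, and the version number is only a
first indication, since distribution kernels sometimes backport fixes:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the kernel release string in $1
# (as printed by `uname -r`) is at least version $2.$3.
kernel_at_least() {
    release=$1
    major=${release%%.*}          # text before the first dot
    rest=${release#*.}            # text after the first dot
    minor=${rest%%.*}             # text before the next dot
    minor=${minor%%[!0-9]*}       # drop any non-numeric suffix
    [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}

if kernel_at_least "$(uname -r)" 3 18; then
    echo "kernel $(uname -r): 3.18 or newer"
else
    echo "kernel $(uname -r): older than 3.18"
fi
```

On the host from the issue description (3.10.0-1062.9.1.el7.x86_64), this
should take the "older than 3.18" branch.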
> Flink task executor process permanently hangs on `flink-daemon.sh stop`,
> deletes PID file
> -----------------------------------------------------------------------------------------
>
> Key: FLINK-17470
> URL: https://issues.apache.org/jira/browse/FLINK-17470
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.10.0
> Environment:
> {code:java}
> $ uname -a
> Linux hostname.local 3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC
> 2019 x86_64 x86_64 x86_64 GNU/Linux
> $ lsb_release -a
> LSB Version: :core-4.1-amd64:core-4.1-noarch
> Distributor ID: CentOS
> Description: CentOS Linux release 7.7.1908 (Core)
> Release: 7.7.1908
> Codename: Core
> {code}
> Flink version 1.10
>
> Reporter: Hunter Herman
> Priority: Major
> Attachments: flink_jstack.log, flink_mixed_jstack.log
>
>
> Hi Flink team!
> We've attempted to upgrade our Flink 1.9 cluster to 1.10, but are
> experiencing reproducible instability on shutdown. It appears
> that the `kill` issued in the `stop` case of flink-daemon.sh is causing the
> task executor process to hang permanently. Specifically, the process seems to
> be hanging in
> `org.apache.flink.runtime.util.JvmShutdownSafeguard$DelayedTerminator.run` in
> a `Thread.sleep()` call. I think this is bizarre behavior. Also note that
> every thread in the process is BLOCKED on a `pthread_cond_wait` call. Is
> this an OS level issue? Banging my head on a wall here. See attached stack
> traces for details.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)