Ryan Scudellari created FLINK-24156:
---------------------------------------
Summary: BlobServer crashes due to SocketTimeoutException in Java 11
Key: FLINK-24156
URL: https://issues.apache.org/jira/browse/FLINK-24156
Project: Flink
Issue Type: Bug
Components: Runtime / Network
Affects Versions: 1.13.2, 1.12.4
Environment: Java 11
CentOS 7.6
Reporter: Ryan Scudellari
h3. Overview
We have seen the BlobServer crash due to a *SocketTimeoutException* while
running on JRE 11. This is likely caused by a [JDK bug present in JDK
11|https://bugs.openjdk.java.net/browse/JDK-8237858] (fixed in JDK 16) that
erroneously throws _SocketTimeoutException_ when _ServerSocket.accept()_ is
interrupted by any UNIX signal. The BlobServer calls _accept()_ when
establishing connections with clients and is expected to block indefinitely.
[The BlobServer currently shuts down when it catches a
Throwable|https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/blob/BlobServer.java#L267].
We do not see this behavior when running the same steps in JRE 8.
h3. Reproducing the issue
To reproduce, send a _SIGPIPE_ using the native thread ID (_nid_) of the
BlobServer listener thread rather than the JVM _PID_. You will need to be
running a Flink cluster on JRE 11 and have the tools _jps_ and _jstack_
available to find the relevant IDs.
One-liner:
{code:bash}
kill -SIGPIPE $(jstack $(jps -v | grep StandaloneApplicationClusterEntryPoint |
cut -f 1 -d ' ') | grep BLOB | awk '{print $13}' | awk -F'[=]' '{print $2}' |
xargs printf "%d")
{code}
Step by step:
# Run
{code:bash}
jstack [PID] | grep BLOB
{code}
where *PID* is the process ID of the job manager.
# Find the *nid=[HEX]* value and convert the HEX to decimal.
# Run
{code:bash}
kill -SIGPIPE [DNID]
{code}
where *DNID* is the converted decimal value of *HEX nid* from the previous step.
# Observe the following error in the job manager logs:
{noformat}
2021-09-03 09:56:12.517 [BLOB Server listener at 6124] ERROR
org.apache.flink.runtime.blob.BlobServer - BLOB server stopped working.
Shutting down
at java.base/java.net.PlainSocketImpl.socketAccept
at java.base/java.net.AbstractPlainSocketImpl.accept
at java.base/java.net.ServerSocket.implAccept
at java.base/java.net.ServerSocket.accept
at org.apache.flink.runtime.blob.BlobServer.run(BlobServer.java:266)
2021-09-03 09:56:12.527 [BLOB Server listener at 6124] INFO
org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:6124
{noformat}
h3. Proposed Fix
To protect ourselves from this JDK bug, we propose the workaround of catching
_SocketTimeoutException_ and retrying the _ServerSocket.accept()_ call
indefinitely.
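A minimal sketch of what such a workaround could look like is shown below. The class name _AcceptRetry_ and the method name _acceptRetryingOnTimeout_ are hypothetical and are not taken from the Flink code base. The sketch assumes the server socket has no _SO_TIMEOUT_ configured, so any _SocketTimeoutException_ thrown by _accept()_ can only be spurious and is safe to retry.

{code:java}
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

/** Hypothetical helper illustrating the proposed workaround; not actual Flink code. */
public final class AcceptRetry {

    private AcceptRetry() {}

    /**
     * Calls accept() on the given server socket and retries whenever a
     * SocketTimeoutException is thrown. Any other IOException is propagated.
     */
    public static Socket acceptRetryingOnTimeout(ServerSocket serverSocket) throws IOException {
        // Assumption: the caller never configures a timeout. If one is set,
        // a timeout is legitimate and retrying forever would hide it.
        if (serverSocket.getSoTimeout() != 0) {
            throw new IllegalArgumentException("SO_TIMEOUT must be 0 (no timeout)");
        }
        while (true) {
            try {
                return serverSocket.accept();
            } catch (SocketTimeoutException e) {
                // With SO_TIMEOUT == 0 this exception is unexpected (see
                // JDK-8237858), so ignore it and call accept() again.
            }
        }
    }
}
{code}

The accept loop in _BlobServer.run()_ could then call this helper in place of _ServerSocket.accept()_, so that only exceptions other than the spurious timeout reach the existing shutdown path.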
Thanks to [~bsanders-wf] for helping track this down.