[
https://issues.apache.org/jira/browse/FLINK-19369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251641#comment-17251641
]
Nico Kruber commented on FLINK-19369:
-------------------------------------
It looks like the following code from {{BlobClientTest}} could be problematic:
{code}
InputStream is = client.getInternal(jobId, key); // (1)
// ...
BlobUtils.readFully(is, receiveBuffer, 0, firstChunkLen, null);
BlobUtils.readFully(is, receiveBuffer, firstChunkLen, firstChunkLen, null); // (2)
// ...
for (BlobServerConnection conn : getBlobServer().getCurrentActiveConnections()) {
    conn.close(); // (3)
}
// ...
BlobUtils.readFully(is, receiveBuffer, 2 * firstChunkLen, data.length - 2 * firstChunkLen, null); // (4)
// ...
{code}
I'm not 100% sure, but judging from the latest instance at
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=10212&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=05b74a19-4ee4-5036-c46f-ada307df6cf0,
the {{BlobServer}}'s connection thread seems to hold the lock "locked
<0x000000008d873170> (a sun.security.ssl.SSLSocketOutputRecord)" while writing
data out to the socket that was created from the client in (1). The client only
read the first 2 * firstChunkLen bytes in (2) and has not continued reading
(yet), but the socket is still open. Now, the test wants to introduce a
connection failure and tries to close the connection in (3) by letting the
{{BlobServer}} close it. However, the close cannot proceed because it requires
the same lock inside the JDK's SSL stack (0x000000008d873170).
This may lead to a situation where the {{BlobServer}} is blocked, waiting to
write data to the socket because the socket's buffer is already full (I'm not
sure whether this would also show up as a RUNNABLE thread, but
{{BlobServerConnection}} is not interruptible in this code and
{{outputStream.write}} must be a blocking operation), while at the same time
the client does not continue reading because we want to introduce the failure
first and only then continue reading in (4).
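
The suspected interaction can be modelled without any SSL or Flink classes. The sketch below is only an illustration under stated assumptions: {{LockedOutput}} is a hypothetical wrapper whose {{write()}} and {{close()}} share one monitor (standing in for the {{SSLSocketOutputRecord}} lock), and a small {{PipedInputStream}} that nobody drains stands in for the client that stopped reading after (2). It does not reproduce the JDK's actual SSL code paths.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

final class WriteCloseLockSketch {

    /** Hypothetical stream wrapper whose write() and close() share one monitor. */
    static final class LockedOutput {
        private final Object recordLock = new Object();
        private final PipedOutputStream out;

        LockedOutput(PipedOutputStream out) {
            this.out = out;
        }

        void write(byte[] data) throws IOException {
            synchronized (recordLock) {
                out.write(data); // blocks while the pipe buffer is full
            }
        }

        void close() throws IOException {
            synchronized (recordLock) {
                out.close();
            }
        }
    }

    /** Returns the state of the closing thread while the writer is stuck in write(). */
    static Thread.State closerStateWhileWriterIsStuck() throws Exception {
        final int pipeSize = 16;
        // Stands in for the client side that has stopped reading.
        PipedInputStream in = new PipedInputStream(pipeSize);
        LockedOutput output = new LockedOutput(new PipedOutputStream(in));

        Thread writer = new Thread(() -> {
            try {
                output.write(new byte[1024]); // far more than the pipe can hold
            } catch (IOException expectedOnCleanup) {
                // thrown once the read side is closed during cleanup below
            }
        }, "writer");
        Thread closer = new Thread(() -> {
            try {
                output.close(); // needs the monitor the writer is holding
            } catch (IOException ignored) {
                // not relevant for this sketch
            }
        }, "closer");
        writer.setDaemon(true);
        closer.setDaemon(true);

        writer.start();
        while (in.available() < pipeSize) {
            Thread.sleep(10); // wait until the pipe is full, so the writer holds the lock
        }

        closer.start();
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (closer.getState() != Thread.State.BLOCKED && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        Thread.State observed = closer.getState();

        in.close(); // cleanup: makes the pending write fail, which releases the lock
        writer.join(5_000);
        closer.join(5_000);
        return observed;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("closer state: " + closerStateWhileWriterIsStuck());
    }
}
```

In this model the closing thread stays in the BLOCKED state for as long as nobody reads from the pipe, which mirrors the "waiting to lock" entry in the thread dump. The cleanup step only exists so that the sketch terminates; the hanging test has no equivalent of it.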
I'm not really sure how to fix this without either writing code into
{{BlobServerConnection}} to introduce the failure there (not really the right
way, since that code would be used only during testing) or rewriting it to use
some non-blocking IO. If this only happens with SSL (due to this additional
lock), then maybe we should just disable these tests (
{{testGetFailsDuringStreamingNoJobTransientBlob}},
{{testGetFailsDuringStreamingForJobTransientBlob}},
{{testGetFailsDuringStreamingForJobPermanentBlob}} ) for {{BlobClientSslTest}}
and trust that the behaviour isn't different with SSL and is already covered by
{{BlobClientTest}}.
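
For contrast, here is a minimal sketch of the same sequence on a plain (non-SSL) socket. It relies on the documented behaviour of {{java.net.Socket#close()}}, which states that any thread blocked in an I/O operation on the socket will throw a {{SocketException}}. The class and method names are made up for illustration, and the sketch says nothing about how {{BlobServerConnection}} itself behaves.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.atomic.AtomicReference;

final class PlainSocketCloseSketch {

    /** Returns the exception that ended the writer thread, or null if it did not end in time. */
    static IOException closeWhileWriterIsBusy() throws Exception {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort()); // never reads
             Socket serverSide = server.accept()) {

            AtomicReference<IOException> failure = new AtomicReference<>();
            Thread writer = new Thread(() -> {
                byte[] chunk = new byte[64 * 1024];
                try {
                    OutputStream out = serverSide.getOutputStream();
                    while (true) {
                        out.write(chunk); // eventually blocks: nobody drains the socket
                    }
                } catch (IOException e) {
                    failure.set(e); // expected once the socket is closed below
                }
            }, "writer");
            writer.setDaemon(true);
            writer.start();

            Thread.sleep(500);  // give the writer time to fill the socket buffers
            serverSide.close(); // close from another thread while the writer is busy
            writer.join(5_000);
            return failure.get();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("writer ended with: " + closeWhileWriterIsBusy());
    }
}
```

Here the close is not held up by the pending write, and the writer ends with an exception instead of hanging. That is consistent with the assumption that the extra lock in the SSL stack is what makes the difference, but it is not proof of it.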
> BlobClientTest.testGetFailsDuringStreamingForJobPermanentBlob hangs
> -------------------------------------------------------------------
>
> Key: FLINK-19369
> URL: https://issues.apache.org/jira/browse/FLINK-19369
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Dian Fu
> Priority: Major
> Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=6803&view=logs&j=f0ac5c25-1168-55a5-07ff-0e88223afed9&t=39a61cac-5c62-532f-d2c1-dea450a66708
> {code}
> 2020-09-22T21:40:57.5304615Z "main" #1 prio=5 os_prio=0 cpu=18407.84ms
> elapsed=1969.42s tid=0x00007f0730015800 nid=0x79bd waiting for monitor entry
> [0x00007f07389fb000]
> 2020-09-22T21:40:57.5305080Z java.lang.Thread.State: BLOCKED (on object
> monitor)
> 2020-09-22T21:40:57.5305487Z at
> sun.security.ssl.SSLSocketImpl.duplexCloseOutput([email protected]/SSLSocketImpl.java:541)
> 2020-09-22T21:40:57.5306159Z - waiting to lock <0x000000008661a560> (a
> sun.security.ssl.SSLSocketOutputRecord)
> 2020-09-22T21:40:57.5306545Z at
> sun.security.ssl.SSLSocketImpl.close([email protected]/SSLSocketImpl.java:472)
> 2020-09-22T21:40:57.5307045Z at
> org.apache.flink.runtime.blob.BlobUtils.closeSilently(BlobUtils.java:367)
> 2020-09-22T21:40:57.5307605Z at
> org.apache.flink.runtime.blob.BlobServerConnection.close(BlobServerConnection.java:141)
> 2020-09-22T21:40:57.5308337Z at
> org.apache.flink.runtime.blob.BlobClientTest.testGetFailsDuringStreaming(BlobClientTest.java:443)
> 2020-09-22T21:40:57.5308904Z at
> org.apache.flink.runtime.blob.BlobClientTest.testGetFailsDuringStreamingForJobPermanentBlob(BlobClientTest.java:408)
> {code}