[ 
https://issues.apache.org/jira/browse/FLINK-19369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251641#comment-17251641
 ] 

Nico Kruber commented on FLINK-19369:
-------------------------------------

It looks like the following code from {{BlobClientTest}} could be problematic:

{code}
InputStream is = client.getInternal(jobId, key); // (1)
// ...
BlobUtils.readFully(is, receiveBuffer, 0, firstChunkLen, null);
BlobUtils.readFully(is, receiveBuffer, firstChunkLen, firstChunkLen, null); // (2)
// ...
for (BlobServerConnection conn : getBlobServer().getCurrentActiveConnections()) {
    conn.close(); // (3)
}
// ...
BlobUtils.readFully(is, receiveBuffer, 2 * firstChunkLen, data.length - 2 * firstChunkLen, null); // (4)
// ...
{code}

I'm not 100% sure, but judging from the latest instance at 
https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=10212&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=05b74a19-4ee4-5036-c46f-ada307df6cf0,
 the {{BlobServer}}'s connection thread seems to hold the lock ("locked 
<0x000000008d873170> (a sun.security.ssl.SSLSocketOutputRecord)") while writing 
data out to the socket that the client opened in (1). The client read only the 
first two chunks in (2) and has not continued (yet), but the socket is still 
open. Now the test wants to introduce a connection failure and tries to do so 
in (3) by letting the {{BlobServer}} close the connection. However, the close 
cannot proceed because it needs the same lock inside the JDK's SSL stack 
(0x000000008d873170).

This may lead to a situation where the {{BlobServer}} is blocked waiting to 
write data to the socket because the socket's buffer is already full (I'm not 
sure this would also show up as a RUNNABLE thread, but {{BlobServerConnection}} 
is not interruptible in this code and {{outputStream.write}} must be a blocking 
operation). At the same time, the client does not continue reading because the 
test wants to introduce the failure first and only then continue reading in 
(4). Neither side can make progress, so the test hangs.
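
To illustrate the suspected interaction, here is a minimal sketch in plain Java. It does not use the real JDK SSL classes or any Flink code: {{FakeSslSocket}}, {{outputRecordLock}}, {{lockHeld}}, {{peerReads}} and the thread names are all hypothetical stand-ins, and a latch simulates the write that cannot finish until the client reads. Under those assumptions it shows that a {{close()}} needing the same monitor as a stuck {{write()}} is reported as BLOCKED, as in the thread dump.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class SslCloseDeadlockSketch {

    /** Hypothetical stand-in for an SSL socket whose write and close share one monitor. */
    static final class FakeSslSocket {
        private final Object outputRecordLock = new Object();
        private final CountDownLatch lockHeld = new CountDownLatch(1);
        private final CountDownLatch peerReads = new CountDownLatch(1);

        /** Simulates a write that blocks (send buffer full) while holding the monitor. */
        void write() throws InterruptedException {
            synchronized (outputRecordLock) {
                lockHeld.countDown();
                peerReads.await(); // stays here until the peer "reads"
            }
        }

        /** Simulates a close that needs the same monitor as write(). */
        void close() {
            synchronized (outputRecordLock) {
                // a real implementation would flush and shut down the output here
            }
        }
    }

    /** Returns the state of the closing thread while the writer is still stuck. */
    static Thread.State observeCloserState() throws InterruptedException {
        FakeSslSocket socket = new FakeSslSocket();

        Thread writer = new Thread(() -> {
            try {
                socket.write();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "server-connection");
        writer.start();
        socket.lockHeld.await(); // make sure the writer owns the monitor first

        Thread closer = new Thread(socket::close, "test-thread");
        closer.start();

        // Poll until the closer is parked on the monitor (give up after 5 seconds).
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        while (closer.getState() != Thread.State.BLOCKED && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        Thread.State observed = closer.getState();

        // Step (4) in the test: only now does the client continue reading.
        socket.peerReads.countDown();
        writer.join();
        closer.join();
        return observed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("closer state while writer is stuck: " + observeCloserState());
    }
}
```

In the sketch the hang is resolved only because {{peerReads}} is released after the observation. In the test, the equivalent step (4) runs on the same thread that is stuck in (3), so nothing would ever release the writer.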

I'm not really sure how to fix this without either adding code to 
{{BlobServerConnection}} to introduce the failure there (not really the right 
approach for code that is used only during testing) or rewriting it to use 
non-blocking IO. If this only happens with SSL (due to the additional lock), 
then maybe we should just disable these tests ( 
{{testGetFailsDuringStreamingNoJobTransientBlob}}, 
{{testGetFailsDuringStreamingForJobTransientBlob}}, 
{{testGetFailsDuringStreamingForJobPermanentBlob}} ) for {{BlobClientSslTest}} 
and trust that the behavior is no different with SSL and is already covered by 
{{BlobClientTest}}.

> BlobClientTest.testGetFailsDuringStreamingForJobPermanentBlob hangs
> -------------------------------------------------------------------
>
>                 Key: FLINK-19369
>                 URL: https://issues.apache.org/jira/browse/FLINK-19369
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Tests
>    Affects Versions: 1.11.0, 1.12.0
>            Reporter: Dian Fu
>            Priority: Major
>              Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=6803&view=logs&j=f0ac5c25-1168-55a5-07ff-0e88223afed9&t=39a61cac-5c62-532f-d2c1-dea450a66708
> {code}
> 2020-09-22T21:40:57.5304615Z "main" #1 prio=5 os_prio=0 cpu=18407.84ms elapsed=1969.42s tid=0x00007f0730015800 nid=0x79bd waiting for monitor entry  [0x00007f07389fb000]
> 2020-09-22T21:40:57.5305080Z    java.lang.Thread.State: BLOCKED (on object monitor)
> 2020-09-22T21:40:57.5305487Z  at sun.security.ssl.SSLSocketImpl.duplexCloseOutput([email protected]/SSLSocketImpl.java:541)
> 2020-09-22T21:40:57.5306159Z  - waiting to lock <0x000000008661a560> (a sun.security.ssl.SSLSocketOutputRecord)
> 2020-09-22T21:40:57.5306545Z  at sun.security.ssl.SSLSocketImpl.close([email protected]/SSLSocketImpl.java:472)
> 2020-09-22T21:40:57.5307045Z  at org.apache.flink.runtime.blob.BlobUtils.closeSilently(BlobUtils.java:367)
> 2020-09-22T21:40:57.5307605Z  at org.apache.flink.runtime.blob.BlobServerConnection.close(BlobServerConnection.java:141)
> 2020-09-22T21:40:57.5308337Z  at org.apache.flink.runtime.blob.BlobClientTest.testGetFailsDuringStreaming(BlobClientTest.java:443)
> 2020-09-22T21:40:57.5308904Z  at org.apache.flink.runtime.blob.BlobClientTest.testGetFailsDuringStreamingForJobPermanentBlob(BlobClientTest.java:408)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
