[ 
https://issues.apache.org/jira/browse/FLINK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173926#comment-17173926
 ] 

Kezhu Wang commented on FLINK-18815:
------------------------------------

I think this is caused by 
{{SafetyNetCloseableRegistry.PhantomDelegatingCloseableRef#close}}. It 
unconditionally closes underlying {{innerCloseable}} without checking whether 
it was closed or unregistered before. Suppose that:
 # At some point, {{innerCloseable}}'s wrapping closeable was finalized and 
become phantom reference, its phantom reference was pushed to 
{{SafetyNetCloseableRegistry#REAPER_THREAD#referenceQueue}}.
 # The closeable registry containing {{innerCloseable}} was closed thus 
{{innerCloseable}} got closed and {{AbstractCloseableRegistry#closeableToRef}} 
got cleared.

 # {{SafetyNetCloseableRegistry#REAPER_THREAD}} executes 
{{SafetyNetCloseableRegistry.PhantomDelegatingCloseableRef#close}} which close 
{{innerCloseable}} unconditionally.

If we change {{Thread.sleep(2)}} from 
{{AbstractCloseableRegistryTest.ProducerThread}} to {{Thread.sleep(0)}}, then 
{{AbstractCloseableRegistryTest#testClose}} fails quite often.

Though {{Closeable#close}} declares itself be idempotent, but it does not 
specify any thread-safe guarantee, so I think this should be treated as bug and 
thus need to fix.

I think this could be fixed by closing 
{{SafetyNetCloseableRegistry.PhantomDelegatingCloseableRef#innerCloseable}} iff 
pair of {{innerCloseable}} and {{PhantomDelegatingCloseableRef}} was 
successfully from {{AbstractCloseableRegistry#closeableToRef}}.

I think this should be sufficient to fix concurrent removing/closing of 
{{Closeable}} in {{SafetyNetCloseableRegistry}} due to following observation:
 # There are only four removing paths: {{SafetyNetCloseableRegistry#close}}, 
{{ClosingFSDataInputStream#close}}, 
{{SafetyNetCloseableRegistry#unregisterCloseable}} and 
{{PhantomDelegatingCloseableRef#close}}.
 # I think the first three paths by design should have no concurrent contention.
 # {{PhantomDelegatingCloseableRef#close}} has no concurrent contention with 
{{SafetyNetCloseableRegistry#unregisterCloseable}} and 
{{ClosingFSDataInputStream#close}} since it is phantom reference of later 
removing/closing stream.

[~rmetzger] [~dianfu] [~aljoscha] [~srichter] Any opinions on this ? If we tend 
to fix this, could I take over it ?

> AbstractCloseableRegistryTest.testClose unstable
> ------------------------------------------------
>
>                 Key: FLINK-18815
>                 URL: https://issues.apache.org/jira/browse/FLINK-18815
>             Project: Flink
>          Issue Type: Bug
>          Components: FileSystems, Tests
>    Affects Versions: 1.12.0
>            Reporter: Robert Metzger
>            Priority: Major
>              Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5164&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=05b74a19-4ee4-5036-c46f-ada307df6cf0
> {code}
> [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.509 
> s <<< FAILURE! - in org.apache.flink.core.fs.SafetyNetCloseableRegistryTest
> [ERROR] testClose(org.apache.flink.core.fs.SafetyNetCloseableRegistryTest)  
> Time elapsed: 1.15 s  <<< FAILURE!
> java.lang.AssertionError: expected:<0> but was:<-1>
>       at org.junit.Assert.fail(Assert.java:88)
>       at org.junit.Assert.failNotEquals(Assert.java:834)
>       at org.junit.Assert.assertEquals(Assert.java:645)
>       at org.junit.Assert.assertEquals(Assert.java:631)
>       at 
> org.apache.flink.core.fs.AbstractCloseableRegistryTest.testClose(AbstractCloseableRegistryTest.java:93)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to