[ https://issues.apache.org/jira/browse/FLINK-18815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173926#comment-17173926 ]
Kezhu Wang commented on FLINK-18815: ------------------------------------ I think this is caused by {{SafetyNetCloseableRegistry.PhantomDelegatingCloseableRef#close}}. It unconditionally closes underlying {{innerCloseable}} without checking whether it was closed or unregistered before. Suppose that: # At some point, {{innerCloseable}}'s wrapping closeable was finalized and become phantom reference, its phantom reference was pushed to {{SafetyNetCloseableRegistry#REAPER_THREAD#referenceQueue}}. # The closeable registry containing {{innerCloseable}} was closed thus {{innerCloseable}} got closed and {{AbstractCloseableRegistry#closeableToRef}} got cleared. # {{SafetyNetCloseableRegistry#REAPER_THREAD}} executes {{SafetyNetCloseableRegistry.PhantomDelegatingCloseableRef#close}} which close {{innerCloseable}} unconditionally. If we change {{Thread.sleep(2)}} from {{AbstractCloseableRegistryTest.ProducerThread}} to {{Thread.sleep(0)}}, then {{AbstractCloseableRegistryTest#testClose}} fails quite often. Though {{Closeable#close}} declares itself be idempotent, but it does not specify any thread-safe guarantee, so I think this should be treated as bug and thus need to fix. I think this could be fixed by closing {{SafetyNetCloseableRegistry.PhantomDelegatingCloseableRef#innerCloseable}} iff pair of {{innerCloseable}} and {{PhantomDelegatingCloseableRef}} was successfully from {{AbstractCloseableRegistry#closeableToRef}}. I think this should be sufficient to fix concurrent removing/closing of {{Closeable}} in {{SafetyNetCloseableRegistry}} due to following observation: # There are only four removing paths: {{SafetyNetCloseableRegistry#close}}, {{ClosingFSDataInputStream#close}}, {{SafetyNetCloseableRegistry#unregisterCloseable}} and {{PhantomDelegatingCloseableRef#close}}. # I think the first three paths by design should have no concurrent contention. # {{PhantomDelegatingCloseableRef#close}} has no concurrent contention with {{SafetyNetCloseableRegistry#unregisterCloseable}} and {{ClosingFSDataInputStream#close}} since it is phantom reference of later removing/closing stream. [~rmetzger] [~dianfu] [~aljoscha] [~srichter] Any opinions on this ? If we tend to fix this, could I take over it ? > AbstractCloseableRegistryTest.testClose unstable > ------------------------------------------------ > > Key: FLINK-18815 > URL: https://issues.apache.org/jira/browse/FLINK-18815 > Project: Flink > Issue Type: Bug > Components: FileSystems, Tests > Affects Versions: 1.12.0 > Reporter: Robert Metzger > Priority: Major > Labels: test-stability > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=5164&view=logs&j=0da23115-68bb-5dcd-192c-bd4c8adebde1&t=05b74a19-4ee4-5036-c46f-ada307df6cf0 > {code} > [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.509 > s <<< FAILURE! - in org.apache.flink.core.fs.SafetyNetCloseableRegistryTest > [ERROR] testClose(org.apache.flink.core.fs.SafetyNetCloseableRegistryTest) > Time elapsed: 1.15 s <<< FAILURE! > java.lang.AssertionError: expected:<0> but was:<-1> > at org.junit.Assert.fail(Assert.java:88) > at org.junit.Assert.failNotEquals(Assert.java:834) > at org.junit.Assert.assertEquals(Assert.java:645) > at org.junit.Assert.assertEquals(Assert.java:631) > at > org.apache.flink.core.fs.AbstractCloseableRegistryTest.testClose(AbstractCloseableRegistryTest.java:93) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)