[ https://issues.apache.org/jira/browse/CASSANDRA-16532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307422#comment-17307422 ]
Adam Holmberg commented on CASSANDRA-16532:
-------------------------------------------

I think I see what's happening. Reference counting for the ChunkCache buffer is [allowed to go below zero|https://github.com/apache/cassandra/blob/bf96367f4d55692017e144980cf17963e31df127/src/java/org/apache/cassandra/cache/ChunkCache.java#L135]. It is then possible to [find a non-zero refCount, return a non-null reference after incrementing from -1 --> 0, and arrive at {{buffer}} to find that the reference count is now zero|https://github.com/apache/cassandra/blob/bf96367f4d55692017e144980cf17963e31df127/src/java/org/apache/cassandra/cache/ChunkCache.java#L111-L122].

We're getting into this state while racing with an async task that is closing the file. The file is being closed as part of the tidy task:

{noformat}
[junit-timeout] at org.apache.cassandra.cache.ChunkCache$Buffer.release(ChunkCache.java:158)
[junit-timeout] at org.apache.cassandra.cache.ChunkCache.onRemoval(ChunkCache.java:187)
[junit-timeout] at org.apache.cassandra.cache.ChunkCache.onRemoval(ChunkCache.java:41)
[junit-timeout] at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$notifyRemoval$1(BoundedLocalCache.java:286)
[junit-timeout] at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
[junit-timeout] at com.github.benmanes.caffeine.cache.BoundedLocalCache.notifyRemoval(BoundedLocalCache.java:292)
[junit-timeout] at com.github.benmanes.caffeine.cache.BoundedLocalCache.removeNoWriter(BoundedLocalCache.java:1731)
[junit-timeout] at com.github.benmanes.caffeine.cache.BoundedLocalCache.remove(BoundedLocalCache.java:1695)
[junit-timeout] at com.github.benmanes.caffeine.cache.LocalCache.invalidateAll(LocalCache.java:126)
[junit-timeout] at com.github.benmanes.caffeine.cache.LocalManualCache.invalidateAll(LocalManualCache.java:79)
[junit-timeout] at org.apache.cassandra.cache.ChunkCache.invalidateFile(ChunkCache.java:218)
[junit-timeout] at org.apache.cassandra.io.util.FileHandle$Cleanup.lambda$tidy$0(FileHandle.java:208)
[junit-timeout] at java.util.Optional.ifPresent(Optional.java:159)
[junit-timeout] at org.apache.cassandra.io.util.FileHandle$Cleanup.tidy(FileHandle.java:208)
[junit-timeout] at org.apache.cassandra.utils.concurrent.Ref$GlobalState.release(Ref.java:325)
[junit-timeout] at org.apache.cassandra.utils.concurrent.Ref$State.ensureReleased(Ref.java:203)
[junit-timeout] at org.apache.cassandra.utils.concurrent.Ref.ensureReleased(Ref.java:128)
[junit-timeout] at org.apache.cassandra.utils.concurrent.SharedCloseableImpl.close(SharedCloseableImpl.java:45)
[junit-timeout] at org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier$1.run(SSTableReader.java:2058)
[junit-timeout] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
{noformat}

That task was scheduled by the previous scrub test:

{noformat}
[junit-timeout] at org.apache.cassandra.io.sstable.format.SSTableReader$InstanceTidier.tidy(SSTableReader.java:2020)
[junit-timeout] at org.apache.cassandra.utils.concurrent.Ref$GlobalState.release(Ref.java:325)
[junit-timeout] at org.apache.cassandra.utils.concurrent.Ref$State.release(Ref.java:224)
[junit-timeout] at org.apache.cassandra.utils.concurrent.Ref.release(Ref.java:118)
[junit-timeout] at org.apache.cassandra.db.compaction.Scrubber.lambda$scrub$0(Scrubber.java:303)
[junit-timeout] at java.util.ArrayList.forEach(ArrayList.java:1257)
[junit-timeout] at org.apache.cassandra.db.compaction.Scrubber.scrub(Scrubber.java:303)
[junit-timeout] at org.apache.cassandra.tools.StandaloneScrubber.main(StandaloneScrubber.java:226)
[junit-timeout] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[junit-timeout] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[junit-timeout] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[junit-timeout] at java.lang.reflect.Method.invoke(Method.java:498)
[junit-timeout] at org.apache.cassandra.tools.ToolRunner.runClassAsTool(ToolRunner.java:82)
[junit-timeout] at org.apache.cassandra.tools.ToolRunner$2.get(ToolRunner.java:249)
[junit-timeout] at org.apache.cassandra.tools.ToolRunner$2.get(ToolRunner.java:245)
[junit-timeout] at org.apache.cassandra.tools.ToolRunner.invokeSupplier(ToolRunner.java:305)
[junit-timeout] at org.apache.cassandra.tools.ToolRunner.invokeClass(ToolRunner.java:253)
[junit-timeout] at org.apache.cassandra.tools.ToolRunner.invokeClass(ToolRunner.java:235)
[junit-timeout] at org.apache.cassandra.db.ScrubTest.testHeaderFixWithTool(ScrubTest.java:874)
{noformat}

I had hoped it would be sufficient to disallow negative numbers for the ref count, but at first blush that is revealing other issues. The work goes on.

> Fix flaky testSkipScrubCorruptedCounterRowWithTool
> --------------------------------------------------
>
>                 Key: CASSANDRA-16532
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16532
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Test/unit
>            Reporter: Berenguer Blasi
>            Assignee: Berenguer Blasi
>            Priority: Normal
>             Fix For: 4.0-rc
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Fix flaky
> [testSkipScrubCorruptedCounterRowWithTool|https://ci-cassandra.apache.org/job/Cassandra-trunk/365/testReport/junit/org.apache.cassandra.db/ScrubTest/testSkipScrubCorruptedCounterRowWithTool_compression/]
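
The reference-count behaviour described in the comment can be replayed in isolation. What follows is a minimal single-threaded sketch and not the actual ChunkCache code: {{RefCountSketch}}, {{LenientHolder}}, {{StrictHolder}} and {{isUsable}} are hypothetical names invented for illustration. The sketch also assumes that the count reaches -1 through an extra {{release()}} call, which is one plausible route and is not confirmed by the traces above. {{LenientHolder}} treats only an exact zero as "released", so a count of -1 passes the check, is incremented to 0, and the caller receives a non-null reference to a holder that is no longer usable. {{StrictHolder}} shows the "disallow negative numbers" idea from the comment.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class RefCountSketch {
    /** Hypothetical holder whose count may go negative on an extra release(). */
    static class LenientHolder {
        final AtomicInteger references = new AtomicInteger(1); // starts referenced

        /** Returns this holder, or null only if the count is exactly zero. */
        LenientHolder reference() {
            int refCount;
            do {
                refCount = references.get();
                if (refCount == 0)
                    return null; // a negative count is NOT caught here
            } while (!references.compareAndSet(refCount, refCount + 1));
            return this;
        }

        /** Unconditional decrement: nothing stops the count at zero. */
        void release() {
            references.decrementAndGet();
        }

        boolean isUsable() {
            return references.get() > 0;
        }
    }

    /** Hypothetical stricter variant: any non-positive count means "released". */
    static class StrictHolder {
        final AtomicInteger references = new AtomicInteger(1);

        StrictHolder reference() {
            int refCount;
            do {
                refCount = references.get();
                if (refCount <= 0)
                    return null;
            } while (!references.compareAndSet(refCount, refCount + 1));
            return this;
        }

        /** Refuses to decrement past zero instead of going negative. */
        void release() {
            int refCount;
            do {
                refCount = references.get();
                if (refCount <= 0)
                    throw new IllegalStateException("release() without a matching reference");
            } while (!references.compareAndSet(refCount, refCount - 1));
        }

        boolean isUsable() {
            return references.get() > 0;
        }
    }

    public static void main(String[] args) {
        LenientHolder lenient = new LenientHolder();
        lenient.release();                        // 1 -> 0
        lenient.release();                        // 0 -> -1 (assumed extra release)
        LenientHolder ref = lenient.reference();  // -1 -> 0, yet non-null
        System.out.println("lenient: reference returned non-null = " + (ref != null));
        System.out.println("lenient: usable = " + lenient.isUsable());

        StrictHolder strict = new StrictHolder();
        strict.release();                         // 1 -> 0
        try {
            strict.release();                     // rejected
        } catch (IllegalStateException e) {
            System.out.println("strict: " + e.getMessage());
        }
        System.out.println("strict: reference returned null = " + (strict.reference() == null));
    }
}
```

The strict variant surfaces the unmatched {{release()}} as an exception rather than hiding it, which fits the observation in the comment that disallowing negative counts reveals other issues.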