[ 
https://issues.apache.org/jira/browse/CASSANDRA-21186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061151#comment-18061151
 ] 

Dmitry Konstantinov edited comment on CASSANDRA-21186 at 2/25/26 7:39 PM:
--------------------------------------------------------------------------

Additionally, we have an implicit leak (it mostly affects tests, but it could 
probably also become visible if we add/remove nodes a lot): AccordCommandStores 
schedules a periodic task and does not cancel it on shutdown. As a result, the 
scheduled task keeps a reference to the AccordCommandStores as well as to the 
terminated AccordExecutors (which keep references to their inner terminated 
threads):
!threads_kept_by_accord.png|width=500! 
!accord_stores_scheduled_task_leak.png|width=500!
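
To illustrate the retention pattern described above, here is a minimal sketch 
using only the JDK's ScheduledThreadPoolExecutor. It is not the Cassandra code: 
the Stores class, its refresh() method, the one-hour period and the choice of 
scheduler are all assumptions made for illustration. It only shows that a 
periodic task created from an instance method stays in the scheduler's queue 
(and so keeps its owner reachable) until it is cancelled.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PeriodicTaskLeakSketch {

    /** Hypothetical stand-in for a component such as AccordCommandStores. */
    static final class Stores {
        private final ScheduledFuture<?> periodicTask;

        Stores(ScheduledExecutorService scheduler) {
            // this::refresh captures `this`: while the task stays scheduled,
            // the scheduler's queue strongly references this Stores instance
            // and everything reachable from it.
            periodicTask = scheduler.scheduleAtFixedRate(this::refresh, 1, 1, TimeUnit.HOURS);
        }

        void refresh() {
            // Periodic housekeeping would go here (hypothetical).
        }

        void shutdown() {
            // Cancel the periodic task so the scheduler can drop its reference.
            periodicTask.cancel(false);
        }
    }

    /** Returns the scheduler queue size before and after Stores.shutdown(). */
    public static int[] queueSizesAroundShutdown() {
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);
        // Remove cancelled tasks from the queue immediately instead of
        // leaving them queued until their delay elapses (the JDK default).
        scheduler.setRemoveOnCancelPolicy(true);
        try {
            Stores stores = new Stores(scheduler);
            int before = scheduler.getQueue().size();
            stores.shutdown();
            int after = scheduler.getQueue().size();
            return new int[] { before, after };
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        int[] sizes = queueSizesAroundShutdown();
        System.out.println("queued before shutdown: " + sizes[0]);
        System.out.println("queued after shutdown:  " + sizes[1]);
    }
}
```

If the cancel call in shutdown() is omitted, the task remains queued for as 
long as the scheduler lives, which matches the kind of retention path shown in 
the heap screenshots.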



> Test failure: org.apache.cassandra.index.accord.RouteIndexTest
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-21186
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21186
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/java
>            Reporter: Dmitry Konstantinov
>            Assignee: Dmitry Konstantinov
>            Priority: Normal
>             Fix For: 5.x
>
>         Attachments: RouteIndexTest_thread_dump.txt, 
> TEST-org.apache.cassandra.index.accord.RouteIndexTest.log.xz, 
> accord_executor_threads.png, accord_stores_scheduled_task_leak.png, 
> accoud_journal_threads.png, dominators_top_level.png, 
> histogram_retained_heap.png, metrics_cleaner_thread.png, 
> threads_kept_by_accord.png, total_number_of_threads.png
>
>
> The test almost always fails now on the main CI; the latest example: 
> https://ci-cassandra.apache.org/job/Cassandra-trunk/2410/#showFailuresLink
> It happens for different JDKs and configurations (compression, cdc, oa).
> Usually it fails with a JUnit test timeout:
> {code}
> Timeout occurred. Please note the time in the report does not reflect the 
> time until the timeout.
> {code}
> Thanks to CASSANDRA-21172 we can now see a thread dump ( 
> [^RouteIndexTest_thread_dump.txt] ) to clarify where we got stuck.
> https://ci-cassandra.apache.org/job/Cassandra-trunk/2410/cloudbees-pipeline-explorer/?line=450882
> test-cdc jdk11 3/20
> Here we can observe a Java OOM as well as the stuck main thread:
> {code}
> [junit-timeout] "main" #1 prio=5 os_prio=0 cpu=9993.84ms elapsed=480.27s 
> tid=0x00007fc3e401f000 nid=0xf44 waiting on condition  [0x00007fc3eabd0000]
> [junit-timeout]    java.lang.Thread.State: WAITING (parking)
> [junit-timeout]       at 
> jdk.internal.misc.Unsafe.park([email protected]/Native Method)
> [junit-timeout]       at 
> java.util.concurrent.locks.LockSupport.park([email protected]/LockSupport.java:323)
> [junit-timeout]       at 
> org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:322)
> [junit-timeout]       at 
> org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:300)
> [junit-timeout]       at 
> org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:306)
> [junit-timeout]       at 
> org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:338)
> [junit-timeout]       at 
> org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitThrowUncheckedOnInterrupt(Awaitable.java:131)
> [junit-timeout]       at 
> org.apache.cassandra.utils.concurrent.Awaitable$AbstractAwaitable.awaitThrowUncheckedOnInterrupt(Awaitable.java:235)
> [junit-timeout]       at 
> org.apache.cassandra.tcm.log.LocalLog$Async.runOnce(LocalLog.java:760)
> [junit-timeout]       at 
> org.apache.cassandra.tcm.log.LocalLog$Async$AwaitCommit.get(LocalLog.java:868)
> [junit-timeout]       at 
> org.apache.cassandra.tcm.log.LocalLog$Async$AwaitCommit.get(LocalLog.java:860)
> [junit-timeout]       at 
> org.apache.cassandra.tcm.log.LocalLog$Async.awaitAtLeast(LocalLog.java:709)
> [junit-timeout]       at 
> org.apache.cassandra.tcm.AbstractLocalProcessor.commit(AbstractLocalProcessor.java:106)
> [junit-timeout]       at 
> org.apache.cassandra.tcm.ClusterMetadataService$SwitchableProcessor.commit(ClusterMetadataService.java:973)
> [junit-timeout]       at 
> org.apache.cassandra.tcm.Processor.commit(Processor.java:50)
> [junit-timeout]       at 
> org.apache.cassandra.tcm.ClusterMetadataService.commit(ClusterMetadataService.java:613)
> [junit-timeout]       at 
> org.apache.cassandra.tcm.ClusterMetadataService.commit(ClusterMetadataService.java:578)
> [junit-timeout]       at 
> org.apache.cassandra.ServerTestUtils$ResettableClusterMetadataService.reset(ServerTestUtils.java:385)
> [junit-timeout]       at 
> org.apache.cassandra.ServerTestUtils.resetCMS(ServerTestUtils.java:357)
> [junit-timeout]       at 
> org.apache.cassandra.cql3.CQLTester.afterTest(CQLTester.java:534)
> {code}


