[ 
https://issues.apache.org/jira/browse/CASSANDRA-21410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Capwell updated CASSANDRA-21410:
--------------------------------------
    Test and Documentation Plan: a new test validated the fix but was removed 
for maintenance reasons
                         Status: Patch Available  (was: Open)

> ShardDurability.markDefunct() called O(N²) times across topology updates, 
> causing log spam and OOM in tests
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-21410
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21410
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Accord
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>             Fix For: 6.0.x
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> ShardDurability.updateTopology() has a bug where defunct schedulers 
> accumulate in the shardSchedulers map and are re-marked defunct on every 
> subsequent topology change, producing O(N²) log messages.
> The issue is in updateTopology():
> {code}
> shardSchedulers.putAll(prev);            // puts defunct schedulers back into the map
> prev.forEach((r, s) -> s.markDefunct()); // marks them defunct (again)
> {code}
> When a topology change removes a shard range, its scheduler is marked defunct 
> but kept in shardSchedulers (via putAll) so it can finish in-flight work 
> before self-removing. However, on the next topology change, these 
> already-defunct schedulers are copied into the new prev map, survive the 
> removal loop (their range doesn't exist in the new topology), and get 
> markDefunct() called again. Every subsequent topology change re-processes all 
> previously-defunct schedulers that haven't yet self-removed.
> With N topology changes, markDefunct() is called 1 + 2 + 3 + ... + N = 
> N*(N+1)/2 times total.
> This was observed in CI running ShortReadProtectionTest, which is 
> parameterized with 24 combinations x 15 test methods = 360 iterations, each 
> creating a new table (and thus a new topology epoch). With 
> accord.shard_durability_target_splits=4, ShardDurability.java:173 produced 
> 173,534 INFO-level log lines across an 11-minute test run. The JUnit test 
> formatter buffers all stdout in a ByteArrayOutputStream with no size cap, and 
> the accumulated ~155 MiB of log output exhausted the 1G test JVM heap, 
> causing an OOM.
> This ticket / patch was generated by Opus 4.6
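
The accumulation described in the ticket can be reproduced with a small standalone model. This is only a sketch under stated assumptions, not the Accord code and not the actual patch: the Scheduler class, the integer range keys, the countMarkDefunctCalls helper, and the skipAlreadyDefunct guard are all hypothetical names introduced for illustration. It assumes each topology change retires exactly one live range and that no defunct scheduler self-removes in the meantime.

```java
import java.util.HashMap;
import java.util.Map;

public class DefunctReMarkSketch {
    /** Hypothetical stand-in for a shard scheduler; it only tracks the defunct flag. */
    static final class Scheduler {
        boolean defunct;

        void markDefunct(int[] calls) {
            defunct = true;
            calls[0]++; // each call corresponds to one log line in the ticket's scenario
        }
    }

    /**
     * Simulates topologyChanges epochs and returns the total number of markDefunct() calls.
     * When skipAlreadyDefunct is true, schedulers that are already defunct are not re-marked
     * (an assumed guard, shown only for comparison).
     */
    public static int countMarkDefunctCalls(int topologyChanges, boolean skipAlreadyDefunct) {
        Map<Integer, Scheduler> shardSchedulers = new HashMap<>();
        int[] calls = {0};
        shardSchedulers.put(0, new Scheduler()); // range 0 is live before the first change

        for (int epoch = 1; epoch <= topologyChanges; epoch++) {
            // Copy everything currently tracked, including schedulers that are already defunct.
            Map<Integer, Scheduler> prev = new HashMap<>(shardSchedulers);
            shardSchedulers.clear();

            // The new topology owns a single new range; the previous live range is retired.
            shardSchedulers.put(epoch, new Scheduler());
            prev.remove(epoch); // removal loop: no old range matches the new topology

            // Mirrors the two lines quoted in the ticket.
            shardSchedulers.putAll(prev);
            for (Scheduler s : prev.values()) {
                if (skipAlreadyDefunct && s.defunct) {
                    continue;
                }
                s.markDefunct(calls);
            }
        }
        return calls[0];
    }

    public static void main(String[] args) {
        int n = 360; // number of topology changes, matching the iteration count in the ticket
        System.out.println("without guard: " + countMarkDefunctCalls(n, false));
        System.out.println("with guard:    " + countMarkDefunctCalls(n, true));
    }
}
```

Under these assumptions the unguarded count follows N*(N+1)/2 while the guarded count grows linearly with N. The model has one scheduler per range, so it does not attempt to reproduce the 173,534 figure observed with accord.shard_durability_target_splits=4.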



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
