[jira] [Comment Edited] (CASSANDRA-21189) Fix flaky DTest: InProgressSequenceCoordinationTest

Sam Lightfoot (Jira) Wed, 25 Feb 2026 09:44:08 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-21189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18061011#comment-18061011
 ]


Sam Lightfoot edited comment on CASSANDRA-21189 at 2/25/26 5:43 PM:
--------------------------------------------------------------------

The error that triggers the chain of port errors comes from a Paxos commit 
that times out:
{code:java}
Caused an ERROR
[2026-02-25T09:27:27.026Z] [junit-timeout] 
java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Can 
not commit transformation: "SERVER_ERROR"(Could not perform commit; policy 
Retry{remainingMs=0, attempts=2} gave up). {code}
This timeout is set to 1 second on the cluster builder (overriding the 10s 
default):
{code:java}
try (Cluster cluster = builder().withNodes(3)
                                .appendConfig(cfg -> cfg.set("progress_barrier_timeout", "5000ms")
                                                        .set("request_timeout", "1000ms")
                                                        .set("progress_barrier_backoff", "100ms")
{
{code}
The request_timeout effectively becomes the ceiling for the entire Paxos 
commit, and because the error response itself is returned successfully, the 
commit does not get retried within the cms_await_timeout budget (10m).
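
To illustrate how a short budget cuts the retry loop off, here is a minimal sketch in plain Java. It is not Cassandra's retry code: the class name, the method, and the 500ms per-attempt latency are all hypothetical, and time is simulated so the result is deterministic. It only shows that a deadline-bounded loop allows very few attempts when each attempt is slow, as it can be on a loaded CI host.

```java
public final class RetryBudgetSketch {
    /**
     * Counts how many attempts start before the budget runs out, assuming
     * (hypothetically) that every attempt takes attemptMs and never succeeds.
     * Time is simulated rather than measured.
     */
    public static int attemptsWithin(long budgetMs, long attemptMs, long backoffMs) {
        long elapsedMs = 0;
        int attempts = 0;
        while (elapsedMs < budgetMs) {
            attempts++;
            elapsedMs += attemptMs;  // time spent in the attempt itself
            elapsedMs += backoffMs;  // pause before the next retry
        }
        return attempts;
    }

    public static void main(String[] args) {
        // Assumed numbers: 500ms per attempt, 100ms backoff.
        System.out.println("1s budget:  " + attemptsWithin(1_000, 500, 100) + " attempts");
        System.out.println("10s budget: " + attemptsWithin(10_000, 500, 100) + " attempts");
    }
}
```

Under these assumed numbers the 1s budget leaves room for only a couple of attempts, while the 10s default leaves room for many more, which is why raising or removing the override looks like the low-risk fix.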

I think a fairly safe option is to increase the 1000ms request_timeout in the 
three tests where it is set, or remove it completely, given the resource 
constraints of CI.

The subsequent port-related issues seem to come from missing cleanup 
behaviour when a cluster is unable to start, so every test that runs after the 
initial SERVER_ERROR failure also fails with port binding issues.

 


was (Author: JIRAUSER302824):
The error that triggers the chain of port errors comes from a Paxos commit 
that times out:
{code:java}
Caused an ERROR
[2026-02-25T09:27:27.026Z] [junit-timeout] 
java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Can 
not commit transformation: "SERVER_ERROR"(Could not perform commit; policy 
Retry{remainingMs=0, attempts=2} gave up). {code}
This timeout is set to 1 second on the cluster builder (overriding the 10s 
default):
{code:java}
try (Cluster cluster = builder().withNodes(3)
                                .appendConfig(cfg -> cfg.set("progress_barrier_timeout", "5000ms")
                                                        .set("request_timeout", "1000ms")
                                                        .set("progress_barrier_backoff", "100ms")
{
{code}
The request_timeout effectively becomes the ceiling for the entire Paxos 
commit, and because the error response itself is returned successfully, the 
commit does not get retried within the cms_await_timeout budget 
(significantly larger).

I think a fairly safe option is to increase the 1000ms request_timeout in the 
three tests where it is set, or remove it completely, given the resource 
constraints of CI.

The subsequent port-related issues seem to come from missing cleanup 
behaviour when a cluster is unable to start, so every test that runs after the 
initial SERVER_ERROR failure also fails with port binding issues.

 

> Fix flaky DTest: InProgressSequenceCoordinationTest
> ---------------------------------------------------
>
>                 Key: CASSANDRA-21189
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21189
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Test/dtest/java
>            Reporter: Sam Lightfoot
>            Assignee: Sam Lightfoot
>            Priority: Normal
>             Fix For: 5.1
>
>
> There's a race condition between cluster closing and startup between test 
> scenarios due to lack of thread lifecycle handling. The spawned thread should 
> be joined before the test finishes to prevent the 'in-use port' errors.
> Affects
>  * bootstrapProgressTest
>  * decommissionProgressTest
>  * replacementProgressTest
> Adopt the same pattern as GossipTest with try-finally thread joining.
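
The try-finally joining idea from the issue description can be sketched in plain Java as follows. This is an illustration under assumptions, not a copy of GossipTest or of the affected tests: the class and method names are hypothetical and the test body is left as a placeholder comment. The point is only that the spawned thread is joined in a finally block, so it cannot outlive the test even when the body throws.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public final class JoinBeforeCloseSketch {
    /**
     * Starts a background task and always joins it before returning.
     * Returns true if the worker had terminated by the time the join finished.
     */
    public static boolean runAndJoin(Runnable task, long joinTimeoutMs) {
        Thread worker = new Thread(task, "sketch-worker");
        worker.start();
        try {
            // Placeholder: the test body (assertions against the cluster) would go here.
        } finally {
            try {
                // Join even if the body throws, so the thread cannot outlive the test.
                worker.join(joinTimeoutMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt status for the caller.
                Thread.currentThread().interrupt();
            }
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        AtomicBoolean finished = new AtomicBoolean(false);
        boolean joined = runAndJoin(() -> finished.set(true), 5_000);
        System.out.println("joined=" + joined + ", finished=" + finished.get());
    }
}
```

In a real test the join would sit inside the try-with-resources block for the cluster, so the worker has stopped before the cluster is closed and its ports are released.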



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
