[jira] [Updated] (CASSANDRA-20993) Unnecessary write timeouts during node replacement due to conservative pending node accounting in consistency level calculations

Runtian Liu (Jira) Tue, 25 Nov 2025 16:15:10 -0800


     [ 
https://issues.apache.org/jira/browse/CASSANDRA-20993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Runtian Liu updated CASSANDRA-20993:
------------------------------------
    Description: 
h3. Background

Cassandra includes pending nodes in the write quorum calculation to ensure 
consistency level guarantees are maintained during topology changes (e.g., 
bootstrap, node replacement). This is implemented in 
ConsistencyLevel.blockForWrite() where pending replicas are added to the 
required blockFor count.
h3. Current Behavior

When a node is bootstrapping or being added to the cluster, all pending 
replicas are unconditionally added to the blockFor requirement, regardless of 
whether they are replacement nodes or new capacity additions.Example with RF=3, 
CL=LOCAL_QUORUM: * New node bootstrap (increasing capacity): blockFor = 2 
(quorum) → with pending = 3 
 * Node replacement (old node being replaced): blockFor = 2 → with pending = 3 

h3. Problem Statement

During a node replacement scenario where a pending node is joining to replace 
an existing node:Current situation: * Natural replicas: A, B, C (C is being 
replaced)
 * Pending replica: D (replacement for C)

 * Live replicas: A, B, D (3 live nodes after C goes down)

 * Required ACKs: 3 (base quorum of 2 + 1 pending)

The issue: # Although we have 3 live replicas capable of responding, the 
blockFor requirement of 3 is artificially inflated
 # The pending replica (D) is typically busy during bootstrap and may respond 
slowly

 # Even though we've received ACKs from A and B (satisfying quorum of natural 
replicas), the write blocks waiting for an ACK from the slow pending replica D

 # This unnecessary dependency on the replacement pending replica's 
responsiveness causes write timeouts during node replacement operations

h3. Root Cause

As the original comment in the ticket 
https://issues.apache.org/jira/browse/CASSANDRA-833 mentioned:
{quote}we want to satisfy CL for both the pre- and post-bootstrap nodes (in 
case bootstrap aborts). This requires treating the old/new range owner as a 
unit: both D *and* C need to accept the write for it to count towards CL. So 
rather than considering
Unknown macro: \{A, B, C, D}
we should consider
Unknown macro: \{A, B, (C, D)}
This is a lot of complexity to introduce.
{quote}
the current implementation is a "simplification" idea.
The current implementation conflates natural and pending replicas into a single 
blockFor calculation: * blockFor = quorum(all replicas including pending)

However, the two replica types serve different purposes:

      * Natural replicas: The authoritative owners of the data (by replication 
strategy)
 * Pending replicas: Temporary nodes either adding capacity (new bootstrap) or 
replacing an existing node (node replacement)

The current approach treats all pending nodes identically, but they should be 
handled differently based on their topology role.
h3. Proposed Solution

For quorum consistency level:

Decouple blockFor calculation into separate requirements for natural and 
pending replicas:For normal operations and new bootstrap (Keep current behavior 
as the implementation might be complex and we can do later): * blockFor = 
quorum(natural replicas) + all pending replicas
 * This ensures: quorum of natural replicas respond, PLUS any pending nodes 
respond

 * Protects against topology change cancellation

{color:#de350b}For node replacement:{color} * {color:#de350b}blockFor = 
quorum(natural replicas) only{color}
 * {color:#de350b}This ensures: only quorum of natural replicas need to 
respond{color}

 * {color:#de350b}Pending replacement node responds when available but is not 
required{color}
 * {color:#de350b}Eliminates unnecessary dependency on busy pending replica 
during replacement{color}

h3. Implementation

For normal node bootstrap, we will block as what we are doing now as it is too 
complicate to determine the unit of a (new owner and old owner).

For node replacements, as the old node is down, the (new node + old node) unit 
cannot response the write anyway. We will just block for the consistency 
level(natural replicas).

Only natural replicas' responses will be counted. Note, the response from the 
old node should not be counted as the old node should be shut down. If somehow 
it was able to ack writes, we should still ignore it.

The implementation should not be complicated as the node replacement 
relationship is clear.

 

As this is changing the behavior of the writes when there are pending nodes, 
the change will be added with a new config and by default the feature will be 
disabled.
 

  was:
h3. Background

Cassandra includes pending nodes in the write quorum calculation to ensure 
consistency level guarantees are maintained during topology changes (e.g., 
bootstrap, node replacement). This is implemented in 
ConsistencyLevel.blockForWrite() where pending replicas are added to the 
required blockFor count.
h3. Current Behavior

When a node is bootstrapping or being added to the cluster, all pending 
replicas are unconditionally added to the blockFor requirement, regardless of 
whether they are replacement nodes or new capacity additions.Example with RF=3, 
CL=LOCAL_QUORUM: * New node bootstrap (increasing capacity): blockFor = 2 
(quorum) → with pending = 3 
 * Node replacement (old node being replaced): blockFor = 2 → with pending = 3 

h3. Problem Statement

During a node replacement scenario where a pending node is joining to replace 
an existing node:Current situation: * Natural replicas: A, B, C (C is being 
replaced)
 * Pending replica: D (replacement for C)

 * Live replicas: A, B, D (3 live nodes after C goes down)

 * Required ACKs: 3 (base quorum of 2 + 1 pending)

The issue: # Although we have 3 live replicas capable of responding, the 
blockFor requirement of 3 is artificially inflated
 # The pending replica (D) is typically busy during bootstrap and may respond 
slowly

 # Even though we've received ACKs from A and B (satisfying quorum of natural 
replicas), the write blocks waiting for an ACK from the slow pending replica D

 # This unnecessary dependency on the replacement pending replica's 
responsiveness causes write timeouts during node replacement operations

h3. Root Cause

As the original comment in the ticket 
https://issues.apache.org/jira/browse/CASSANDRA-833 mentioned:
{quote}we want to satisfy CL for both the pre- and post-bootstrap nodes (in 
case bootstrap aborts). This requires treating the old/new range owner as a 
unit: both D *and* C need to accept the write for it to count towards CL. So 
rather than considering
Unknown macro: \{A, B, C, D}
we should consider
Unknown macro: \{A, B, (C, D)}
This is a lot of complexity to introduce.
{quote}
the current implementation is a "simplification" idea.
The current implementation conflates natural and pending replicas into a single 
blockFor calculation: * blockFor = quorum(all replicas including pending)

However, the two replica types serve different purposes: * Natural replicas: 
The authoritative owners of the data (by replication strategy)
 * Pending replicas: Temporary nodes either adding capacity (new bootstrap) or 
replacing an existing node (node replacement)

The current approach treats all pending nodes identically, but they should be 
handled differently based on their topology role.
h3. Proposed Solution

For quorum consistency level:

Decouple blockFor calculation into separate requirements for natural and 
pending replicas:For normal operations and new bootstrap (Keep current behavior 
as the implementation might be complex and we can do later): * blockFor = 
quorum(natural replicas) + all pending replicas
 * This ensures: quorum of natural replicas respond, PLUS any pending nodes 
respond

 * Protects against topology change cancellation

{color:#de350b}For node replacement:{color} * {color:#de350b}blockFor = 
quorum(natural replicas) only{color}
 * {color:#de350b}This ensures: only quorum of natural replicas need to 
respond{color}

 * {color:#de350b}Pending replacement node responds when available but is not 
required{color}
 * {color:#de350b}Eliminates unnecessary dependency on busy pending replica 
during replacement{color}

h3. Implementation

For normal node bootstrap, we will block as what we are doing now as it is too 
complicate to determine the unit of a (new owner and old owner).

For node replacements, as the old node is down, the (new node + old node) unit 
cannot response the write anyway. We will just block for the consistency 
level(natural replicas).

The implementation should not be complicated as the node replacement 
relationship is clear.
 


> Unnecessary write timeouts during node replacement due to conservative 
> pending node accounting in consistency level calculations
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-20993
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20993
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Bootstrap and Decommission
>            Reporter: Runtian Liu
>            Priority: Normal
>
> h3. Background
> Cassandra includes pending nodes in the write quorum calculation to ensure 
> consistency level guarantees are maintained during topology changes (e.g., 
> bootstrap, node replacement). This is implemented in 
> ConsistencyLevel.blockForWrite() where pending replicas are added to the 
> required blockFor count.
> h3. Current Behavior
> When a node is bootstrapping or being added to the cluster, all pending 
> replicas are unconditionally added to the blockFor requirement, regardless of 
> whether they are replacement nodes or new capacity additions.Example with 
> RF=3, CL=LOCAL_QUORUM: * New node bootstrap (increasing capacity): blockFor = 
> 2 (quorum) → with pending = 3 
>  * Node replacement (old node being replaced): blockFor = 2 → with pending = 
> 3 
> h3. Problem Statement
> During a node replacement scenario where a pending node is joining to replace 
> an existing node:Current situation: * Natural replicas: A, B, C (C is being 
> replaced)
>  * Pending replica: D (replacement for C)
>  * Live replicas: A, B, D (3 live nodes after C goes down)
>  * Required ACKs: 3 (base quorum of 2 + 1 pending)
> The issue: # Although we have 3 live replicas capable of responding, the 
> blockFor requirement of 3 is artificially inflated
>  # The pending replica (D) is typically busy during bootstrap and may respond 
> slowly
>  # Even though we've received ACKs from A and B (satisfying quorum of natural 
> replicas), the write blocks waiting for an ACK from the slow pending replica D
>  # This unnecessary dependency on the replacement pending replica's 
> responsiveness causes write timeouts during node replacement operations
> h3. Root Cause
> As the original comment in the ticket 
> https://issues.apache.org/jira/browse/CASSANDRA-833 mentioned:
> {quote}we want to satisfy CL for both the pre- and post-bootstrap nodes (in 
> case bootstrap aborts). This requires treating the old/new range owner as a 
> unit: both D *and* C need to accept the write for it to count towards CL. So 
> rather than considering
> Unknown macro: \{A, B, C, D}
> we should consider
> Unknown macro: \{A, B, (C, D)}
> This is a lot of complexity to introduce.
> {quote}
> the current implementation is a "simplification" idea.
> The current implementation conflates natural and pending replicas into a 
> single blockFor calculation: * blockFor = quorum(all replicas including 
> pending)
> However, the two replica types serve different purposes:
>       * Natural replicas: The authoritative owners of the data (by 
> replication strategy)
>  * Pending replicas: Temporary nodes either adding capacity (new bootstrap) 
> or replacing an existing node (node replacement)
> The current approach treats all pending nodes identically, but they should be 
> handled differently based on their topology role.
> h3. Proposed Solution
> For quorum consistency level:
> Decouple blockFor calculation into separate requirements for natural and 
> pending replicas:For normal operations and new bootstrap (Keep current 
> behavior as the implementation might be complex and we can do later): * 
> blockFor = quorum(natural replicas) + all pending replicas
>  * This ensures: quorum of natural replicas respond, PLUS any pending nodes 
> respond
>  * Protects against topology change cancellation
> {color:#de350b}For node replacement:{color} * {color:#de350b}blockFor = 
> quorum(natural replicas) only{color}
>  * {color:#de350b}This ensures: only quorum of natural replicas need to 
> respond{color}
>  * {color:#de350b}Pending replacement node responds when available but is not 
> required{color}
>  * {color:#de350b}Eliminates unnecessary dependency on busy pending replica 
> during replacement{color}
> h3. Implementation
> For normal node bootstrap, we will block as what we are doing now as it is 
> too complicate to determine the unit of a (new owner and old owner).
> For node replacements, as the old node is down, the (new node + old node) 
> unit cannot response the write anyway. We will just block for the consistency 
> level(natural replicas).
> Only natural replicas' responses will be counted. Note, the response from the 
> old node should not be counted as the old node should be shut down. If 
> somehow it was able to ack writes, we should still ignore it.
> The implementation should not be complicated as the node replacement 
> relationship is clear.
>  
> As this is changing the behavior of the writes when there are pending nodes, 
> the change will be added with a new config and by default the feature will be 
> disabled.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (CASSANDRA-20993) Unnecessary write timeouts during node replacement due to conservative pending node accounting in consistency level calculations

Reply via email to