[DISCUSS] Updating blockFor Behavior During Node Replacement to Improve Availability and Latency

Runtian Liu Tue, 25 Nov 2025 16:45:28 -0800

Hi everyone,

I’d like to start a discussion about adjusting how Cassandra calculates
blockFor during node replacements. The JIRA tracking this proposal is here:
https://issues.apache.org/jira/browse/CASSANDRA-20993

Problem Background

Today, during a node replacement, the coordinator always adds the pending
replica to the number of acknowledgments it requires (blockFor). For
example, with RF=3, LOCAL_QUORUM, and one pending replica, the coordinator
waits for three responses instead of two. Since replacement nodes are often
bootstrapping and slow to respond, this can result in write timeouts or
increased write latency, even though the client only requested
acknowledgments from a quorum of the natural replicas.
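
As a rough illustration of the arithmetic above, here is a minimal sketch in
Python. It is not Cassandra's code; the function names quorum_for and
block_for_today are hypothetical and only model the "quorum plus pending
replicas" rule described in this message.

```python
def quorum_for(replication_factor: int) -> int:
    """Quorum size for a replication factor: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def block_for_today(replication_factor: int, pending_replicas: int) -> int:
    """Acknowledgments required today: the quorum plus every pending replica.

    Hypothetical helper modelling the current behavior described above.
    """
    return quorum_for(replication_factor) + pending_replicas


if __name__ == "__main__":
    print(quorum_for(3))          # 2: what the client asked for
    print(block_for_today(3, 1))  # 3: what the coordinator waits for
```

With RF=3 and one replacement in flight, the write therefore needs an
acknowledgment from the bootstrapping node or from all three natural replicas.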

This behavior effectively breaks the client contract by requiring more
responses than the specified consistency level.

Proposed Change

For replacement scenarios only, exclude pending replicas from blockFor and
require acknowledgments solely from natural replicas. Pending nodes will
still receive writes, but their responses will not count toward satisfying
the consistency level.

Responses from the node being replaced would also be ignored. Although it
is uncommon for a replaced node to become reachable again, adding this
safeguard avoids ambiguity and ensures correctness if that situation occurs.
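
One way to picture the proposed counting rule is the sketch below. It is an
illustration under assumptions, not the code in the PR: the class
ReplacementWriteTracker and its fields are hypothetical, and it assumes the
node being replaced is still listed among the natural replicas while the
replacement node is the pending replica.

```python
from dataclasses import dataclass, field


@dataclass
class ReplacementWriteTracker:
    """Hypothetical tracker: counts acks from natural replicas only."""
    natural: frozenset   # natural replicas, including the node being replaced
    pending: frozenset   # replacement nodes; they still receive the write
    replaced: frozenset  # nodes being replaced; their responses are ignored
    block_for: int       # quorum of natural replicas, pending not added
    acks: set = field(default_factory=set)

    def on_response(self, node: str) -> None:
        # Pending and replaced nodes never count toward the consistency level.
        if node in self.pending or node in self.replaced:
            return
        if node in self.natural:
            self.acks.add(node)

    def is_satisfied(self) -> bool:
        return len(self.acks) >= self.block_for


if __name__ == "__main__":
    tracker = ReplacementWriteTracker(
        natural=frozenset({"a", "b", "c"}),
        pending=frozenset({"d"}),
        replaced=frozenset({"c"}),
        block_for=2,
    )
    for node in ("d", "c", "a", "b"):
        tracker.on_response(node)
    print(tracker.is_satisfied())  # True: only "a" and "b" were counted
```

In this model, a fast response from the pending node "d" cannot stand in for
a natural replica, and a slow one cannot delay the write.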

This change would be disabled by default and controlled via a feature flag
to avoid affecting existing deployments.

In my view, this behavior is effectively a bug because the coordinator
waits for more acknowledgments than the client requested, leading to
avoidable failures or latency. Since the issue affects correctness from the
client perspective rather than introducing new semantics, it would be
valuable to include this fix in the 4.x branches as well, with the behavior
disabled by default where needed.

Motivation

This change:

   - Prevents unnecessary write timeouts during replacements
   - Reduces write latency by eliminating dependence on a busy pending replica
   - Aligns server behavior with client expectations

Current Status

A PR for 4.1 is available here for review:
https://github.com/apache/cassandra/pull/4494

Feedback is welcome on both the implementation and the approach.

Next Steps

I’d appreciate input on:

   1. Any correctness concerns for replacement scenarios
   2. Whether a feature-flagged approach is acceptable

Thanks in advance for your feedback,
Runtian