FYI, there's a healthy Slack thread discussing this here: https://the-asf.slack.com/archives/CK23JSY2K/p1762834946972609
From that, the remaining concerns (iiuc) are:
- cases where a replacing node is removed and we return the original being-replaced node to the cluster;
- cases where multiple nodes are replacing and gossip leaves coordinators seeing different states of some, or no, nodes as JOINING/NORMAL.

Given these concerns, and the possibility that operators are doing things in unexpected ways, the feature flag is warranted on non-trunk branches. But does trunk need the flag? It sounds like neither concern exists in trunk, but there was a desire to do it with more changes in trunk, which I'm not grokking…?

> On 3 Dec 2025, at 18:32, Runtian Liu <[email protected]> wrote:
>
> Hi all,
>
> Just bumping this thread in case it was missed the first time.
>
> I’ve updated CASSANDRA-20993 with a detailed Correctness / Safety section
> that explains why excluding the pending replacement node from blockFor during
> node replacement does not weaken read-after-write guarantees for any
> combination of write CL and read CL. The key point is that the effective
> number of natural replicas that must acknowledge a write (and be consulted
> for a read) is unchanged; we only stop inflating blockFor with the pending
> replacement.
>
> For example, in the common RF=3, QUORUM write + QUORUM read case, the proof
> shows that during a C → D replacement:
> • Every successful QUORUM write is still guaranteed to be stored on a
>   quorum of naturals (e.g., A and B), and
> • Every QUORUM read—both before and after the replacement completes—must
>   intersect {A, B}, so it always sees the latest value.
>
> The more general argument in the ticket covers all CL pairs and shows that
> the standard condition W_eff + R_eff > RF holds (or not) exactly as before;
> the change only removes unnecessary write timeouts when the pending
> replacement is slow.
> If you have concerns about the correctness argument, or think there are
> corner cases I’m missing (e.g., particular CL combinations or topology
> transitions), I’d really appreciate feedback on the JIRA or in this thread.
>
> Thanks,
> Runtian
>
> On Tue, Nov 25, 2025 at 4:44 PM Runtian Liu <[email protected]> wrote:
>
> Hi everyone,
>
> I’d like to start a discussion about adjusting how Cassandra calculates
> blockFor during node replacements. The JIRA tracking this proposal is here:
> https://issues.apache.org/jira/browse/CASSANDRA-20993
>
> Problem Background
>
> Today, during a replacement, the pending replica is always included when
> determining the required acknowledgments. For example, with RF=3 and
> LOCAL_QUORUM, the coordinator waits for three responses instead of two. Since
> replacement nodes are often bootstrapping and slow to respond, this can
> result in write timeouts or increased write latency—even though the client
> only requested acknowledgments from the natural replicas.
>
> This behavior effectively breaks the client contract by requiring more
> responses than the specified consistency level.
>
> Proposed Change
>
> For replacement scenarios only, exclude pending replicas from blockFor and
> require acknowledgments solely from natural replicas. Pending nodes will
> still receive writes, but their responses will not count toward satisfying
> the consistency level.
>
> Responses from the node being replaced would also be ignored. Although it is
> uncommon for a replaced node to become reachable again, adding this safeguard
> avoids ambiguity and ensures correctness if that situation occurs.
>
> This change would be disabled by default and controlled via a feature flag to
> avoid affecting existing deployments.
>
> In my view, this behavior is effectively a bug because the coordinator waits
> for more acknowledgments than the client requested, leading to avoidable
> failures or latency.
> Since the issue affects correctness from the client perspective rather than
> introducing new semantics, it would be valuable to include this fix in the
> 4.x branches as well, with the behavior disabled by default where needed.
>
> Motivation
>
> This change:
> • Prevents unnecessary write timeouts during replacements
> • Reduces write latency by eliminating dependence on a busy pending replica
> • Aligns server behavior with client expectations
>
> Current Status
>
> A PR for 4.1 is available here for review:
> https://github.com/apache/cassandra/pull/4494
>
> Feedback is welcome on both the implementation and the approach.
>
> Next Steps
>
> I’d appreciate input on:
> • Any correctness concerns for replacement scenarios
> • Whether a feature-flagged approach is acceptable
>
> Thanks in advance for your feedback,
> Runtian
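To make the arithmetic in the quoted proposal concrete, here's a minimal Python sketch. This is purely illustrative and not the actual Cassandra implementation; the `block_for` function and its parameters are hypothetical names I'm using to model the behaviour described above (CL acks inflated by the pending replacement today, versus the proposed exclusion), plus a brute-force check of the RF=3 quorum-intersection argument.

```python
from itertools import combinations

def block_for(cl_acks: int, pending_replacements: int, exclude_pending: bool) -> int:
    """Model of the coordinator's required-ack count during a replacement.

    cl_acks: acks the client's CL demands from natural replicas
             (e.g. 2 for QUORUM at RF=3).
    Today the pending replacement is added on top of that; the proposal
    stops inflating blockFor for replacement scenarios.
    """
    if exclude_pending:
        return cl_acks
    return cl_acks + pending_replacements

# RF=3, QUORUM: the client asks for 2 acks.
assert block_for(2, pending_replacements=1, exclude_pending=False) == 3  # current behaviour
assert block_for(2, pending_replacements=1, exclude_pending=True) == 2   # proposed behaviour

# Quorum-intersection check for a C -> D replacement: every size-2 write
# quorum and every size-2 read quorum over the naturals {A, B, C} share at
# least one replica, so W_eff + R_eff > RF (2 + 2 > 3) still guarantees
# read-after-write with blockFor = 2.
naturals = {"A", "B", "C"}
for write_quorum in combinations(sorted(naturals), 2):
    for read_quorum in combinations(sorted(naturals), 2):
        assert set(write_quorum) & set(read_quorum), "quorums must overlap"
```

The sketch only captures the counting argument from the thread; it says nothing about gossip-visibility edge cases, which is exactly where the remaining concerns above lie.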
