Runtian Liu created CASSANDRA-21200:
---------------------------------------
Summary: Replica Slot Grouping: Prevent write availability
degradation during topology changes
Key: CASSANDRA-21200
URL: https://issues.apache.org/jira/browse/CASSANDRA-21200
Project: Apache Cassandra
Issue Type: New Feature
Components: Consistency/Coordination
Reporter: Runtian Liu
Assignee: Runtian Liu
Summary
During topology changes (bootstrap, decommission, move, replacement), Cassandra
adds pending replicas to the write path and inflates {{blockFor}} by
{{{}pending.size(){}}}. This means writes must wait for acknowledgments from
both natural and pending replicas before succeeding. If the pending replica is
slow or unresponsive (common during streaming-heavy transitions), writes time
out even though the natural quorum is fully available.
This issue affects both regular writes and Lightweight Transactions / Paxos V2.
Problem
Consider a 3-node cluster with RF=3 and QUORUM writes ({{{}blockFor = 2{}}}).
When a 4th node begins bootstrapping:
* The pending replica is added to the write set, inflating {{blockFor}} to 3
* Writes now require 3 acknowledgments instead of 2
* If the bootstrapping node is busy streaming and responds slowly, writes time
out
* The cluster effectively loses write availability during what should be a
routine operation
The same issue occurs during node replacement, decommission, and range
movement. The fundamental problem is that {{blockFor}} inflation treats the
pending replica as an independent additional requirement, when logically it is
_transitioning into_ an existing replica's slot.
Proposed Solution: Replica Slot Grouping
Instead of inflating {{{}blockFor{}}}, group endpoints that are transitioning
the same logical replica position into a single "slot":
* A stable slot contains a single endpoint (no topology change in progress).
Requires 1 ack.
* A transitioning slot contains two endpoints (e.g., a leaving node and its
replacement, or a bootstrapping node and the existing owner). Requires acks
from all members (both old and new) for the slot to count as satisfied.
{{blockFor}} stays at the base quorum level (no inflation from pending
replicas). The quorum is counted in terms of slots rather than individual
endpoints.
Example: In the 3-node + 1 bootstrapping scenario above:
* 3 slots total: 2 stable (nodes that are unaffected) + 1 transitioning
(existing owner + bootstrapping node)
* QUORUM still requires 2 slots
* If the bootstrapping node is slow, only 2 stable slots respond — that's 2
slots, meeting quorum
* If the bootstrapping node responds, the transitioning slot is satisfied,
also meeting quorum
* In both cases, writes succeed without timeout
Correctness properties:
* Durability is maintained: when the transitioning slot is satisfied, writes
reach both old and new replicas
* Consistency is maintained: quorum intersection properties are preserved
because slot-based quorum >= natural quorum
* Availability is improved: writes no longer block on slow/unresponsive
pending replicas
Implementation across branches: TCM vs pre-TCM
In trunk (with TCM), the Transient Cluster Metadata framework already maintains
explicit transition state for each replica — the information about which
endpoints are joining, leaving, or replacing is directly available in the
cluster metadata. The slot grouping logic can consume this transition
information directly without any additional calculation.
In pre-TCM branches (4.x, 5.x), this transition information does not exist as a
first-class concept. The implementation needs to derive it by analyzing
{{TokenMetadata}} — computing which pending endpoints map to which existing
endpoints based on bootstrap tokens, leaving tokens, and the replication
strategy. This calculation is performed on topology changes and cached per
keyspace, so the hot write path only performs a map lookup.
Feature flag:
* {{replica_slot_grouping_enabled}} (cassandra.yaml, runtime-toggleable via
JMX)
* Default: disabled (preserves existing behavior)
* When disabled: zero code path changes, no performance impact
Compatibility:
* Purely additive; no wire protocol or schema changes
* Feature flag ensures safe rollout and rollback
* Can be backported to 4.x and 5.x
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]