To be specific, I was advocating a Layer 7/DNS failover out of the affected DC. Something along these lines: https://cloud.google.com/blog/products/networking/introducing-automated-failover-for-private-workloads-using-cloud-dns-routing-policies-with-health-checks
> On Dec 2, 2025, at 9:33 AM, Jon Haddad <[email protected]> wrote:
>
> > This proposal targets the use case in which only a few nodes in the region fail, but not all.
>
> If you're able to handle inconsistent results, why not use LOCAL_ONE? If you have a preference for consistency but can tolerate some inconsistency sometimes, DowngradingConsistencyRetryPolicy [1] might work for you. It's a bit contentious in the community (some folks absolutely hate its existence), but I've had a few use cases where it makes sense.
>
> Or is this the scenario where 3 nodes fail and they happen to contain the same token range? I'm curious what the math looks like. Are you using 256 vnodes by any chance?
>
> If you execute a query and all 3 replicas in your local DC are down, currently the driver throws an exception. I don't see how putting this logic on the server helps, since the driver would need to be aware of REMOTE_QUORUM to route to the remote replicas anyway. At that point you're just adding an extra hop between the app and the database.
>
> If you have a database outage so severe that all 3 replicas are unavailable, what's the state of the rest of the cluster? I'm genuinely curious how often you'd have a case where 3 nodes with intersecting ranges are offline and it's not a complete DC failure.
>
> I agree with Patrick and Stefan here. If you want to stay in the DC, doing it at the driver level gives you complete control. If your infra is failing that spectacularly, evacuate.
>
> I also wonder if witness replicas would help you, if you're seeing that much failure.
>
> [1] https://docs.datastax.com/en/drivers/java/4.17/com/datastax/oss/driver/api/core/retry/DowngradingConsistencyRetryPolicy.html
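For concreteness, below is a minimal sketch of the per-request downgrade Jon describes, assuming the 4.x Java driver; the retry policy he links automates the same idea across all requests. The keyspace/table in the query are placeholders.

import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;
import com.datastax.oss.driver.api.core.servererrors.UnavailableException;

public final class DowngradeOnUnavailable {

  // Try LOCAL_QUORUM first; if the coordinator reports that not enough local
  // replicas are alive, retry the same statement once at LOCAL_ONE.
  static ResultSet readWithFallback(CqlSession session, SimpleStatement stmt) {
    try {
      return session.execute(stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM));
    } catch (UnavailableException e) {
      // Explicitly accept weaker consistency for this request instead of failing it.
      return session.execute(stmt.setConsistencyLevel(ConsistencyLevel.LOCAL_ONE));
    }
  }

  public static void main(String[] args) {
    try (CqlSession session = CqlSession.builder().build()) {
      // "ks.tbl" is a placeholder table.
      SimpleStatement query = SimpleStatement.newInstance("SELECT * FROM ks.tbl WHERE id = 1");
      readWithFallback(session, query).forEach(row -> System.out.println(row.getFormattedContents()));
    }
  }
}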
> On Mon, Dec 1, 2025 at 7:37 PM Patrick McFadin <[email protected] <mailto:[email protected]>> wrote:
>> In my experience it usually makes more sense to evacuate out of a degraded DC at the application layer rather than having Cassandra silently fail over to a remote quorum. When a DC is in a bad state, it's rarely "just a couple of Cassandra replicas are down." The same underlying problem (network partitions, bad routing, DNS issues, load balancer problems, AZ failures, power incidents, I could go on...) is usually hitting a lot more than Cassandra.
>>
>> What I can see is that REMOTE_QUORUM will behave like a dynamic reconfiguration of the replica group for a given request. That's fundamentally different from today's NTS contract, where the replica set per DC is static. That would open up all sorts of new ops headaches.
>> - What happens to replicas in the degraded DC? Would hints get stored?
>> - I'm not sure this would work with Paxos, since dynamic reconfiguration isn't supported. Two leaders, possibly?
>> - Which now brings up Accord and TCM considerations...
>>
>> So my strong bias is, once a DC is degraded enough that Cassandra can't achieve LOCAL_QUORUM, that DC is probably not a place you want to keep serving user traffic from. At that point, I'd rather have the application tier evacuate and talk to a healthy DC, instead of the database internally changing its consistency behavior under the covers. It's just a ton of work for a problem with a simpler solution.
>>
>> Patrick
>>
>>
>> On Mon, Dec 1, 2025 at 7:07 PM Jaydeep Chovatia <[email protected] <mailto:[email protected]>> wrote:
>>> > For the record, there is a whole section about this here (sorry for pasting the same link twice in my original mail)
>>>
>>> Thanks, Stefan, for sharing the details about the client-side failover mechanism. This technique is definitely useful when the entire region is down.
>>>
>>> This proposal targets the use case in which only a few nodes in the region fail, but not all. Use cases that prioritize availability over latency can opt in to this option, and the server will automatically fulfill those requests from the remote region.
>>>
>>> Jaydeep
>>>
>>> On Mon, Dec 1, 2025 at 2:58 PM Štefan Miklošovič <[email protected] <mailto:[email protected]>> wrote:
>>>> For the record, there is a whole section about this here (sorry for pasting the same link twice in my original mail)
>>>>
>>>> https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/load_balancing/index.html#cross-datacenter-failover
>>>>
>>>> Interesting to see your perspective, I can see how doing something without application changes might seem appealing.
>>>>
>>>> I wonder what others think about this.
>>>>
>>>> Regards
>>>>
>>>> On Mon, Dec 1, 2025 at 11:39 PM Qc L <[email protected] <mailto:[email protected]>> wrote:
>>>>> Hi Stefan,
>>>>>
>>>>> Thanks for raising the question about using a custom LoadBalancingPolicy! Remote quorum and LB-based failover aim to solve similar problems, maintaining availability during datacenter degradation, so I did a comparison study of the server-side remote quorum solution versus relying on driver-side logic.
>>>>>
>>>>> For our use cases, we found that client-side failover is expensive to operate and leads to fragmented behavior during incidents. We also prefer failover to be triggered only when needed, not automatically. A server-side mechanism gives us controlled, predictable behavior and makes the system easier to operate.
>>>>>
>>>>> Remote quorum is implemented on the server, where the coordinator can intentionally form a quorum using replicas in a backup region when the local region cannot satisfy LOCAL_QUORUM or EACH_QUORUM. The database determines the fallback region and how replicas are selected, and ensures consistency guarantees still hold.
>>>>>
>>>>> By keeping this logic internal to the server, we provide unified, consistent behavior across all clients without requiring any application changes.
>>>>>
>>>>> Happy to discuss further if helpful.
>>>>>
>>>>>
>>>>> On Mon, Dec 1, 2025 at 12:57 PM Štefan Miklošovič <[email protected] <mailto:[email protected]>> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> before going deeper into your proposal ... I just have to ask, have you tried e.g. a custom LoadBalancingPolicy in the driver, if you use one?
>>>>>>
>>>>>> I can imagine that if the driver detects a node to be down, then based on its "distance" / the DC it belongs to, you might start to create a different query plan which talks to the remote DC and similar.
>>>>>>
>>>>>> https://docs.datastax.com/en/drivers/java/4.2/com/datastax/oss/driver/api/core/loadbalancing/LoadBalancingPolicy.html
>>>>>>
>>>>>> https://docs.datastax.com/en/drivers/java/4.2/com/datastax/oss/driver/api/core/loadbalancing/LoadBalancingPolicy.html
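For reference, a rough sketch of enabling the driver-side cross-datacenter failover that Stefan's link describes, configured programmatically rather than via a custom policy. This assumes a Java driver recent enough (roughly 4.10+) to expose the dc-failover options; the exact option names and defaults should be checked against the driver's reference configuration, and "cluster1" / 3 remote nodes are placeholder values.

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
import com.datastax.oss.driver.api.core.config.DriverConfigLoader;

public final class CrossDcFailoverSession {
  public static void main(String[] args) {
    // Allow the default load balancing policy to add a few remote-DC nodes to
    // query plans when local nodes are down. Whether LOCAL_* consistency
    // levels are allowed to fail over is a separate, deliberate switch.
    DriverConfigLoader loader =
        DriverConfigLoader.programmaticBuilder()
            .withString(DefaultDriverOption.LOAD_BALANCING_LOCAL_DATACENTER, "cluster1")
            .withInt(DefaultDriverOption.LOAD_BALANCING_DC_FAILOVER_MAX_NODES_PER_REMOTE_DC, 3)
            .withBoolean(
                DefaultDriverOption.LOAD_BALANCING_DC_FAILOVER_ALLOW_FOR_LOCAL_CONSISTENCY_LEVELS,
                false)
            .build();

    // Contact points come from the rest of the configuration or the defaults.
    try (CqlSession session = CqlSession.builder().withConfigLoader(loader).build()) {
      System.out.println("Connected as " + session.getName());
    }
  }
}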
>>>>>> On Mon, Dec 1, 2025 at 8:54 PM Qc L <[email protected] <mailto:[email protected]>> wrote:
>>>>>>> Hello team,
>>>>>>>
>>>>>>> I'd like to propose adding a remote quorum consistency level to Cassandra for handling simultaneous host unavailability in the local data center that prevents the required quorum from being reached.
>>>>>>>
>>>>>>> Background
>>>>>>> NetworkTopologyStrategy is the most commonly used strategy at Uber, and we use LOCAL_QUORUM for reads and writes in many use cases. Our Cassandra deployment in each data center currently relies on a majority of replicas being healthy to consistently achieve local quorum.
>>>>>>>
>>>>>>> Current behavior
>>>>>>> When a local data center in a Cassandra deployment experiences outages, network isolation, or maintenance events, the EACH_QUORUM / LOCAL_QUORUM consistency levels will fail for both reads and writes if enough replicas in the local data center are unavailable. In this configuration, simultaneous host unavailability can temporarily prevent the cluster from reaching the required quorum for reads and writes. For applications that require high availability and a seamless user experience, this can lead to service downtime and a noticeable drop in overall availability. See attached Figure 1. <figure1.png>
>>>>>>>
>>>>>>> Proposed Solution
>>>>>>> To prevent this issue and ensure a seamless user experience, we can use the remote quorum consistency level as a fallback mechanism in scenarios where local replicas are unavailable. Remote quorum in Cassandra refers to a read or write operation that achieves quorum (a majority of replicas) in a remote data center, rather than relying solely on replicas within the local data center. See attached Figure 2. <figure2.png>
>>>>>>>
>>>>>>> We will add a feature to override the read/write consistency level on the server side. When local replicas are not available, the server will override the write consistency level from EACH_QUORUM to REMOTE_QUORUM. Note that implementing this change on the client side would require changes to the CQL native protocol, so we only add it on the server side, where it is used internally by the server.
>>>>>>>
>>>>>>> For example, given the following Cassandra setup:
>>>>>>> CREATE KEYSPACE ks WITH REPLICATION = {
>>>>>>>   'class': 'NetworkTopologyStrategy',
>>>>>>>   'cluster1': 3,
>>>>>>>   'cluster2': 3,
>>>>>>>   'cluster3': 3
>>>>>>> };
>>>>>>>
>>>>>>> The selected approach for this design is to explicitly configure a backup data center mapping for the local data center, where each data center defines its preferred failover target. For example:
>>>>>>> remote_quorum_target_data_center:
>>>>>>>   cluster1: cluster2
>>>>>>>   cluster2: cluster3
>>>>>>>   cluster3: cluster1
>>>>>>>
>>>>>>> Implementations
>>>>>>> We propose the following features for Cassandra to address data center failure scenarios:
>>>>>>> - Introduce a new consistency level called remote quorum.
>>>>>>> - A feature to override the read/write consistency level on the server side (this can be controlled by a feature flag).
>>>>>>> - A nodetool command to turn the server-side fallback on/off.
>>>>>>>
>>>>>>> Why remote quorum is useful
>>>>>>> As shown in Figure 2, when there is a data center failure in one of the data centers, local quorum and each quorum will fail, since two replicas are unavailable and the quorum requirement cannot be met in the local data center.
>>>>>>>
>>>>>>> During the incident, we can use nodetool to enable failover to remote quorum. With the failover enabled, reads and writes fail over to the remote data center, which avoids the availability drop.
>>>>>>>
>>>>>>> Any thoughts or concerns?
>>>>>>> Thanks in advance for your feedback,
>>>>>>>
>>>>>>> Qiaochu
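Purely as an illustration of the proposal above, here is a hypothetical sketch of the coordinator-side override decision using the backup data center mapping from the example. All class, method, and field names are invented for the sketch and do not correspond to actual Cassandra internals.

import java.util.Map;

// Hypothetical sketch of the proposed coordinator-side override; names are
// invented for illustration only.
final class RemoteQuorumFallback {

  enum Consistency { LOCAL_QUORUM, EACH_QUORUM, REMOTE_QUORUM }

  private final Map<String, String> backupDcByLocalDc; // e.g. cluster1 -> cluster2
  private volatile boolean fallbackEnabled;            // flipped by the operator, e.g. via nodetool

  RemoteQuorumFallback(Map<String, String> backupDcByLocalDc) {
    this.backupDcByLocalDc = backupDcByLocalDc;
  }

  void setFallbackEnabled(boolean enabled) {
    this.fallbackEnabled = enabled;
  }

  // Decide which consistency level the coordinator should actually use for a request.
  Consistency effectiveConsistency(Consistency requested,
                                   String localDc,
                                   int aliveLocalReplicas,
                                   int rfPerDc) {
    int localQuorum = rfPerDc / 2 + 1;                  // e.g. RF=3 per DC -> quorum of 2
    if (!fallbackEnabled || aliveLocalReplicas >= localQuorum) {
      return requested;                                 // normal path: no override
    }
    // The local DC cannot form a quorum: send the request to the configured backup DC.
    return backupDcByLocalDc.containsKey(localDc) ? Consistency.REMOTE_QUORUM : requested;
  }
}

With the example mapping and RF 3 per DC, a coordinator in cluster1 that sees only one live local replica and has the fallback enabled would serve the request at remote quorum against cluster2; with the fallback disabled it would fail the request exactly as today.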
