This feature seems very similar to what Blake is proposing in CEP-58:
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-58:+Satellite+Datacenters
(Ongoing discussion thread:
https://lists.apache.org/thread/pk9zdf6qv3bvq55j2wnoh7g0lv89yllj). Outside
of cost (which, admittedly, is a big factor), do you see other reasons why
you wouldn't want to use satellite data centers to achieve server-side
failover?

That said, the proposal to set consistency levels server-side is
interesting to me and something I've even done a quick POC on. Even beyond
enabling the kind of failover you’re proposing, I see value in shifting
consistency level selection from the application developers using the
cluster to the operators managing the cluster when these are distinct
groups. It feels like a natural evolution of the existing read/write
consistency-level guardrails and moves the conversation from “you’re
allowed to use CL X for reads and Y for writes; make sure to set them
correctly in your apps” to “don’t worry about selecting consistency levels;
the ops team will manage that centrally.”
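
To make that concrete: the guardrails I'm referring to are the existing
cassandra.yaml consistency-level guardrail settings (names from memory, so
double-check them), and the "centrally managed" direction could look like a
purely hypothetical override knob layered on top, for illustration only:

read_consistency_levels_disallowed: [ANY, ONE]    # existing guardrail
write_consistency_levels_disallowed: [ANY]        # existing guardrail

# Hypothetical, does not exist today: the server picks the consistency level
# centrally instead of just restricting what clients may request.
# default_read_consistency_level: LOCAL_QUORUM
# default_write_consistency_level: LOCAL_QUORUM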


On Tue, Dec 2, 2025 at 6:38 PM Patrick McFadin <[email protected]> wrote:

> To be specific, I was advocating a Layer 7/DNS failover out of the affected
> DC. Something along this line:
> https://cloud.google.com/blog/products/networking/introducing-automated-failover-for-private-workloads-using-cloud-dns-routing-policies-with-health-checks
>
>
>
> On Dec 2, 2025, at 9:33 AM, Jon Haddad <[email protected]> wrote:
>
> > This proposal targets the use case in which only a few nodes in the
> region fail, but not all.
>
> If you're able to handle inconsistent results, why not use LOCAL_ONE? If
> you have a preference for consistency but can tolerate some inconsistency
> sometimes, DowngradingConsistencyRetryPolicy [1] might work for you. It's a
> bit contentious in the community (some folks absolutely hate its
> existence), but I've had a few use cases where it makes sense.
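>
> If you go that route, wiring it up in the 4.x driver looks roughly like the
> sketch below (assuming the DowngradingConsistencyRetryPolicy class from [1]
> is available in your driver version; illustrative, not an endorsement):
>
> import com.datastax.oss.driver.api.core.CqlSession;
> import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
> import com.datastax.oss.driver.api.core.config.DriverConfigLoader;
>
> public class DowngradingRetryExample {
>   public static void main(String[] args) {
>     // Register the downgrading retry policy via the programmatic config loader.
>     DriverConfigLoader loader = DriverConfigLoader.programmaticBuilder()
>         .withString(DefaultDriverOption.RETRY_POLICY_CLASS,
>             "com.datastax.oss.driver.api.core.retry.DowngradingConsistencyRetryPolicy")
>         .build();
>     try (CqlSession session = CqlSession.builder().withConfigLoader(loader).build()) {
>       // Requests start at the configured CL; when too few replicas respond,
>       // the policy retries at a lower consistency level instead of failing.
>       session.execute("SELECT release_version FROM system.local");
>     }
>   }
> }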
>
> Or is this the scenario where 3 nodes fail and they happen to contain the same
> token range? I'm curious what the math looks like. Are you using 256 vnodes
> by any chance?
>
> If you execute a query and all 3 replicas in your local DC are down,
> currently the driver throws an exception. I don't see how putting this
> logic on the server helps, since the driver would need to be aware of
> REMOTE_QUORUM to route to the remote replicas anyway. At that point you're
> just adding an extra hop between the app and the database.
>
> If you have a database outage so severe that all 3 replicas are
> unavailable, what's the state of the rest of the cluster? I'm genuinely
> curious how often you'd have a case where 3 nodes with intersecting ranges
> are offline and it's not a complete DC failure.
>
> I agree with Patrick and Stefan here. If you want to stay in the DC, doing
> it at the driver level gives you complete control. If your infra is failing
> that spectacularly, evacuate.
>
> I also wonder if witness replicas would help you, if you're seeing that
> much failure.
>
> [1]
> https://docs.datastax.com/en/drivers/java/4.17/com/datastax/oss/driver/api/core/retry/DowngradingConsistencyRetryPolicy.html
>
>
> On Mon, Dec 1, 2025 at 7:37 PM Patrick McFadin <[email protected]> wrote:
>
>> In my experience it usually makes more sense to evacuate out of a
>> degraded DC at the application layer rather than having Cassandra silently
>> fail over to a remote quorum. When a DC is in a bad state, it’s rarely
>> “just a couple of Cassandra replicas are down.” The same underlying problem
>> (network partitions, bad routing, DNS issues, load balancer problems, AZ
>> failures, power incidents, I could go on...) usually affects much more than
>> just the database.
>>
>> What I can see is that REMOTE_QUORUM will behave like a dynamic
>> reconfiguration of the replica group for a given request. That’s
>> fundamentally different from today’s NTS contract, where the replica set
>> per DC is static. That would open up all sorts of new ops headaches.
>>  - What happens to replicas in the degraded DC? Would hints get stored?
>>  - I'm not sure this would work with Paxos, since dynamic
>> reconfiguration isn't supported. Two leaders, possibly?
>>  - Which now brings up Accord and TCM considerations...
>>
>> So my strong bias is, once a DC is degraded enough that Cassandra can’t
>> achieve LOCAL_QUORUM, that DC is probably not a place you want to keep
>> serving user traffic from. At that point, I’d rather have the application
>> tier evacuate and talk to a healthy DC, instead of the database internally
>> changing its consistency behavior under the covers. It's just a ton of work
>> for a problem with a simpler solution.
>>
>> Patrick
>>
>>
>> On Mon, Dec 1, 2025 at 7:07 PM Jaydeep Chovatia <
>> [email protected]> wrote:
>>
>>> >For the record, there is a whole section about this here (sorry for
>>> pasting the same link twice in my original mail)
>>>
>>> Thanks, Stefan, for sharing the details about the client-side failover
>>> mechanism. This technique is definitely useful when the entire region is
>>> down.
>>>
>>> This proposal targets the use case in which only a few nodes in the
>>> region fail, but not all. Use cases that prioritize availability over
>>> latency can opt into this behavior, and the server will automatically
>>> fulfill those requests from the remote region.
>>>
>>> Jaydeep
>>>
>>> On Mon, Dec 1, 2025 at 2:58 PM Štefan Miklošovič <[email protected]>
>>> wrote:
>>>
>>>> For the record, there is a whole section about this here (sorry for
>>>> pasting the same link twice in my original mail)
>>>>
>>>>
>>>> https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/load_balancing/index.html#cross-datacenter-failover
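>>>>
>>>> For example, with the built-in policy the failover can be enabled from
>>>> configuration alone, roughly like this (option names from memory, so
>>>> please verify them against that page):
>>>>
>>>> datastax-java-driver {
>>>>   advanced.load-balancing-policy.dc-failover {
>>>>     # allow up to 2 nodes per remote DC in query plans when local nodes fail
>>>>     max-nodes-per-remote-dc = 2
>>>>     # needed if the application uses LOCAL_QUORUM / LOCAL_ONE
>>>>     allow-for-local-consistency-levels = true
>>>>   }
>>>> }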
>>>>
>>>> Interesting to see your perspective; I can see how doing something
>>>> without application changes might seem appealing.
>>>>
>>>> I wonder what others think about this.
>>>>
>>>> Regards
>>>>
>>>> On Mon, Dec 1, 2025 at 11:39 PM Qc L <[email protected]> wrote:
>>>>
>>>>> Hi Stefan,
>>>>>
>>>>> Thanks for raising the question about using a custom
>>>>> LoadBalancingPolicy! Remote quorum and LB-based failover aim to solve a
>>>>> similar problem (maintaining availability during datacenter degradation),
>>>>> so I did a comparison of the server-side remote quorum solution versus
>>>>> relying on driver-side logic.
>>>>>
>>>>>
>>>>>    - For our use cases, we found that client-side failover is
>>>>>    expensive to operate and leads to fragmented behavior during
>>>>>    incidents. We also prefer failover to be triggered only when needed,
>>>>>    not automatically. A server-side mechanism gives us controlled,
>>>>>    predictable behavior and makes the system easier to operate.
>>>>>    - Remote quorum is implemented on the server, where the
>>>>>    coordinator can intentionally form a quorum using replicas in a backup
>>>>>    region when the local region cannot satisfy LOCAL_QUORUM or
>>>>>    EACH_QUORUM. The database determines the fallback region, how
>>>>>    replicas are selected, and ensures consistency guarantees still hold.
>>>>>
>>>>>
>>>>> By keeping this logic internal to the server, we provide a unified,
>>>>> consistent behavior across all clients without requiring any application
>>>>> changes.
>>>>>
>>>>> Happy to discuss further if helpful.
>>>>>
>>>>> On Mon, Dec 1, 2025 at 12:57 PM Štefan Miklošovič <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> before going deeper into your proposal ... I just have to ask: have
>>>>>> you tried e.g. a custom LoadBalancingPolicy in the driver, if you use one?
>>>>>>
>>>>>> I can imagine that if the driver detects a node to be down, then based
>>>>>> on its "distance" / the DC it belongs to, you might start to create a
>>>>>> different query plan which talks to a remote DC, and so on.
>>>>>>
>>>>>>
>>>>>> https://docs.datastax.com/en/drivers/java/4.2/com/datastax/oss/driver/api/core/loadbalancing/LoadBalancingPolicy.html
>>>>>>
>>>>>>
>>>>>> https://docs.datastax.com/en/drivers/java/4.2/com/datastax/oss/driver/api/core/loadbalancing/LoadBalancingPolicy.html
>>>>>>
>>>>>> On Mon, Dec 1, 2025 at 8:54 PM Qc L <[email protected]> wrote:
>>>>>>
>>>>>>> Hello team,
>>>>>>>
>>>>>>> I’d like to propose adding a REMOTE_QUORUM consistency level to
>>>>>>> Cassandra for handling simultaneous host unavailability in the local
>>>>>>> data center when it cannot achieve the required quorum.
>>>>>>>
>>>>>>> Background
>>>>>>> NetworkTopologyStrategy is the most commonly used strategy at Uber,
>>>>>>> and we use LOCAL_QUORUM for reads/writes in many use cases. Our Cassandra
>>>>>>> deployment in each data center currently relies on a majority of replicas
>>>>>>> being healthy to consistently achieve local quorum.
>>>>>>>
>>>>>>> Current behavior
>>>>>>> When a local data center in a Cassandra deployment experiences
>>>>>>> outages, network isolation, or maintenance events, the EACH_QUORUM /
>>>>>>> LOCAL_QUORUM consistency levels will fail for both reads and writes if
>>>>>>> enough replicas in the local data center are unavailable. In this
>>>>>>> configuration, simultaneous host unavailability can temporarily prevent
>>>>>>> the cluster from reaching the required quorum for reads and writes. For
>>>>>>> applications that require high availability and a seamless user
>>>>>>> experience, this can lead to service downtime and a noticeable drop in
>>>>>>> overall availability. See attached Figure 1.
>>>>>>>
>>>>>>> Proposed Solution
>>>>>>> To prevent this issue and ensure a seamless user experience, we can
>>>>>>> use the Remote Quorum consistency level as a fallback mechanism in
>>>>>>> scenarios where local replicas are unavailable. Remote Quorum in Cassandra
>>>>>>> refers to a read or write operation that achieves quorum (a majority of
>>>>>>> replicas) in the remote data center, rather than relying solely on replicas
>>>>>>> within the local data center. See attached Figure 2.
>>>>>>>
>>>>>>> We will add a feature to override the read/write consistency level on
>>>>>>> the server side. When local replicas are not available, we will override
>>>>>>> the server-side write consistency level from EACH_QUORUM to REMOTE_QUORUM.
>>>>>>> Note that implementing this change on the client side would require
>>>>>>> protocol changes in CQL, so we only add it on the server side, where it
>>>>>>> is used internally.
>>>>>>>
>>>>>>> For example, given the following Cassandra setup:
>>>>>>>
>>>>>>> CREATE KEYSPACE ks WITH REPLICATION = {
>>>>>>>   'class': 'NetworkTopologyStrategy',
>>>>>>>   'cluster1': 3,
>>>>>>>   'cluster2': 3,
>>>>>>>   'cluster3': 3
>>>>>>> };
>>>>>>>
>>>>>>> The selected approach for this design is to explicitly configure a
>>>>>>> backup data center mapping for the local data center, where each data
>>>>>>> center defines its preferred failover target. For example:
>>>>>>>
>>>>>>> remote_quorum_target_data_center:
>>>>>>>   cluster1: cluster2
>>>>>>>   cluster2: cluster3
>>>>>>>   cluster3: cluster1
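>>>>>>>
>>>>>>> To make the intended fallback behavior concrete, here is a hypothetical
>>>>>>> sketch of how the coordinator would pick the quorum data center
>>>>>>> (illustrative names only, not actual Cassandra internals):
>>>>>>>
>>>>>>> import java.util.Map;
>>>>>>>
>>>>>>> public class RemoteQuorumFallbackSketch {
>>>>>>>     // Mirrors the remote_quorum_target_data_center mapping above.
>>>>>>>     static final Map<String, String> BACKUP_DC = Map.of(
>>>>>>>         "cluster1", "cluster2",
>>>>>>>         "cluster2", "cluster3",
>>>>>>>         "cluster3", "cluster1");
>>>>>>>
>>>>>>>     // Returns the DC in which the coordinator should form the quorum.
>>>>>>>     static String chooseQuorumDc(String localDc, int liveLocalReplicas, int rf) {
>>>>>>>         int quorum = rf / 2 + 1;
>>>>>>>         if (liveLocalReplicas >= quorum)
>>>>>>>             return localDc;        // normal LOCAL_QUORUM behavior
>>>>>>>         // Only when the operator has enabled the override: fall back to
>>>>>>>         // the configured backup DC and form the quorum there instead.
>>>>>>>         return BACKUP_DC.get(localDc);
>>>>>>>     }
>>>>>>>
>>>>>>>     public static void main(String[] args) {
>>>>>>>         // RF=3 in cluster1 but only 1 replica alive: a quorum of 2 is
>>>>>>>         // unreachable locally, so the request targets cluster2.
>>>>>>>         System.out.println(chooseQuorumDc("cluster1", 1, 3));
>>>>>>>     }
>>>>>>> }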
>>>>>>>
>>>>>>> Implementation
>>>>>>> We propose the following features in Cassandra to address data center
>>>>>>> failure scenarios:
>>>>>>>
>>>>>>>    1. Introduce a new consistency level called REMOTE_QUORUM.
>>>>>>>    2. Add a feature to override the read/write consistency level on the
>>>>>>>    server side (controlled by a feature flag), with a nodetool command
>>>>>>>    to turn the server-side failover on and off.
>>>>>>>
>>>>>>> Why remote quorum is useful
>>>>>>> As shown in Figure 2, with a data center failure in one of the data
>>>>>>> centers, LOCAL_QUORUM and EACH_QUORUM will fail since two replicas are
>>>>>>> unavailable and the quorum requirement cannot be met in the local data
>>>>>>> center.
>>>>>>>
>>>>>>> During the incident, we can use nodetool to enable failover to remote
>>>>>>> quorum. With failover enabled, reads and writes can fail over to the
>>>>>>> remote data center, which avoids the availability drop.
>>>>>>>
>>>>>>> Any thoughts or concerns?
>>>>>>> Thanks in advance for your feedback,
>>>>>>>
>>>>>>> Qiaochu
>>>>>>>
>>>>>>>
>
