[ 
https://issues.apache.org/jira/browse/CASSANDRA-21049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qiaochu Liu updated CASSANDRA-21049:
------------------------------------
    Description: 
h1. *Background*

NetworkTopologyStrategy is the most commonly used replication strategy at Uber, and we use 
LOCAL_QUORUM for reads and writes in many use cases. Our Cassandra deployment in each 
region currently relies on a majority of local replicas being healthy to consistently 
achieve local quorum. 
h1. *Current behavior*

When a local region in a Cassandra deployment experiences outages, network 
isolation, or maintenance events, reads and writes at EACH_QUORUM / LOCAL_QUORUM 
consistency will fail if enough replicas in that local region are unavailable. In 
this configuration, the simultaneous unavailability of multiple hosts can 
temporarily prevent the cluster from reaching the required quorum for reads and 
writes. For example, with a replication factor of 3 in the local region, local 
quorum requires 2 replicas, so losing 2 of the 3 local replicas blocks every 
LOCAL_QUORUM read and write even when remote regions are fully healthy. For 
applications that require high availability and a seamless user experience, this 
can lead to service downtime and a noticeable drop in overall availability.
h1. *Proposed Solution*

To prevent this issue and ensure a seamless user experience, we can use the 
*Remote Quorum* consistency level as a fallback mechanism in scenarios where 
local replicas are unavailable. Remote Quorum in Cassandra refers to a read or 
write operation that achieves quorum (a majority of replicas) across remote 
regions, rather than relying solely on replicas within the local region. 

The selected approach for this design is to explicitly configure a backup-region 
mapping, where each region defines its preferred failover target. For example:
{noformat}
remote_quorum_target_data_center:
  cluster1: cluster2
  cluster2: cluster3
  cluster3: cluster1
{noformat}
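
As an illustration of how such a mapping would be consumed, the sketch below resolves 
the failover target for the local region. It is a minimal example only; the class name 
and helper are assumptions for this ticket, not existing Cassandra code or configuration.
{code:java}
import java.util.Map;
import java.util.Optional;

// Minimal sketch: resolve the configured failover region for the local one.
// The class name and the mapping option are assumptions for illustration,
// not existing Cassandra code or configuration.
public final class RemoteQuorumTargets
{
    private final Map<String, String> targetByRegion;

    public RemoteQuorumTargets(Map<String, String> targetByRegion)
    {
        this.targetByRegion = Map.copyOf(targetByRegion);
    }

    /** The configured failover region for the given local region, if any. */
    public Optional<String> targetFor(String localRegion)
    {
        return Optional.ofNullable(targetByRegion.get(localRegion));
    }

    public static void main(String[] args)
    {
        RemoteQuorumTargets targets = new RemoteQuorumTargets(
                Map.of("cluster1", "cluster2",
                       "cluster2", "cluster3",
                       "cluster3", "cluster1"));
        System.out.println(targets.targetFor("cluster1")); // Optional[cluster2]
    }
}
{code}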

We will add a feature to override the read/write consistency level on the server 
side. When local replicas are not available, the server-side consistency level will 
be overridden from EACH_QUORUM / LOCAL_QUORUM to remote quorum. *Note that* 
implementing this change on the client side would require protocol changes in CQL, 
so we only add it on the server side, where it can be used by server internals.
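
A minimal sketch of the override decision follows, assuming a REMOTE_QUORUM 
consistency level, a feature flag, and a simple alive-replica count. It models the 
proposed behaviour with standalone types and does not use Cassandra's internal classes.
{code:java}
// Simplified model of the proposed server-side override. REMOTE_QUORUM,
// the feature flag, and the alive-replica check are assumptions for
// illustration; this is not Cassandra's internal coordinator code.
enum Consistency { LOCAL_QUORUM, EACH_QUORUM, REMOTE_QUORUM }

final class ConsistencyOverride
{
    private final boolean fallbackEnabled;      // feature flag, toggled at runtime
    private final int localReplicationFactor;   // e.g. 3 in the local region

    ConsistencyOverride(boolean fallbackEnabled, int localReplicationFactor)
    {
        this.fallbackEnabled = fallbackEnabled;
        this.localReplicationFactor = localReplicationFactor;
    }

    /** Requested level, downgraded to REMOTE_QUORUM when the local region cannot meet quorum. */
    Consistency effective(Consistency requested, int aliveLocalReplicas)
    {
        int localQuorum = localReplicationFactor / 2 + 1;
        boolean localQuorumReachable = aliveLocalReplicas >= localQuorum;

        if (fallbackEnabled
            && !localQuorumReachable
            && (requested == Consistency.LOCAL_QUORUM || requested == Consistency.EACH_QUORUM))
            return Consistency.REMOTE_QUORUM;

        return requested;
    }
}
{code}
With a replication factor of 3 and only one local replica alive, effective(LOCAL_QUORUM, 1) 
returns REMOTE_QUORUM when the flag is on, and the original LOCAL_QUORUM otherwise.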
h1. *Implementations*

We propose the following features for Cassandra to address regional failure 
scenarios:
 * Introduce a new consistency level called remote quorum.
 * Add a feature to override the read/write consistency level on the server side, 
controlled by a feature flag, with a nodetool command to turn the server-side 
fallback on and off (see the sketch after this list). 
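
The on/off switch could be exposed over JMX so that a nodetool command can flip it at 
runtime. The interface below is a hypothetical sketch; its name, its methods, and any 
nodetool command wired to it are assumptions, not part of Cassandra today.
{code:java}
// Hypothetical JMX surface for the feature flag; the interface and method
// names are assumptions, not existing Cassandra APIs. A nodetool command
// (name to be decided) would read and write this flag at runtime.
public interface RemoteQuorumFallbackMBean
{
    boolean isRemoteQuorumFallbackEnabled();

    void setRemoteQuorumFallbackEnabled(boolean enabled);
}
{code}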

 


> Cassandra Cross Data Center Read/Write with Remote Quorum During Data Center 
> Failure
> ------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-21049
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21049
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Consistency/Coordination
>            Reporter: Qiaochu Liu
>            Priority: Normal
>         Attachments: detailed.png
>


