[ https://issues.apache.org/jira/browse/CASSANDRA-21049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Qiaochu Liu updated CASSANDRA-21049:
------------------------------------
Description:
h1. *Background*
NetworkTopologyStrategy is the most commonly used replication strategy at Uber, and
we use LOCAL_QUORUM for reads and writes in many use cases. Our Cassandra deployment
in each region currently relies on a majority of the local replicas being healthy to
consistently achieve LOCAL_QUORUM.
h1. *Current behavior*
When a local region in a Cassandra deployment experiences an outage, network
isolation, or a maintenance event, reads and writes at the EACH_QUORUM and
LOCAL_QUORUM consistency levels fail if too many replicas in that region are
unavailable to form a quorum (for example, two of three replicas when the region's
replication factor is 3). In this configuration, the simultaneous unavailability of
several hosts can temporarily prevent the cluster from reaching the required quorum
for reads and writes. For applications that require high availability and a
seamless user experience, this can lead to service downtime and a noticeable drop
in overall availability.
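To make the failure condition concrete, the following sketch computes the quorum size as floor(RF / 2) + 1 and checks LOCAL_QUORUM and EACH_QUORUM against a count of live replicas per region. It is written in Python purely for illustration and is not Cassandra code; the region names, replication factors, and plain live-replica counts are assumptions (a real coordinator relies on the failure detector and replica placement, not a simple count).

```python
# Illustrative sketch, not Cassandra code: quorum arithmetic per region.


def quorum(rf: int) -> int:
    """Majority of rf replicas: floor(rf / 2) + 1."""
    return rf // 2 + 1


def local_quorum_ok(rf_by_dc: dict[str, int], live_by_dc: dict[str, int], local_dc: str) -> bool:
    """LOCAL_QUORUM needs a majority of the replicas in the coordinator's region."""
    return live_by_dc.get(local_dc, 0) >= quorum(rf_by_dc[local_dc])


def each_quorum_ok(rf_by_dc: dict[str, int], live_by_dc: dict[str, int]) -> bool:
    """EACH_QUORUM needs a majority of the replicas in every region."""
    return all(live_by_dc.get(dc, 0) >= quorum(rf) for dc, rf in rf_by_dc.items())


if __name__ == "__main__":
    rf = {"cluster1": 3, "cluster2": 3, "cluster3": 3}
    # Two of the three replicas in cluster1 are down.
    live = {"cluster1": 1, "cluster2": 3, "cluster3": 3}
    print(local_quorum_ok(rf, live, "cluster1"))  # False
    print(each_quorum_ok(rf, live))  # False
    print(local_quorum_ok(rf, live, "cluster2"))  # True
```

With RF 3 per region, losing two replicas in cluster1 breaks both LOCAL_QUORUM (for coordinators in cluster1) and EACH_QUORUM, even though the other regions are fully healthy.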
h1. *Proposed Solution*
To prevent this issue and keep the user experience seamless, we can use a new
*Remote Quorum* consistency level as a fallback when local replicas are
unavailable. As proposed here, Remote Quorum refers to a read or write operation
that achieves quorum (a majority of replicas) in remote regions rather than relying
solely on replicas within the local region.
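The description does not pin down whether "in remote regions" means a single backup region or all remote regions combined. The hypothetical check below takes a list of remote regions and requires a majority of their combined replicas; given only the one backup region, it reduces to that region's own quorum. REMOTE_QUORUM is not an existing Cassandra consistency level, and `remote_quorum_ok` is an illustrative name.

```python
# Hypothetical REMOTE_QUORUM check; REMOTE_QUORUM does not exist in Cassandra today.


def quorum(rf: int) -> int:
    """Majority of rf replicas: floor(rf / 2) + 1."""
    return rf // 2 + 1


def remote_quorum_ok(rf_by_dc: dict[str, int], live_by_dc: dict[str, int], remote_dcs: list[str]) -> bool:
    """True if a majority of the replicas held by remote_dcs (combined) are live."""
    if not remote_dcs:
        raise ValueError("at least one remote region is required")
    total_rf = sum(rf_by_dc[dc] for dc in remote_dcs)
    # Cap each region's live count at its replication factor.
    live = sum(min(live_by_dc.get(dc, 0), rf_by_dc[dc]) for dc in remote_dcs)
    return live >= quorum(total_rf)


if __name__ == "__main__":
    rf = {"cluster1": 3, "cluster2": 3, "cluster3": 3}
    live = {"cluster1": 1, "cluster2": 3, "cluster3": 0}
    print(remote_quorum_ok(rf, live, ["cluster2"]))  # True: 3 >= 2
    print(remote_quorum_ok(rf, live, ["cluster2", "cluster3"]))  # False: 3 < 4
```

The two calls show why the interpretation matters: the same outage satisfies a single-backup-region quorum but not a quorum over all remote regions combined.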
The selected approach for this design is to explicitly configure a backup region
mapping, where each region defines its preferred failover target. For example:
{code:yaml}
backup_regional_cluster:
  cluster1: cluster2
  cluster2: cluster3
  cluster3: cluster1
{code}
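A mapping like this is easy to misconfigure; for instance, a region that maps to itself has no failover target. A minimal validation sketch, assuming the YAML has already been parsed into a dictionary and that every region appears as a key; `validate_backup_mapping` and `backup_region` are hypothetical names.

```python
# Hypothetical validation of the backup_regional_cluster mapping,
# assuming the YAML has already been parsed into a dict.


def validate_backup_mapping(mapping: dict[str, str]) -> None:
    """Raise ValueError if any region lacks a usable failover target."""
    for region, backup in mapping.items():
        if backup == region:
            raise ValueError(f"{region} maps to itself and has no failover target")
        if backup not in mapping:
            raise ValueError(f"{region} maps to unknown region {backup}")


def backup_region(mapping: dict[str, str], local_region: str) -> str:
    """Return the configured failover target for local_region."""
    validate_backup_mapping(mapping)
    if local_region not in mapping:
        raise KeyError(f"no backup configured for {local_region}")
    return mapping[local_region]


if __name__ == "__main__":
    mapping = {"cluster1": "cluster2", "cluster2": "cluster3", "cluster3": "cluster1"}
    print(backup_region(mapping, "cluster3"))  # cluster1
```

Validating the whole mapping up front, rather than only the local entry, surfaces a bad entry on every node at startup instead of only on the node whose region is misconfigured.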
We will add a server-side override of the read/write consistency level. When local
replicas are unavailable, the server will override the write consistency level from
EACH_QUORUM to Remote Quorum. *Note:* implementing this on the client side would
require changes to the CQL native protocol, so we add it only on the server side,
where it can be used only internally by the server.
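The override decision could look roughly like the following sketch. It assumes the coordinator knows the replication factor and live-replica count per region, treats both EACH_QUORUM and LOCAL_QUORUM as eligible for the fallback (the text names EACH_QUORUM, while the background describes LOCAL_QUORUM), and keeps the original level when the backup region cannot reach quorum either. `Consistency`, `REMOTE_QUORUM`, and `effective_consistency` are hypothetical names.

```python
from enum import Enum


class Consistency(Enum):
    LOCAL_QUORUM = "LOCAL_QUORUM"
    EACH_QUORUM = "EACH_QUORUM"
    REMOTE_QUORUM = "REMOTE_QUORUM"  # proposed level, hypothetical name


def quorum(rf: int) -> int:
    """Majority of rf replicas: floor(rf / 2) + 1."""
    return rf // 2 + 1


def effective_consistency(
    requested: Consistency,
    rf_by_dc: dict[str, int],
    live_by_dc: dict[str, int],
    local_dc: str,
    backup_dc: str,
    fallback_enabled: bool,
) -> Consistency:
    """Pick the consistency level the coordinator should actually use."""
    # Feature flag off, or a level this sketch does not override: keep it.
    eligible = (Consistency.LOCAL_QUORUM, Consistency.EACH_QUORUM)
    if not fallback_enabled or requested not in eligible:
        return requested
    # Local quorum is still reachable, so no override is needed.
    if live_by_dc.get(local_dc, 0) >= quorum(rf_by_dc[local_dc]):
        return requested
    # Local quorum is lost; fall back only if the backup region has quorum.
    if live_by_dc.get(backup_dc, 0) >= quorum(rf_by_dc[backup_dc]):
        return Consistency.REMOTE_QUORUM
    # Neither region has quorum: keep the original level and let it fail as today.
    return requested


if __name__ == "__main__":
    rf = {"cluster1": 3, "cluster2": 3, "cluster3": 3}
    live = {"cluster1": 1, "cluster2": 3, "cluster3": 3}
    cl = effective_consistency(Consistency.EACH_QUORUM, rf, live, "cluster1", "cluster2", True)
    print(cl.name)  # REMOTE_QUORUM
```

Returning the requested level when no region can help keeps today's failure behavior unchanged, so enabling the flag never makes a request fail that would otherwise have succeeded.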
h1. *Implementations*
We propose the following features to Cassandra to address regional failure
scenarios:
* Introduce a new consistency level called Remote Quorum.
* Add a server-side read/write consistency level override, controlled by a feature
flag. Use a nodetool command to turn the server-side fallback on or off.
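For the on/off switch, one possibility is a process-wide flag that an operator command flips at runtime. nodetool talks to Cassandra over JMX, so in Cassandra this would most likely be a JMX-exposed setting; the Python sketch below only models the flag and a hypothetical command handler. `FallbackSwitch`, `handle_admin_command`, and the command words `enable`, `disable`, and `status` are assumptions, not existing nodetool subcommands, and defaulting to off is likewise an assumption.

```python
import threading


class FallbackSwitch:
    """Hypothetical process-wide switch for the server-side consistency fallback."""

    def __init__(self, enabled: bool = False) -> None:
        # threading.Event is a thread-safe boolean flag.
        self._flag = threading.Event()
        if enabled:
            self._flag.set()

    def enable(self) -> None:
        self._flag.set()

    def disable(self) -> None:
        self._flag.clear()

    def is_enabled(self) -> bool:
        return self._flag.is_set()


def handle_admin_command(switch: FallbackSwitch, command: str) -> str:
    """Hypothetical handler for an operator command, e.g. one issued via nodetool."""
    if command == "enable":
        switch.enable()
    elif command == "disable":
        switch.disable()
    elif command != "status":
        raise ValueError(f"unknown command: {command}")
    return "enabled" if switch.is_enabled() else "disabled"


if __name__ == "__main__":
    switch = FallbackSwitch()  # off by default in this sketch
    print(handle_admin_command(switch, "status"))  # disabled
    print(handle_admin_command(switch, "enable"))  # enabled
```

Request-handling threads would only read `is_enabled()`, so toggling the flag takes effect on the next request without a restart.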
was:
h1. *Background*
NetworkTopologyStrategy is the most commonly used strategy at Uber, and we use
Local_Quorum for read/write in many use cases. Our Cassandra deployment in each
region currently relies on majority replicas being healthy to consistently
achieve local quorum.
h1. *Current behavior*
When a local region in a Cassandra deployment experiences outages, network
isolation, or maintenance events, the EACH_QUORUM / LOCAL_QUORUM consistency
level will fail for both reads and writes if enough replicas in that local
region are unavailable. In this configuration, simultaneous hosts
unavailability can temporarily prevent the cluster from reaching the required
quorum for reads and writes. For applications that require high availability
and a seamless user experience, this can lead to service downtime and a
noticeable drop in overall availability.
h1. *Proposed Solution*
To prevent this issue and ensure a seamless user experience, we can use the
*Remote Quorum* consistency level as a fallback mechanism in scenarios where
local replicas are unavailable. Remote Quorum in Cassandra refers to a read or
write operation that achieves quorum (a majority of replicas) across remote
regions, rather than relying solely on replicas within the local region.
The selected approach for this design is to explicitly configure a backup
region mapping for the local region, where each region defines its preferred
failover target. For example
|backup_regional_cluster:
cluster1: cluster2
cluster2: cluster3
cluster3: cluster3|
We will add a feature to do read/write consistency level override on the server
side. When local replicas are not available, we will overwrite the server side
write consistency level from each quorum to remote quorum. *Note that,*
implementing this change in client side will require some protocol changes in
CQL, we only add this on server side which can only be used by server internal.
h1. *Implementations*
We proposed the following feature to Cassandra to address regional failure
scenarios
* Introduce a new Consistency level called remote quorum
* Feature to do read/write consistency level override on server side. (This
can be controlled by a feature flag). Use Node tools command to turn on/off the
server failback
> Cassandra Cross Region Read/Write with Remote Quorum During Regional Failure
> ----------------------------------------------------------------------------
>
> Key: CASSANDRA-21049
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21049
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Consistency/Coordination
> Reporter: Qiaochu Liu
> Priority: Normal
> Attachments: detailed.png
--
This message was sent by Atlassian Jira
(v8.20.10#820010)