[
https://issues.apache.org/jira/browse/CASSANDRA-17572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ekaterina Dimitrova updated CASSANDRA-17572:
--------------------------------------------
Fix Version/s: 3.0.x, 3.11.x
> Race condition when IP address changes for a node can cause reads/writes to
> route to the wrong node
> ---------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-17572
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17572
> Project: Cassandra
> Issue Type: Bug
> Components: Cluster/Membership
> Reporter: Sam Kramer
> Priority: Normal
> Fix For: 3.0.x, 3.11.x, 4.x
>
>
> Hi,
> We noticed that there is a race condition present in the trunk of 3.x code,
> and confirmed that it’s there in 4.x as well, which will result in incorrect
> reads, and missed writes, for a very short period of time.
> The race condition came to our attention when we started noticing a couple
> of missed writes on our Cassandra clusters in Kubernetes. The Kubernetes
> angle is relevant because IP address changes are far more frequent there
> than in a traditional setup.
> More concretely:
> # When a Cassandra node is turned off, and then starts with a new IP address
> Z (former IP address X), it announces to the cluster (via gossip) it has IP Z
> for Host ID Y
> # If there are no conflicts, each node will decide to remove the old IP
> address associated with Host ID Y
> ([https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/service/StorageService.java#L2529-L2532])
> from the storage ring. This also causes us to invalidate our token ring
> cache
> ([https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/locator/TokenMetadata.java#L488]
> ).
> # At this time, a new request could come in (read or write), and will
> re-calculate which endpoints to send the request to, as we’ve invalidated our
> token ring cache
> ([https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/locator/AbstractReplicationStrategy.java#L88-L104]).
> # However, at this point we’ve only removed IP address X (the former IP
> address) and have not yet added IP address Z.
> # As a result, we will choose a new host to route our request to. In our
> case, our keyspaces all run with NetworkTopologyStrategy, and so we simply
> choose the node with the next closest token in the same rack as host Y
> ([https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java#L149-L191]).
> # Thus, the request is routed to a _different_ host, rather than the host
> that has come back online.
> # However, shortly afterwards, we re-add the host (via its _new_ endpoint) to
> the token ring
> [https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/service/StorageService.java#L2549]
> # This invalidates our token ring cache again, after which requests are once
> more routed to the correct host.
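
To make the window in the steps above concrete, here is a minimal sketch in Java. It is not Cassandra's code: the class `RingRaceSketch`, the method `replicaFor`, the tokens and the IP addresses are all hypothetical, and the ring is reduced to one token per node with no racks, datacenters or replication factor. It only shows that a lookup made between "remove X" and "add Z" falls through to the next token on the ring.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy token ring (token -> endpoint). Illustrative only, not Cassandra's TokenMetadata. */
class RingRaceSketch {

    /** Returns the endpoint owning the first token >= key, wrapping around; assumes a non-empty ring. */
    static String replicaFor(TreeMap<Long, String> ring, long key) {
        Map.Entry<Long, String> owner = ring.ceilingEntry(key);
        if (owner == null) {
            owner = ring.firstEntry(); // wrap around to the start of the ring
        }
        return owner.getValue();
    }

    public static void main(String[] args) {
        TreeMap<Long, String> ring = new TreeMap<>();
        ring.put(100L, "10.0.0.1");
        ring.put(200L, "10.0.0.2"); // host ID Y at its old IP X (hypothetical address)
        ring.put(300L, "10.0.0.3");

        long key = 150L; // a partition token owned by host Y

        System.out.println("before:        " + replicaFor(ring, key)); // prints 10.0.0.2

        ring.remove(200L); // old IP X removed, new IP Z not yet added
        System.out.println("in the window: " + replicaFor(ring, key)); // prints 10.0.0.3

        ring.put(200L, "10.0.0.9"); // host Y re-added at its new IP Z
        System.out.println("after:         " + replicaFor(ring, key)); // prints 10.0.0.9
    }
}
```

In this model, any request that resolves its endpoints during the middle state is sent to 10.0.0.3, which does not own the key, even though host Y is already back online.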
> A couple of additional thoughts:
> - This doesn’t affect clusters where nodes <= RF with network topology
> strategy.
> - During this very brief period of time, the CL for all user queries is
> violated, but the queries are still ACK’d as successful.
> - It’s easy to reproduce this race condition by simply adding a sleep here
> ([https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/service/StorageService.java#L2529-L2532])
> - If a cleanup is not run before any range movement, it’s possible for rows
> that were temporarily written to the wrong node to re-appear.
> - We tested that the race condition exists in our Cassandra 2.x fork (we're
> not on 3.x or 4.x). So it is possible that it only affects Cassandra 2.x,
> though that seems unlikely from reading the code.
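
The first two bullets above (no impact when nodes <= RF, and CL being violated otherwise) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the class `ReplicaSetSketch`, its methods, the tokens and the addresses are hypothetical, and the replica walk is a plain clockwise walk with one token per node, not NetworkTopologyStrategy's rack-aware selection.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

/** Toy replica selection: walk the ring clockwise and take the first rf distinct endpoints. */
class ReplicaSetSketch {

    static List<String> replicasFor(TreeMap<Long, String> ring, long key, int rf) {
        Set<String> replicas = new LinkedHashSet<>();
        // Tokens at or after the key first, then wrap around to the start of the ring.
        List<String> walk = new ArrayList<>(ring.tailMap(key, true).values());
        walk.addAll(ring.headMap(key, false).values());
        for (String endpoint : walk) {
            if (replicas.size() == rf) {
                break;
            }
            replicas.add(endpoint);
        }
        return new ArrayList<>(replicas);
    }

    /** Replicas seen while the endpoint at oldToken is removed and its new IP is not yet added. */
    static List<String> replicasDuringWindow(TreeMap<Long, String> ring, long oldToken, long key, int rf) {
        TreeMap<Long, String> window = new TreeMap<>(ring); // copy, so the caller's ring is untouched
        window.remove(oldToken);
        return replicasFor(window, key, rf);
    }

    public static void main(String[] args) {
        TreeMap<Long, String> ring = new TreeMap<>();
        ring.put(100L, "10.0.0.1");
        ring.put(200L, "10.0.0.2"); // the node whose IP is changing
        ring.put(300L, "10.0.0.3");

        // 3 nodes, RF = 3: the window only shrinks the replica set, no new node appears.
        System.out.println(replicasFor(ring, 150L, 3));                // [10.0.0.2, 10.0.0.3, 10.0.0.1]
        System.out.println(replicasDuringWindow(ring, 200L, 150L, 3)); // [10.0.0.3, 10.0.0.1]

        // 4 nodes, RF = 3: the window pulls in 10.0.0.1, which is not a replica for this key.
        ring.put(400L, "10.0.0.4");
        System.out.println(replicasFor(ring, 150L, 3));                // [10.0.0.2, 10.0.0.3, 10.0.0.4]
        System.out.println(replicasDuringWindow(ring, 200L, 150L, 3)); // [10.0.0.3, 10.0.0.4, 10.0.0.1]
    }
}
```

In the four-node case a write acknowledged by 10.0.0.1 during the window counts towards the requested consistency level although that node is not a replica for the key, which is how a query can be ACK'd as successful while CL is effectively violated.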