Ines Potier created CASSANDRA-18319:
---------------------------------------
Summary: Cassandra in Kubernetes: IP switch decommission issue
Key: CASSANDRA-18319
URL: https://issues.apache.org/jira/browse/CASSANDRA-18319
Project: Cassandra
Issue Type: Bug
Reporter: Ines Potier
We have recently encountered a recurring old IP reappearance issue while
testing decommissions on some of our Kubernetes Cassandra staging clusters.
*Issue Description*
In Kubernetes, a Cassandra node can change its IP at each pod bounce. We have
noticed that this behavior, combined with a decommission operation, can get
the cluster into an erroneous state.
Consider the following situation: a Cassandra node {{node1}}, with
{{hostId1}}, owning 20.5% of the token ring, bounces and switches IP
({{old_IP}} → {{new_IP}}). After a couple of gossip iterations, every other
node's nodetool status output includes a {{new_IP}} UN entry owning 20.5% of
the token ring and no {{old_IP}} entry.
Shortly after the bounce, {{node1}} gets decommissioned. Our cluster does not
hold much data, so the decommission operation completes quickly. Logs on
other nodes start acknowledging that {{node1}} has left, and soon the
{{new_IP}} UL entry disappears from nodetool status. {{node1}}'s pod is
deleted.
After a delay of about a minute, the cluster enters the erroneous state: an
{{old_IP}} DN entry reappears in nodetool status, owning 20.5% of the token
ring. No node owns this IP anymore, and according to the logs, {{old_IP}} is
still associated with {{hostId1}}.
*Issue Root Cause*
After digging through Cassandra logs and repeatedly re-testing this scenario,
we reached the following conclusions:
* Other nodes continue exchanging gossip about {{old_IP}}, even after it
becomes a fatClient.
* The fatClient timeout and subsequent quarantine do not stop {{old_IP}}
from reappearing in a node's Gossip state once its quarantine is over. We
believe this is due to a misalignment of {{old_IP}}'s expiration time across
nodes.
* Once {{new_IP}} has left the cluster and {{old_IP}}'s next gossip state
message is received by a node, StorageService no longer faces a collision
(or faces one with an even older IP) for {{hostId1}} and its corresponding
tokens. As a result, {{old_IP}} regains ownership of 20.5% of the token
ring.
*Proposed fix*
Following the above investigation, we are considering the following fix:
When a node receives a gossip status change with {{STATE_LEFT}} for a leaving
endpoint {{new_IP}}, before evicting {{new_IP}} from the token ring, purge
from Gossip (i.e. {{evictFromMembership}}) all endpoints that meet the
following criteria:
* {{endpointStateMap}} contains this endpoint
* The endpoint is not currently a token owner
({{!tokenMetadata.isMember(endpoint)}})
* The endpoint's {{hostId}} matches the {{hostId}} of {{new_IP}}
* The endpoint is older than {{new_IP}}
({{Gossiper.instance.compareEndpointStartup}})
* The endpoint's token range (from {{endpointStateMap}}) intersects with
{{new_IP}}'s
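To make the purge criteria above concrete, here is a standalone Java sketch that models them outside Cassandra. The {{EndpointState}} class, its fields, and {{shouldPurge}} are hypothetical stand-ins for Gossip's per-endpoint state, not the actual Cassandra API; token-ring membership (the {{tokenMetadata.isMember}} check) is passed in as a flag, and gossip generation stands in for endpoint startup ordering.

```java
import java.util.*;

// Standalone model of the proposed purge criteria (illustrative only, not
// Cassandra code): when STATE_LEFT is seen for new_IP, purge any stale
// endpoint that shares new_IP's hostId, is no longer a token-ring member,
// started earlier than new_IP, and has an overlapping token range.
public class StaleEndpointPurge {
    // Minimal stand-in for an entry in Gossip's endpointStateMap.
    static final class EndpointState {
        final String hostId;
        final long generation;       // gossip generation ~ startup ordering
        final Set<Integer> tokens;   // simplified token set

        EndpointState(String hostId, long generation, Set<Integer> tokens) {
            this.hostId = hostId;
            this.generation = generation;
            this.tokens = tokens;
        }
    }

    static boolean shouldPurge(EndpointState candidate,
                               EndpointState leaving,
                               boolean candidateIsTokenOwner) {
        return !candidateIsTokenOwner                          // not a ring member
            && candidate.hostId.equals(leaving.hostId)         // same hostId
            && candidate.generation < leaving.generation       // candidate is older
            && !Collections.disjoint(candidate.tokens, leaving.tokens); // tokens overlap
    }

    public static void main(String[] args) {
        EndpointState oldIp = new EndpointState("hostId1", 100, Set.of(1, 2, 3));
        EndpointState newIp = new EndpointState("hostId1", 200, Set.of(1, 2, 3));

        // old_IP: same hostId, older, overlapping tokens, not a ring member -> purge
        System.out.println(shouldPurge(oldIp, newIp, false)); // true
        // new_IP itself is still a token owner at this point -> keep
        System.out.println(shouldPurge(newIp, newIp, true));  // false
    }
}
```

In the real fix, the purge would presumably run where the {{STATE_LEFT}} status change is handled, iterating the entries of {{endpointStateMap}} and calling {{evictFromMembership}} on each match.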
The intent of this modification is to force nodes to realign on {{old_IP}}'s
expiration and expunge it from Gossip, so that it does not reappear after
{{new_IP}} leaves the ring.
Another approach we have been considering is expunging {{old_IP}} at the
moment of the StorageService collision resolution.