[ 
https://issues.apache.org/jira/browse/CASSANDRA-20033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893055#comment-17893055
 ] 

Brandon Williams commented on CASSANDRA-20033:
----------------------------------------------

bq. Assuming we have a very large delay in network

That is quite a delay!  But this does all make sense to me and adding a 
generation check sounds like the correct solution.
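The proposed check could be sketched roughly as follows. This is a minimal, self-contained simulation of the idea, not Cassandra's actual code: EndpointState and applyShutdown here are simplified stand-ins for the real gossip state and GossipShutdownVerbHandler, and the generation values are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

public class GossipShutdownCheck {
    // Simplified local view of a peer's gossip state (hypothetical shape).
    static final class EndpointState {
        int generation;
        String status;
        EndpointState(int generation, String status) {
            this.generation = generation;
            this.status = status;
        }
    }

    static final Map<String, EndpointState> localStates = new HashMap<>();

    // Hypothetical handler: apply the shutdown only if the message's
    // generation is at least the locally known generation. A delayed
    // GOSSIP_SHUTDOWN carrying an older generation is ignored.
    static boolean applyShutdown(String endpoint, int msgGeneration) {
        EndpointState state = localStates.get(endpoint);
        if (state == null || msgGeneration < state.generation) {
            return false; // stale message from before the restart: ignore it
        }
        state.status = "shutdown";
        return true;
    }

    public static void main(String[] args) {
        // Node 1.1.1.1 restarted and we already learned its new generation.
        localStates.put("1.1.1.1", new EndpointState(1729812724, "NORMAL"));

        // A delayed GOSSIP_SHUTDOWN from the previous incarnation arrives.
        boolean appliedStale = applyShutdown("1.1.1.1", 1729800000);
        System.out.println("stale shutdown applied: " + appliedStale); // false
        System.out.println("status: " + localStates.get("1.1.1.1").status); // NORMAL
    }
}
```

Without the generation comparison, the stale message would flip the status to shutdown even though the node is already back up, which is exactly the state divergence reported below.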

> Shutdown message doesn't have generation check causing normal node considered 
> shutdown by other nodes in cluster
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-20033
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-20033
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Gossip
>            Reporter: Runtian Liu
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>
> Recently we ran into an issue during a rolling restart of a cluster. We 
> found that one node was UN in its own gossip state, while all other nodes 
> in the same cluster considered it DN.
> Gossip info from the node that the other nodes consider down:
>  
> {code:java}
> /1.1.1.1
>   generation:1729812724
>   heartbeat:20676
>   STATUS:26:NORMAL,-1215648874011476782
>   LOAD:20620:8.030878944E10
>   SCHEMA:41:e3b3d9f6-e2e2-307b-959e-b493bd1f7bef
>   DC:13:dc4
>   RACK:15:dc4-0
>   RELEASE_VERSION:6:4.1.3
>   INTERNAL_IP:11:1.1.1.1
>   RPC_ADDRESS:5:1.1.1.1
>   NET_VERSION:2:12
>   HOST_ID:3:65a87fef-e7f8-41f4-8cdf-9d8ea4f5e4f0
>   RPC_READY:161:true
>   INTERNAL_ADDRESS_AND_PORT:9:1.1.1.1:27378
>   NATIVE_ADDRESS_AND_PORT:4:1.1.1.1:27379
>   STATUS_WITH_PORT:25:NORMAL,-1215648874011476782
>   SSTABLE_VERSIONS:7:big-nb
>   TOKENS:24:<hidden> {code}
> Gossip state from other nodes for this node:
>  
> {code:java}
> /1.1.1.1
>   generation:1729812724
>   heartbeat:2147483647
>   STATUS:332020:shutdown,true
>   LOAD:30:8.032911052E10
>   SCHEMA:19:e3b3d9f6-e2e2-307b-959e-b493bd1f7bef
>   DC:13:dc4
>   RACK:15:dc4-0
>   RELEASE_VERSION:6:4.1.3
>   INTERNAL_IP:11:1.1.1.1
>   RPC_ADDRESS:5:1.1.1.1
>   NET_VERSION:2:12
>   HOST_ID:3:65a87fef-e7f8-41f4-8cdf-9d8ea4f5e4f0
>   RPC_READY:332021:false
>   INTERNAL_ADDRESS_AND_PORT:9:1.1.1.1:27378
>   NATIVE_ADDRESS_AND_PORT:4:1.1.1.1:27379
>   STATUS_WITH_PORT:332020:shutdown,true
>   SSTABLE_VERSIONS:7:big-nb
>   TOKENS:24:<hidden> {code}
> They share the same generation, but the other nodes consider the node to 
> be shut down.
>  
>  
> After a closer look into the problem, I think here's what happened.
> When the node gets restarted:
> 1. It is first gracefully shut down, and it broadcasts the GOSSIP_SHUTDOWN 
> message to the rest of the cluster.
> 2. When it comes back up, it updates its generation and gossips with 
> other nodes.
>  
> If a node receives the new generation for 1.1.1.1 first, and only then 
> receives the GOSSIP_SHUTDOWN message from step 1 (assuming a very large 
> network delay between the 1.1.1.1 node and the affected receiver node), we 
> run into the situation above.
>  
> I think the GOSSIP_SHUTDOWN message should carry the generation, and 
> GossipShutdownVerbHandler should bump the heartbeat for the local state 
> only if the generation is the same. If the local generation is higher, it 
> should ignore the shutdown message. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)