David Capwell created CASSANDRA-18913:
-----------------------------------------

             Summary: Gossip NPE due to shutdown event corrupting empty statuses
                 Key: CASSANDRA-18913
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18913
             Project: Cassandra
          Issue Type: Bug
          Components: Cluster/Gossip, Cluster/Membership
            Reporter: David Capwell
            Assignee: David Capwell


When an instance either disables gossip or shuts down, it sends a gossip 
shutdown message. Peers ignore the message if the endpoint isn’t known; 
otherwise each peer mutates its local copy of the endpoint's state to mark it 
as shutdown.
When an instance restarts, it populates gossip with the endpoints found in 
peers, but the state for each endpoint is empty (not null).

So there is a fun timing bug:

* stop node1
* start node1; at this point all previously known endpoints exist in gossip, 
but their state is empty
* shut down node2 (gossip shutdown or full node shutdown, it doesn’t matter)
* node1 sees the shutdown message before any regular gossip messages, and its 
state gets corrupted
* node3 tries to join the cluster and fails because node1 is corrupted
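
The sequence above can be modeled with a small, self-contained sketch. This is 
not Cassandra's code: the class GossipShutdownRace, its EndpointState type, and 
the string keys are hypothetical stand-ins, assuming only what is described 
here (restart registers known peers with empty state, a shutdown message 
mutates state for known endpoints, and a HOST_ID lookup fails when the key is 
absent).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class GossipShutdownRace {
    // Hypothetical stand-in for the per-endpoint application state map.
    static final class EndpointState {
        final Map<String, String> appStates = new HashMap<>();
    }

    // Local gossip view: endpoint name -> state.
    static final Map<String, EndpointState> endpointStates = new HashMap<>();

    // On restart, previously known peers are registered with empty (not null) state.
    static void populateFromPeers(String... peers) {
        for (String peer : peers)
            endpointStates.put(peer, new EndpointState());
    }

    // Shutdown message: ignored for unknown endpoints, otherwise mutates the local copy.
    static void onShutdownMessage(String endpoint) {
        EndpointState state = endpointStates.get(endpoint);
        if (state == null)
            return; // unknown endpoint, nothing to mark
        state.appStates.put("STATUS", "shutdown");
        state.appStates.put("RPC_READY", "false");
    }

    // Lookup that assumes HOST_ID was already gossiped; throws NPE when it was not.
    static String getHostId(String endpoint) {
        EndpointState state = endpointStates.get(endpoint);
        return Objects.requireNonNull(state.appStates.get("HOST_ID"),
                                      "Unable to get HOST_ID; HOST_ID is not defined");
    }

    public static void main(String[] args) {
        populateFromPeers("node2", "node3"); // node1 restarts
        onShutdownMessage("node2");          // shutdown arrives before normal gossip
        try {
            getHostId("node2");
        } catch (NullPointerException e) {
            System.out.println("NPE: " + e.getMessage());
        }
    }
}
```

In this model the entry for node2 ends up with STATUS and RPC_READY but no 
HOST_ID, which matches the shape of the AppStateMap printed in the exceptions 
below.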

The NPE can surface in two different patterns; in this example, node1 and 
node3 will have different stack traces.

{code}
org.apache.cassandra.distributed.shared.ShutdownException: Uncaught exceptions were thrown during test
        Suppressed: java.lang.NullPointerException: Unable to get HOST_ID; HOST_ID is not defined, given EndpointState: HeartBeatState = HeartBeat: generation = 0, version = 2147483647, AppStateMap = {STATUS=Value(shutdown,true,37), RPC_READY=Value(false,38), STATUS_WITH_PORT=Value(shutdown,true,36)}
                at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1218)
                at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1208)
                at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:3279)
                at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2756)
                at org.apache.cassandra.gms.Gossiper.markAsShutdown(Gossiper.java:611)
                at org.apache.cassandra.gms.GossipShutdownVerbHandler.doVerb(GossipShutdownVerbHandler.java:39)
                at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
        Suppressed: java.lang.NullPointerException: Unable to get HOST_ID; HOST_ID is not defined, given EndpointState: HeartBeatState = HeartBeat: generation = 0, version = 2147483647, AppStateMap = {STATUS=Value(shutdown,true,37), RPC_READY=Value(false,38), STATUS_WITH_PORT=Value(shutdown,true,36)}
                at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1218)
                at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1208)
                at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:3279)
                at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2756)
                at org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1762)
                at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:3793)
                at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1465)
                at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1678)
                at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:50)
                at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)

{code}



