David Capwell created CASSANDRA-18913:
-----------------------------------------
Summary: Gossip NPE due to shutdown event corrupting empty statuses
Key: CASSANDRA-18913
URL: https://issues.apache.org/jira/browse/CASSANDRA-18913
Project: Cassandra
Issue Type: Bug
Components: Cluster/Gossip, Cluster/Membership
Reporter: David Capwell
Assignee: David Capwell
When an instance either disables gossip or shuts down we send a gossip shutdown
message, peers ignore it if the endpoint isn’t known, else it mutates its local
copy of the state to mark shutdown…
When an instance restarts it populates gossip with the endpoints found in
peers, but the state is empty (not null)
So, there is a fun timing bug…
* stop node1
* start node1; at this point all known endpoints before exist in gossip but are
empty
* node2 shutdown (gossip shutdown or node, doesn’t matter)
* node1 sees the shutdown before gossip messages, and gets corruptted
* node3 tries to join the cluster, fails due to node1 being corrupted
There are 2 different patterns the NPE can happen with, in this example node1
and node3 will have different stack traces
{code}
org.apache.cassandra.distributed.shared.ShutdownException: Uncaught exceptions
were thrown during test
Suppressed: java.lang.NullPointerException: Unable to get HOST_ID;
HOST_ID is not defined, given EndpointState: HeartBeatState = HeartBeat:
generation = 0, version = 2147483647, AppStateMap =
{STATUS=Value(shutdown,true,37), RPC_READY=Value(false,38),
STATUS_WITH_PORT=Value(shutdown,true,36)}
at
org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1218)
at
org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1208)
at
org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:3279)
at
org.apache.cassandra.service.StorageService.onChange(StorageService.java:2756)
at
org.apache.cassandra.gms.Gossiper.markAsShutdown(Gossiper.java:611)
at
org.apache.cassandra.gms.GossipShutdownVerbHandler.doVerb(GossipShutdownVerbHandler.java:39)
at
org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
Suppressed: java.lang.NullPointerException: Unable to get HOST_ID;
HOST_ID is not defined, given EndpointState: HeartBeatState = HeartBeat:
generation = 0, version = 2147483647, AppStateMap =
{STATUS=Value(shutdown,true,37), RPC_READY=Value(false,38),
STATUS_WITH_PORT=Value(shutdown,true,36)}
at
org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1218)
at
org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1208)
at
org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:3279)
at
org.apache.cassandra.service.StorageService.onChange(StorageService.java:2756)
at
org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1762)
at
org.apache.cassandra.service.StorageService.onJoin(StorageService.java:3793)
at
org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1465)
at
org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1678)
at
org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:50)
at
org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
{code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]