[ https://issues.apache.org/jira/browse/CASSANDRA-18913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773748#comment-17773748 ]
David Capwell commented on CASSANDRA-18913:
-------------------------------------------
Starting commit
CI Results (pending):
||Branch||Source||Circle CI||Jenkins||
|cassandra-4.0|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-18913-cassandra-4.0-5C2AE581-B002-4057-8281-7C893EB64CF9]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-18913-cassandra-4.0-5C2AE581-B002-4057-8281-7C893EB64CF9]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/2614/]|
|cassandra-4.1|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-18913-cassandra-4.1-5C2AE581-B002-4057-8281-7C893EB64CF9]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-18913-cassandra-4.1-5C2AE581-B002-4057-8281-7C893EB64CF9]|[build|unknown]|
|cassandra-5.0|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-18913-cassandra-5.0-5C2AE581-B002-4057-8281-7C893EB64CF9]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-18913-cassandra-5.0-5C2AE581-B002-4057-8281-7C893EB64CF9]|[build|unknown]|
|trunk|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-18913-trunk-5C2AE581-B002-4057-8281-7C893EB64CF9]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-18913-trunk-5C2AE581-B002-4057-8281-7C893EB64CF9]|[build|unknown]|
> Gossip NPE due to shutdown event corrupting empty statuses
> ----------------------------------------------------------
>
> Key: CASSANDRA-18913
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18913
> Project: Cassandra
> Issue Type: Bug
> Components: Cluster/Gossip, Cluster/Membership
> Reporter: David Capwell
> Assignee: David Capwell
> Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> When an instance either disables gossip or shuts down, it sends a gossip
> shutdown message. Peers ignore the message if the endpoint isn’t known;
> otherwise each peer mutates its local copy of the state to mark it as shut down.
> When an instance restarts, it populates gossip with the endpoints found in
> peers, but their state is empty (not null).
> So, there is a fun timing bug:
> * stop node1
> * start node1; at this point all previously known endpoints exist in gossip
> but are empty
> * node2 shuts down (gossip shutdown or node shutdown, it doesn’t matter)
> * node1 sees the shutdown before any gossip messages, and gets corrupted
> * node3 tries to join the cluster and fails because node1 is corrupted
> There are 2 different patterns the NPE can happen with; in this example node1
> and node3 will have different stack traces:
> {code}
> org.apache.cassandra.distributed.shared.ShutdownException: Uncaught exceptions were thrown during test
>   Suppressed: java.lang.NullPointerException: Unable to get HOST_ID; HOST_ID is not defined, given EndpointState: HeartBeatState = HeartBeat: generation = 0, version = 2147483647, AppStateMap = {STATUS=Value(shutdown,true,37), RPC_READY=Value(false,38), STATUS_WITH_PORT=Value(shutdown,true,36)}
>     at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1218)
>     at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1208)
>     at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:3279)
>     at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2756)
>     at org.apache.cassandra.gms.Gossiper.markAsShutdown(Gossiper.java:611)
>     at org.apache.cassandra.gms.GossipShutdownVerbHandler.doVerb(GossipShutdownVerbHandler.java:39)
>     at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
>   Suppressed: java.lang.NullPointerException: Unable to get HOST_ID; HOST_ID is not defined, given EndpointState: HeartBeatState = HeartBeat: generation = 0, version = 2147483647, AppStateMap = {STATUS=Value(shutdown,true,37), RPC_READY=Value(false,38), STATUS_WITH_PORT=Value(shutdown,true,36)}
>     at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1218)
>     at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:1208)
>     at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:3279)
>     at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2756)
>     at org.apache.cassandra.gms.Gossiper.doOnChangeNotifications(Gossiper.java:1762)
>     at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:3793)
>     at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1465)
>     at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1678)
>     at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:50)
>     at org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
> {code}
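The sequence in the description can be modeled with a small, self-contained Java sketch. This is a toy model only: GossipShutdownSketch, State, loadPeerOnRestart, onShutdownUnguarded and onShutdownGuarded are hypothetical names, not Cassandra classes or methods, and the guard shown (ignoring a shutdown for an endpoint whose state is still empty) is one possible mitigation assumed for illustration, not necessarily what the attached branches do.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model of the race described above; none of these names are Cassandra APIs. */
final class GossipShutdownSketch {

    /** Hypothetical stand-in for an endpoint's gossip state: just a key/value map. */
    static final class State {
        final Map<String, String> appStates = new HashMap<>();

        /** "Empty" models the placeholder created on restart: no application states yet. */
        boolean isEmpty() {
            return appStates.isEmpty();
        }

        Optional<String> hostId() {
            return Optional.ofNullable(appStates.get("HOST_ID"));
        }
    }

    /** Local view of the cluster, keyed by endpoint name. */
    final Map<String, State> endpoints = new HashMap<>();

    /** On restart, known peers are registered with an empty (non-null) state. */
    void loadPeerOnRestart(String endpoint) {
        endpoints.put(endpoint, new State());
    }

    /** Unguarded handler: only unknown endpoints are ignored, so a placeholder gets mutated. */
    void onShutdownUnguarded(String endpoint) {
        State state = endpoints.get(endpoint);
        if (state == null) {
            return; // unknown endpoint: ignored
        }
        state.appStates.put("STATUS", "shutdown");
        state.appStates.put("RPC_READY", "false");
    }

    /** Guarded handler (assumed mitigation): also ignores placeholders. Returns true if applied. */
    boolean onShutdownGuarded(String endpoint) {
        State state = endpoints.get(endpoint);
        if (state == null || state.isEmpty()) {
            return false; // nothing real to mark as shut down yet
        }
        state.appStates.put("STATUS", "shutdown");
        state.appStates.put("RPC_READY", "false");
        return true;
    }

    public static void main(String[] args) {
        GossipShutdownSketch node1 = new GossipShutdownSketch();
        node1.loadPeerOnRestart("node2");   // node1 restarted, node2 is a placeholder
        node1.onShutdownUnguarded("node2"); // shutdown arrives before any gossip state
        State corrupted = node1.endpoints.get("node2");
        // The state now carries STATUS but no HOST_ID, the shape seen in the NPE message.
        System.out.println("states=" + corrupted.appStates.keySet()
                + " hostIdPresent=" + corrupted.hostId().isPresent());
    }
}
```

In this model the unguarded handler leaves node2 with a STATUS entry and no HOST_ID, which is the kind of state a later HOST_ID lookup cannot handle; the guarded variant leaves the placeholder untouched until real gossip state arrives.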