Ke Han created CASSANDRA-18897:
----------------------------------

             Summary: NPE in getHostId when fresh starting up the 3.11.16 cluster
                 Key: CASSANDRA-18897
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18897
             Project: Cassandra
          Issue Type: Bug
          Components: Cluster/Gossip
            Reporter: Ke Han
         Attachments: persistent.tar.gz, system.log

When I started up a Cassandra 3.11.16 cluster (fresh start), I noticed the following NPE in the system log of the seed node.
{code:java}
INFO  [HANDSHAKE-/192.168.77.3] 2023-10-02 12:56:43,180 OutboundTcpConnection.java:561 - Handshaking version with /192.168.77.3
ERROR [GossipStage:1] 2023-10-02 12:56:43,185 CassandraDaemon.java:244 - Exception in thread Thread[GossipStage:1,5,main]
java.lang.NullPointerException: null
    at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:998)
    at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:993)
    at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2632)
    at org.apache.cassandra.service.StorageService.onChange(StorageService.java:2203)
    at org.apache.cassandra.service.StorageService.onJoin(StorageService.java:3020)
    at org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1246)
    at org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1358)
    at org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49)
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:69)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84)
    at java.lang.Thread.run(Thread.java:750)
{code}
h1. Reproduce

The failure is non-deterministic and manifests in 0.15% of cases (6 out of 4000 runs).
h1. Root Cause
{code:java}
public UUID getHostId(InetAddress endpoint, Map<InetAddress, EndpointState> epStates)
{
    return UUID.fromString(epStates.get(endpoint).getApplicationState(ApplicationState.HOST_ID).value);
}
{code}
The chained call dereferences the results of both epStates.get(endpoint) and getApplicationState(ApplicationState.HOST_ID) without a null check, so a missing entry at either step would produce this NPE.

It happens at a {*}fresh start{*}, so my assumption is that StorageService is trying to get the epState of a specific endpoint before that state has been created. However, I am not sure exactly which thread interleaving causes it.

I assume this problem also exists in the 4.x versions, since the logic of StorageService.handleStateNormal is almost the same.

Consequences: This NPE does not prevent the cluster from starting up or executing commands. It is probably benign.

I have attached the logs of my cluster. (system.log contains the logs from the seed node; persistent.tar.gz contains the logs from all three nodes and the configuration.)
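
To illustrate the two places where the lookup in the Root Cause section can hit a null, here is a minimal, self-contained sketch of a null-tolerant variant. It is only an illustration under stated assumptions, not a proposed patch: HostIdLookup, the simplified EndpointState, and the string keys are hypothetical stand-ins and are not Cassandra's real classes or API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// Hypothetical stand-in classes for illustration; NOT Cassandra's real gossip types.
class HostIdLookup {

    /** Simplified endpoint state: application-state name mapped to its string value. */
    static final class EndpointState {
        private final Map<String, String> appStates = new HashMap<>();

        void put(String key, String value) { appStates.put(key, value); }

        String get(String key) { return appStates.get(key); }
    }

    /**
     * Returns the host ID for the endpoint, or an empty Optional when either the
     * endpoint state or its HOST_ID entry has not been recorded yet.
     */
    static Optional<UUID> getHostId(String endpoint, Map<String, EndpointState> epStates) {
        EndpointState state = epStates.get(endpoint);
        if (state == null) {
            return Optional.empty(); // no state known for this endpoint yet
        }
        String hostId = state.get("HOST_ID");
        if (hostId == null) {
            return Optional.empty(); // state exists, but HOST_ID is not set yet
        }
        return Optional.of(UUID.fromString(hostId));
    }

    public static void main(String[] args) {
        Map<String, EndpointState> epStates = new HashMap<>();

        // Case 1: endpoint not present in the map at all.
        System.out.println(getHostId("192.168.77.3", epStates).isPresent());

        // Case 2: endpoint present, HOST_ID missing.
        EndpointState state = new EndpointState();
        epStates.put("192.168.77.3", state);
        System.out.println(getHostId("192.168.77.3", epStates).isPresent());

        // Case 3: endpoint present with a HOST_ID value.
        state.put("HOST_ID", UUID.randomUUID().toString());
        System.out.println(getHostId("192.168.77.3", epStates).isPresent());
    }
}
```

Returning an Optional makes the caller decide what to do when the host ID is not yet known. Whether handleStateNormal should skip, retry, or log in that case is an open question that this sketch does not answer.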