Ke Han created CASSANDRA-18897:
----------------------------------

             Summary: NPE in getHostId when fresh starting up the 3.11.16 
cluster
                 Key: CASSANDRA-18897
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18897
             Project: Cassandra
          Issue Type: Bug
          Components: Cluster/Gossip
            Reporter: Ke Han
         Attachments: persistent.tar.gz, system.log

When I start up cassandra 3.11.16 cluster (Fresh Start), I noticed the 
following NPE in the system log (Seed node).

 
{code:java}
INFO  [HANDSHAKE-/192.168.77.3] 2023-10-02 12:56:43,180 
OutboundTcpConnection.java:561 - Handshaking version with /192.168.77.3
ERROR [GossipStage:1] 2023-10-02 12:56:43,185 CassandraDaemon.java:244 - 
Exception in thread Thread[GossipStage:1,5,main]
java.lang.NullPointerException: null
        at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:998)
        at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:993)
        at 
org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2632)
        at 
org.apache.cassandra.service.StorageService.onChange(StorageService.java:2203)
        at 
org.apache.cassandra.service.StorageService.onJoin(StorageService.java:3020)
        at 
org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1246)
        at 
org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1358)
        at 
org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49)
        at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:69)
        at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at 
org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84)
        at java.lang.Thread.run(Thread.java:750) {code}
 
h1. Reproduce

It's non-deterministic and manifest for 0.15% of cases (6 out of 4000 run).
h1. Root Cause

 
{code:java}
public UUID getHostId(InetAddress endpoint, Map<InetAddress, EndpointState> 
epStates)
{
    return 
UUID.fromString(epStates.get(endpoint).getApplicationState(ApplicationState.HOST_ID).value);
} {code}
It happens at a {*}fresh start{*}, so my assumption is that StorageService is 
trying to get the epState of a specific endpoint before the state is created. 
But I am not sure how it's caused exactly by thread interleaving. I assume this 
problem also exists in 4.x version since the logic for 
StorageService.handleStateNormal is almost the same.

Consequences: This NPE won't prevent the cluster from starting up/executing 
commands. It's probably benign.

I have attached the logs of my cluster. (system.log contains logs from the seed 
node, and persistent.tar.gz contains logs from all three nodes and the 
configuration).

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to