[
https://issues.apache.org/jira/browse/CASSANDRA-10089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964380#comment-14964380
]
Stefania commented on CASSANDRA-10089:
--------------------------------------
We managed to reproduce the issue of status NORMAL arriving without tokens
again, with
[this failed
test|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-10089-2.2-dtest/lastCompletedBuild/testReport/consistency_test/TestConsistency/short_read_reversed_test/]
and TRACE-level logging for Gossiper. I've replaced the files attached to this
ticket with the log files from this latest test. The ERROR occurs on node 1
because it receives status NORMAL but no tokens for node 2 from node 2, at
around 09:31:16,086.
The problem is the high-scale-lib {{NonBlockingHashMap}} used in
{{EndpointState}}. Even though we are careful to add the tokens before the
status, the gossip thread sometimes sees status NORMAL but no tokens. I've
reproduced this several times on my machine with [this unit
test|https://github.com/stef1927/cassandra/commit/275564fa568f47bb136c13e38ad918c4c4fcb944#diff-9c186d237f8b9eda310c20fc4a8c314bR41].
I'm not sure whether it's OK to replace {{NonBlockingHashMap}} with
{{ConcurrentHashMap}}, since this would have a performance impact.
Alternatively, we could check whether there is a later version of
{{NonBlockingHashMap}}, or a different thread-safe hash map implementation,
that guarantees that if we see a value when iterating, then we also see all
values inserted or modified before it. cc [~brandon.williams] for his
knowledge of Gossip and [~benedict] for his knowledge of hash map
implementations.
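As a rough sketch of the guarantee described above (a reader that sees a value also sees everything published before it), one possible approach is to keep the state in an immutable map and swap it in atomically on each update. This is only an illustration under stated assumptions: {{EndpointStateSketch}}, the {{Key}} enum and the method names are hypothetical stand-ins, not Cassandra's actual classes, and the performance cost of copying on every write has not been measured.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch, not Cassandra code: copy-on-write state holder.
final class EndpointStateSketch {
    // Hypothetical stand-in for the application state keys; only two are shown.
    enum Key { TOKENS, STATUS }

    // Readers always get a complete, immutable snapshot of the state.
    private final AtomicReference<Map<Key, String>> ref =
        new AtomicReference<>(Collections.unmodifiableMap(new EnumMap<Key, String>(Key.class)));

    // Publish all updates in one atomic step, retrying if another writer got in first.
    void addAll(Map<Key, String> updates) {
        while (true) {
            Map<Key, String> current = ref.get();
            Map<Key, String> copy = new EnumMap<>(Key.class);
            copy.putAll(current);
            copy.putAll(updates);
            if (ref.compareAndSet(current, Collections.unmodifiableMap(copy)))
                return;
        }
    }

    // A snapshot never changes after it is returned, so iterating it is safe.
    Map<Key, String> snapshot() {
        return ref.get();
    }

    public static void main(String[] args) {
        EndpointStateSketch state = new EndpointStateSketch();
        Map<Key, String> updates = new EnumMap<>(Key.class);
        updates.put(Key.TOKENS, "token-1");
        updates.put(Key.STATUS, "NORMAL");
        state.addAll(updates);

        // A reader that sees STATUS in a snapshot also sees TOKENS in the same snapshot.
        Map<Key, String> seen = state.snapshot();
        System.out.println(seen.containsKey(Key.STATUS) && seen.containsKey(Key.TOKENS));
    }
}
```

With this shape, tokens and status added in the same {{addAll}} call become visible together, and a reader iterating a snapshot cannot observe a half-applied update. Whether the extra allocation per update is acceptable for gossip would still need to be checked.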
> NullPointerException in Gossip handleStateNormal
> ------------------------------------------------
>
> Key: CASSANDRA-10089
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10089
> Project: Cassandra
> Issue Type: Bug
> Reporter: Stefania
> Assignee: Stefania
> Fix For: 2.1.x, 2.2.x, 3.0.x
>
> Attachments: node1_debug.log, node2_debug.log, node3_debug.log
>
>
> Whilst comparing dtests for CASSANDRA-9970 I found [this failing
> dtest|http://cassci.datastax.com/view/Dev/view/blerer/job/blerer-9970-dtest/lastCompletedBuild/testReport/consistency_test/TestConsistency/short_read_test/]
> in 2.2:
> {code}
> Unexpected error in node1 node log: ['ERROR [GossipStage:1] 2015-08-14
> 15:39:57,873 CassandraDaemon.java:183 - Exception in thread
> Thread[GossipStage:1,5,main] java.lang.NullPointerException: null \tat
> org.apache.cassandra.service.StorageService.getApplicationStateValue(StorageService.java:1731)
> ~[main/:na] \tat
> org.apache.cassandra.service.StorageService.getTokensFor(StorageService.java:1804)
> ~[main/:na] \tat
> org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:1857)
> ~[main/:na] \tat
> org.apache.cassandra.service.StorageService.onChange(StorageService.java:1629)
> ~[main/:na] \tat
> org.apache.cassandra.service.StorageService.onJoin(StorageService.java:2312)
> ~[main/:na] \tat
> org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1025)
> ~[main/:na] \tat
> org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1106)
> ~[main/:na] \tat
> org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49)
> ~[main/:na] \tat
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66)
> ~[main/:na] \tat
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> ~[na:1.7.0_80] \tat
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> ~[na:1.7.0_80] \tat java.lang.Thread.run(Thread.java:745) ~[na:1.7.0_80]']
> {code}
> I wasn't able to find it on unpatched branches, but it is clearly not related
> to CASSANDRA-9970; if anything, it could be a side effect of
> CASSANDRA-9871.