[
https://issues.apache.org/jira/browse/CASSANDRA-13407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15959301#comment-15959301
]
Joel Knighton commented on CASSANDRA-13407:
-------------------------------------------
For posterity, this is the race that becomes possible once the Gossiper is
started, as far as I can tell.
In setup, we initialize a fake ring using Util.createInitialRing. This will
initialize the nodes in an unsafe manner and then inject the token states. If a
status check runs before the token state is set, the previously decommissioned
node will look like a fat client, since it won't have tokens and won't have a
DEAD_STATE. Since we aren't gossiping, we won't have heard from it for longer
than fatClientTimeout, so we'll remove it. If this races with the ss.onChange
in createInitialRing, we can remove the endpoint state while it is still being
processed, which causes the NPE in the reported trace. This race can be seen at
16:15:51,205 in the log linked from the test failure.
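To make the interleaving concrete, here is a minimal sketch of the sequence
described above. It is a simplified model, not Cassandra code: the class
FatClientRaceSketch, its methods, FakeEndpointState, and the timeout value are
all hypothetical stand-ins, and the bad ordering is replayed sequentially
rather than with real threads so that it is deterministic.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the race; none of these types are Cassandra's own.
class FatClientRaceSketch {
    // Hypothetical stand-in for a node's gossip state.
    static final class FakeEndpointState {
        final String hostId;
        final long lastHeardMillis;
        volatile boolean hasTokens; // false until the token state is injected
        volatile boolean dead;      // true once a DEAD_STATE-like status is set

        FakeEndpointState(String hostId, long lastHeardMillis) {
            this.hostId = hostId;
            this.lastHeardMillis = lastHeardMillis;
        }
    }

    // Assumed value, for illustration only.
    static final long FAT_CLIENT_TIMEOUT_MILLIS = 30_000L;

    private final Map<String, FakeEndpointState> endpointStates = new ConcurrentHashMap<>();

    // Step 1 of ring setup: register the node without tokens or status.
    void initializeUnsafe(String endpoint, String hostId, long nowMillis) {
        endpointStates.put(endpoint, new FakeEndpointState(hostId, nowMillis));
    }

    // Step 2 of ring setup: inject the token state.
    void injectTokens(String endpoint) {
        FakeEndpointState state = endpointStates.get(endpoint);
        if (state != null) {
            state.hasTokens = true;
        }
    }

    // Periodic status check: evicts anything that looks like a silent fat client.
    // Returns true if the endpoint was removed.
    boolean statusCheck(String endpoint, long nowMillis) {
        FakeEndpointState state = endpointStates.get(endpoint);
        if (state == null) {
            return false;
        }
        boolean looksLikeFatClient = !state.hasTokens && !state.dead;
        boolean silentTooLong = nowMillis - state.lastHeardMillis > FAT_CLIENT_TIMEOUT_MILLIS;
        if (looksLikeFatClient && silentTooLong) {
            endpointStates.remove(endpoint);
            return true;
        }
        return false;
    }

    // Unguarded lookup: throws NullPointerException if the state was evicted.
    String hostIdUnguarded(String endpoint) {
        return endpointStates.get(endpoint).hostId;
    }

    public static void main(String[] args) {
        FatClientRaceSketch ring = new FatClientRaceSketch();
        ring.initializeUnsafe("127.0.0.6", "host-6", 0L);
        // Bad interleaving: the status check wins the race with token injection.
        boolean evicted = ring.statusCheck("127.0.0.6", 60_000L);
        System.out.println("evicted before tokens were set: " + evicted);
        try {
            ring.hostIdUnguarded("127.0.0.6");
        } catch (NullPointerException e) {
            System.out.println("NullPointerException while processing the change");
        }
    }
}
```

If injectTokens runs before statusCheck, the node no longer looks like a fat
client and the lookup succeeds, which is why the failure is intermittent.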
We also need to remove SchemaLoader.loadSchema() as you did in the patch - this
is because it starts the Gossiper as well. This is fine; we don't appear to
need it.
The patch looks good. The race exists in theory on 2.1/2.2, but it appears to
only manifest on 3.0+. For that reason, I don't think it is worth committing to
2.1; let's commit to 2.2 and later, and run the test at least once on each
branch before committing.
> test failure at RemoveTest.testBadHostId
> ----------------------------------------
>
> Key: CASSANDRA-13407
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13407
> Project: Cassandra
> Issue Type: Bug
> Reporter: Alex Petrov
> Assignee: Alex Petrov
>
> Example trace:
> {code}
> java.lang.NullPointerException
> at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:881)
> at org.apache.cassandra.gms.Gossiper.getHostId(Gossiper.java:876)
> at org.apache.cassandra.service.StorageService.handleStateNormal(StorageService.java:2201)
> at org.apache.cassandra.service.StorageService.onChange(StorageService.java:1855)
> at org.apache.cassandra.Util.createInitialRing(Util.java:216)
> at org.apache.cassandra.service.RemoveTest.setup(RemoveTest.java:89)
> {code}
> [failure example|https://cassci.datastax.com/job/trunk_testall/1491/testReport/org.apache.cassandra.service/RemoveTest/testBadHostId/]
> [history|https://cassci.datastax.com/job/trunk_testall/lastCompletedBuild/testReport/org.apache.cassandra.service/RemoveTest/testBadHostId/history/]