[
https://issues.apache.org/jira/browse/CASSANDRA-1160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876489#action_12876489
]
Matthew F. Dennis commented on CASSANDRA-1160:
----------------------------------------------
I should have been more clear about that. The failures in
StorageServiceClientTest are because files from
DatabaseDescriptor.getAllDataFileLocations() exist (the test asserts they do
not). This is because when I moved start into <init> I passed the same
generation number that that initServer() uses, but initClient (used by the
test) passes a generation number based on currentTimeMillis() which doesn't
cause the file paths to exist. It wasn't clear to me there was an easy way to
determine what generation to use.
More importantly, this lead me to notice that both StorageService and
StorageLoadBalancer are registered with Gossiper before start is called. If
start was moved into <init> I was afraid of introducing new race conditions
(e.g. Gossiper starting up, doing things that would have resulted in
publications and then later receiving the registration for such publications).
I agree that Gossiper is likely not being spun up correctly. I think this
applies to other things as well (in general things seem to have many
interdependencies on boot. I don't have anything specific to point at, but it
feels like there are several race conditions that just happen to usually not be
exhibited).
One solution I've used in the past that I like for this in general is that none
of the singletons are inited in their own classes, but only by a
BootInitializer of sorts that completely controls when the objects are
inited/started/created/etc, in particular controlling the order things are
done. A single entry point into the system if you will. This seems like some
work from the point we're at now and probably a bit unnecessary.
Along those lines I had a similar patch that I didn't submit that took a third
argument, List<IEndpointStateChangeSubsciber> initialSubscribers. Each
initialSubscriber was registered before the existing code in start was called.
initServer built a list of the StorageService and StorageLoadBalance instances
and passed that to start to ensure none of the publications were missed by
either service. Like the 0001-cassandra-0.6-1160.patch references to .instance
were replaced with getInstance() calls which busy waited (via yield) until
start completed but this meant that StorageService was managing
StorageLoadBalancer registrations and that didn't feel right, hence
preRegistrations() (so each could manager their own registrations).
So, basically two issues/questions if Gossiper.start moves to Gossiper.<init>:
1) how does Gossiper.<init> get the correct generation number (server v client)?
2) what about missed publications because Gossiper was started before
StorageService and/or StorageLoadBalancer got a chance to register?
> race with insufficiently constructed Gossiper
> ---------------------------------------------
>
> Key: CASSANDRA-1160
> URL: https://issues.apache.org/jira/browse/CASSANDRA-1160
> Project: Cassandra
> Issue Type: Bug
> Reporter: Jonathan Ellis
> Assignee: Matthew F. Dennis
> Priority: Minor
> Fix For: 0.6.3
>
> Attachments: 0001-cassandra-0.6-1160.patch
>
>
> Gossiper.start needs to be integrated into the constructor. Currently you
> can have threads using the gossiper instance before start finishes (or even
> starts?), resulting in tracebacks like this:
> ERROR [GMFD:1] 2010-06-02 10:45:49,878 CassandraDaemon.java (line 78) Fatal
> exception in thread Thread[GMFD:1,5,main]
> java.lang.AssertionError
> at org.apache.cassandra.net.Header.<init>(Header.java:56)
> at org.apache.cassandra.net.Header.<init>(Header.java:74)
> at org.apache.cassandra.net.Message.<init>(Message.java:58)
> at
> org.apache.cassandra.gms.Gossiper.makeGossipDigestAckMessage(Gossiper.java:294)
> at
> org.apache.cassandra.gms.Gossiper$GossipDigestSynVerbHandler.doVerb(Gossiper.java:935)
> at
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:40)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
> ERROR [GMFD:2] 2010-06-02 10:45:49,880 CassandraDaemon.java (line 78) Fatal
> exception in thread Thread[GMFD:2,5,main]
> java.lang.AssertionError
> at org.apache.cassandra.net.Header.<init>(Header.java:56)
> at org.apache.cassandra.net.Header.<init>(Header.java:74)
> at org.apache.cassandra.net.Message.<init>(Message.java:58)
> at
> org.apache.cassandra.gms.Gossiper.makeGossipDigestAckMessage(Gossiper.java:294)
> at
> org.apache.cassandra.gms.Gossiper$GossipDigestSynVerbHandler.doVerb(Gossiper.java:935)
> at
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:40)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> at java.lang.Thread.run(Thread.java:619)
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.