[
https://issues.apache.org/jira/browse/CASSANDRA-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292325#comment-14292325
]
Russ Hatch commented on CASSANDRA-8072:
---------------------------------------
[~respringer] Can you help me better understand what's happening with the nodes
(via opscenter) when this occurs? From what we talked about before it sounds
like the nodes are started with little/no configuration and then are stopped,
configured, and started up again with seeds configured.
First startup:
- are any changes made to cassandra.yaml before this start?
- are nodes aware of each other as seeds?
- does the start happen serially or in parallel?
- is it possible to get sample config yaml for 2 nodes at this phase?
Stopping:
- are the nodes stopped serially or in parallel?
- are the nodes able to complete startup before being stopped, or could they
be getting interrupted during initial start?
- are the nodes stopped forcefully (like kill -9) or something nicer?
Starting again with configurations completed:
- are nodes started serially or in parallel? (from what I know this would be
parallel but just want to be sure)
- will this startup step wait for all nodes to be ready before launching in
parallel? (or could one node get a significant head start if it completes the
earlier steps first?)
- is it possible to get sample config yaml for 2 nodes at this point? (I
grabbed yaml from a failed repro attempt but want to be sure I didn't get
something wrong in that attempt)
Finally, when provisioning in this way how do the (ec2) nodes refer to one
another: public ip, private ip, or private dns?
Thanks! And sorry for all the questions. I'm trying to close in on the issue
and still having difficulty reproducing in a local container environment, so
I'm trying to figure out what could be unique about the provisioning of these
nodes that may account for triggering this issue.
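One way to compare how the nodes see each other (public ip, private ip, or private dns) is to check, from each node, whether every address in its seed list accepts a TCP connection on the storage port. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the default storage_port of 7000 is in use. A successful connection shows only that the port is reachable, not that a gossip exchange would succeed.

```python
import socket

# Assumption: 7000 is the default storage_port used for inter-node traffic;
# change it if cassandra.yaml overrides storage_port.
DEFAULT_STORAGE_PORT = 7000


def seed_reachable(host, port=DEFAULT_STORAGE_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_seeds(seeds, port=DEFAULT_STORAGE_PORT, timeout=2.0):
    """Map each address in a comma-separated seed list to its reachability.

    `seeds` is the string as it would appear in the seed list, so the check
    uses exactly the form (ip or dns name) the node itself would use.
    """
    results = {}
    for seed in seeds.split(","):
        seed = seed.strip()
        if seed:  # skip empty entries from stray commas
            results[seed] = seed_reachable(seed, port, timeout)
    return results
```

Calling `check_seeds("10.0.0.1,10.0.0.2")` on each node (the addresses here are placeholders) returns a dict of address to True/False, which makes it easy to spot a node that cannot reach a seed under the name it was configured with.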
> Exception during startup: Unable to gossip with any seeds
> ---------------------------------------------------------
>
> Key: CASSANDRA-8072
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8072
> Project: Cassandra
> Issue Type: Bug
> Reporter: Ryan Springer
> Assignee: Brandon Williams
> Attachments: casandra-system-log-with-assert-patch.log
>
>
> When Opscenter 4.1.4 or 5.0.1 tries to provision a 2-node DSC 2.0.10 cluster
> in either ec2 or locally, one of the nodes sometimes refuses to start C*.
> The error in /var/log/cassandra/system.log is:
> ERROR [main] 2014-10-06 15:54:52,292 CassandraDaemon.java (line 513) Exception encountered during startup
> java.lang.RuntimeException: Unable to gossip with any seeds
> at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1200)
> at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:444)
> at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:655)
> at org.apache.cassandra.service.StorageService.initServer(StorageService.java:609)
> at org.apache.cassandra.service.StorageService.initServer(StorageService.java:502)
> at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:378)
> at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
> at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585)
> INFO [StorageServiceShutdownHook] 2014-10-06 15:54:52,326 Gossiper.java (line 1279) Announcing shutdown
> INFO [StorageServiceShutdownHook] 2014-10-06 15:54:54,326 MessagingService.java (line 701) Waiting for messaging service to quiesce
> INFO [ACCEPT-localhost/127.0.0.1] 2014-10-06 15:54:54,327 MessagingService.java (line 941) MessagingService has terminated the accept() thread
> This error does not always occur when provisioning a 2-node cluster; it happens
> roughly half of the time, and on only one of the nodes. I haven't been
> able to reproduce this error with DSC 2.0.9, and there have been no code or
> definition file changes in Opscenter.
> I can reproduce locally with the above steps. I'm happy to test any proposed
> fixes since I'm the only person able to reproduce reliably so far.