[ 
https://issues.apache.org/jira/browse/CASSANDRA-8072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295134#comment-14295134
 ] 

Ryan Springer commented on CASSANDRA-8072:
------------------------------------------

No problem with all the questions.  The more information we have on this issue, 
the better.

First Startup:

- The DSC deb/rpm packages are installed by the agent.  The deb/rpm install 
scripts automatically start DSC as soon as the package is installed.
- No changes are made to cassandra.yaml before this initial start from the 
packaged scripts.
- Initially the nodes are not aware of each other as seeds, because the 
cassandra.yaml being used is the one from the package.
- The initial install is made in parallel in batches of 20 nodes at a time 
(configurable with the Opscenter install_throttle parameter).  However, I am 
seeing the problem with just 2 nodes in the cluster, so I don't think the 
throttle is involved.
- I will do a run of 2 nodes and post the cassandra.yaml files.
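
To illustrate why the throttle should not matter for a 2-node run, here is a 
minimal sketch of batching by a throttle value.  The function name and node 
names are hypothetical, and this is not Opscenter's actual code; it only 
assumes that install_throttle caps how many nodes are installed at once.

```python
from typing import Iterator, List, Sequence


def install_batches(nodes: Sequence[str], throttle: int = 20) -> Iterator[List[str]]:
    """Yield successive groups of at most `throttle` nodes (hypothetical helper)."""
    if throttle < 1:
        raise ValueError("throttle must be at least 1")
    for start in range(0, len(nodes), throttle):
        yield list(nodes[start:start + throttle])


# Hypothetical node names.  With only 2 nodes, everything lands in a single
# batch, so both installs run concurrently whatever the throttle value is.
two_node_batches = list(install_batches(["node1", "node2"]))
```

Under that assumption, any throttle of 2 or more gives the same single batch 
for a 2-node cluster.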

Stopping:

- The nodes are stopped in parallel.
- It looks as though Opscenter waits for the "apt-get install" or equivalent 
rpm command to return from the DSC package installation and then Opscenter 
considers the node to be initially started.  Once the package install commands 
have finished for all nodes, then Opscenter begins to stop all of the DSC 
instances.  If the package install command returns before DSC is completely 
initialized, that could be related to this issue.
- The nodes are stopped with: pkill -f CassandraDaemon
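
One way to test the theory that the package install returns before DSC is 
fully initialized would be to poll a node's storage port before stopping it.  
The sketch below is an assumption on my part, not something Opscenter does 
today: wait_for_port is a hypothetical helper, and 7000 is only the default 
storage_port in cassandra.yaml.  An open port also does not prove that 
startup has completed, so treat this as a rough probe.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0,
                  interval: float = 1.0) -> bool:
    """Return True once a TCP connect to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            pass  # refused or timed out; retry until the deadline passes
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, calling wait_for_port("127.0.0.1", 7000) before the pkill would 
show whether the node was listening at all when it was stopped.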

Starting again:

- The DSC nodes are restarted serially, with the seed nodes being started 
before non-seed nodes.  The seeds are first sorted by string comparison and 
then started one at a time in that order.
- Opscenter will wait for all DSC instances to have been started, then it will 
restart the agents, wait for them to reconnect to Opscenter, and then Opscenter 
considers the provisioning to be finished.
- I will grab 2 cassandra.yaml configs for this stage as well.
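
The restart ordering described above can be sketched as follows.  This is a 
simplified illustration, not the Opscenter code: start_order is a 
hypothetical name, the addresses are made up, and keeping non-seed nodes in 
their input order is an assumption.  Note that string comparison does not 
order IP addresses numerically.

```python
from typing import Iterable, List


def start_order(nodes: Iterable[str], seeds: Iterable[str]) -> List[str]:
    """Return nodes in start order: sorted seeds first, then non-seeds."""
    seed_set = set(seeds)
    # Plain string comparison, so "10.0.0.10" sorts before "10.0.0.2".
    seeds_first = sorted(seed_set)
    # Assumption: non-seed nodes keep the order in which they were given.
    non_seeds = [n for n in nodes if n not in seed_set]
    return seeds_first + non_seeds


# Made-up addresses for illustration only.
order = start_order(
    ["10.0.0.2", "10.0.0.30", "10.0.0.10"],
    seeds=["10.0.0.2", "10.0.0.10"],
)
```

Here the seed 10.0.0.10 is started before 10.0.0.2, and the non-seed 
10.0.0.30 comes last.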

From my reading of the code, I believe the ec2 nodes will refer to each other 
using public IPs, but I will verify from a real run.

> Exception during startup: Unable to gossip with any seeds
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-8072
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8072
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Ryan Springer
>            Assignee: Brandon Williams
>         Attachments: casandra-system-log-with-assert-patch.log
>
>
> When Opscenter 4.1.4 or 5.0.1 tries to provision a 2-node DSC 2.0.10 cluster 
> in either ec2 or locally, an error occurs sometimes with one of the nodes 
> refusing to start C*.  The error in the /var/log/cassandra/system.log is:
> ERROR [main] 2014-10-06 15:54:52,292 CassandraDaemon.java (line 513) 
> Exception encountered during startup
> java.lang.RuntimeException: Unable to gossip with any seeds
>         at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1200)
>         at 
> org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:444)
>         at 
> org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:655)
>         at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:609)
>         at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:502)
>         at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:378)
>         at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
>         at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585)
>  INFO [StorageServiceShutdownHook] 2014-10-06 15:54:52,326 Gossiper.java 
> (line 1279) Announcing shutdown
>  INFO [StorageServiceShutdownHook] 2014-10-06 15:54:54,326 
> MessagingService.java (line 701) Waiting for messaging service to quiesce
>  INFO [ACCEPT-localhost/127.0.0.1] 2014-10-06 15:54:54,327 
> MessagingService.java (line 941) MessagingService has terminated the accept() 
> thread
> This error does not always occur when provisioning a 2-node cluster, but 
> probably around half of the time on only one of the nodes.  I haven't been 
> able to reproduce this error with DSC 2.0.9, and there have been no code or 
> definition file changes in Opscenter.
> I can reproduce locally with the above steps.  I'm happy to test any proposed 
> fixes since I'm the only person able to reproduce reliably so far.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
