[ 
https://issues.apache.org/jira/browse/CASSANDRA-13851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570067#comment-17570067
 ] 

Daniel Cranford commented on CASSANDRA-13851:
---------------------------------------------

[~samt], sorry, I misspoke. I appreciate the material improvement in behavior 
this ticket has provided. What I intended to say was
{quote}A node will not start unless it can contact a seed node *or* another 
node not also performing the shadow round{quote}

Background: my operations guys routinely perform a full cluster bounce to 
ensure everything is starting from a clean state. Up until Cassandra 3.6 this 
worked fine. Unfortunately, due to the details of our hardware, sometimes nodes 
take longer to come up than usual (e.g. 5 minutes instead of 30 seconds). If the 
slow nodes happen to be the seed node/nodes, it is game over - the cluster will 
not start.

The only way my ops guys were able to figure out how to resolve this was to 
give me the stack trace of the error, which I had to correlate with the source 
code and use `git blame` to find CASSANDRA-10134 and this ticket. I would not 
consider a bug tracker to be appropriate documentation for the semantics of a 
seed node, especially when the public docs state
{quote}The ring can operate or boot without a seed; however, you will not be 
able to add new nodes to the cluster.{quote}

My ops guys have worked around this behavior by begrudgingly setting 
`cassandra.allow_unsafe_joins=true` - an undocumented workaround I found by 
inspecting the source code. After we upgraded from 3.9 to 3.11, I was eager to 
see if this ticket allowed us to remove the workaround. Unfortunately it does 
not: a full cluster bounce will still fail, because only seed nodes and 
nodes not themselves in the shadow round can release a node from the shadow 
round.
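
To make the failure mode concrete, here is a minimal sketch of the release rule as I understand it. This is a simplified model, not Cassandra's gossip code: the `ShadowRoundModel` class, the `Node` type and `canLeaveShadowRound` are hypothetical names invented for illustration, and the model assumes a reachable seed can always release a node.

```java
import java.util.List;

// Simplified, hypothetical model of the release rule described above.
// It is NOT Cassandra's gossip code; all names here are illustrative.
final class ShadowRoundModel {
    static final class Node {
        final String name;
        final boolean isSeed;
        final boolean isUp;           // process is reachable over the network
        final boolean inShadowRound;  // still waiting to be released itself

        Node(String name, boolean isSeed, boolean isUp, boolean inShadowRound) {
            this.name = name;
            this.isSeed = isSeed;
            this.isUp = isUp;
            this.inShadowRound = inShadowRound;
        }
    }

    // A starting non-seed node is released only by a reachable seed, or by a
    // reachable peer that is not itself stuck in the shadow round.
    static boolean canLeaveShadowRound(List<Node> peers) {
        for (Node peer : peers) {
            if (!peer.isUp) {
                continue; // unreachable peers cannot answer at all
            }
            if (peer.isSeed || !peer.inShadowRound) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Full cluster bounce: the seed is slow to boot, every other peer is
        // alive but waiting in its own shadow round.
        List<Node> fullBounce = List.of(
            new Node("seed1", true, false, false),
            new Node("peer1", false, true, true),
            new Node("peer2", false, true, true));
        System.out.println(canLeaveShadowRound(fullBounce)); // prints "false"

        // Single-node restart: the seed is down, but one peer is fully started.
        List<Node> singleRestart = List.of(
            new Node("seed1", true, false, false),
            new Node("peer1", false, true, false));
        System.out.println(canLeaveShadowRound(singleRestart)); // prints "true"
    }
}
```

The first case is our full cluster bounce: every peer is alive, yet none of them can release the others, so nobody starts until a seed comes up.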

If anything, the error message in this version is worse, since it is now 
incorrect. 
{code:java}
if (!isSeed)
    throw new RuntimeException("Unable to gossip with any peers");
{code}

Actually, the node was unable to gossip with either any seeds or any peers not 
themselves in the shadow round. Peers may be alive but themselves trapped in 
the shadow round.
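
As a rough sketch only, a message along the following lines would describe the condition more accurately. The class name, method name and `receivedUsableReply` parameter are hypothetical and do not correspond to the actual code in the patch; only the `isSeed` check mirrors the snippet quoted above.

```java
// Hypothetical sketch of a more precise failure message; names other than
// isSeed are invented for illustration and are not taken from Cassandra.
final class ShadowRoundCheck {
    static final String MESSAGE =
        "Unable to gossip with any seeds, or with any peers that are not "
        + "themselves still in the shadow round";

    // receivedUsableReply: assumed flag meaning that a seed, or a peer that
    // is past its own shadow round, answered during this node's shadow round.
    static void failIfStuck(boolean isSeed, boolean receivedUsableReply) {
        if (!isSeed && !receivedUsableReply) {
            throw new RuntimeException(MESSAGE);
        }
    }

    public static void main(String[] args) {
        failIfStuck(true, false);   // seeds are allowed to start regardless
        failIfStuck(false, true);   // released by a seed or a started peer
        try {
            failIfStuck(false, false);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Wording like this would at least have pointed my ops guys at the shadow round rather than at network connectivity.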

> Allow existing nodes to use all peers in shadow round
> -----------------------------------------------------
>
>                 Key: CASSANDRA-13851
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13851
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/Startup and Shutdown
>            Reporter: Kurt Greaves
>            Assignee: Kurt Greaves
>            Priority: Normal
>             Fix For: 3.11.3, 4.0-alpha1, 4.0
>
>
> In CASSANDRA-10134 we made collision checks necessary on every startup. A 
> side-effect was introduced that then requires a node's seeds to be contacted 
> on every startup. Prior to this change an existing node could start up 
> regardless of whether it could contact a seed node or not (because 
> checkForEndpointCollision() was only called for bootstrapping nodes). 
> Now if a node's seeds are removed/deleted/fail it will no longer be able to 
> start up until live seeds are configured (or it is itself made a seed), even 
> though it already knows about the rest of the ring. This is inconvenient for 
> operators and has the potential to cause some nasty surprises and increase 
> downtime.
> One solution would be to use all of a node's existing peers as seeds in the 
> shadow round. Not a Gossip guru though so not sure of implications.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
