[
https://issues.apache.org/jira/browse/CASSANDRA-13851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570067#comment-17570067
]
Daniel Cranford commented on CASSANDRA-13851:
---------------------------------------------
[~samt], sorry, I misspoke. I appreciate the material improvement in behavior
this ticket has provided. What I intended to say was
{quote}A node will not start unless it can contact a seed node *or* another
node not also performing the shadow round{quote}
Background: my operations guys routinely perform a full cluster bounce to
ensure everything is starting from a clean state. Up until Cassandra 3.6 this
worked fine. Unfortunately, due to the details of our hardware, sometimes nodes
take longer to come up than usual (e.g. 5 minutes instead of 30 seconds). If the
slow nodes happen to be the seed node/nodes, it is game over - the cluster will
not start.
The only way my ops guys were able to figure out how to resolve this was to
give me the stack trace of the error, which I had to correlate with the source
code and use `git blame` to find CASSANDRA-10134 and this ticket. I would not
consider a bug tracker to be appropriate documentation for the semantics of a
seed node, especially when the public docs state
{quote}The ring can operate or boot without a seed; however, you will not be
able to add new nodes to the cluster.{quote}
My ops guys have worked around this behavior by begrudgingly setting
`cassandra.allow_unsafe_joins=true` - an undocumented workaround I found by
inspecting the source code. After we upgraded from 3.9 to 3.11, I was eager to
see if this ticket allowed us to remove the workaround. Unfortunately it does
not: a full cluster bounce will still fail, because only seed nodes and nodes
not themselves in the shadow round can release a node from the shadow round.
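To make the failure mode concrete, here is a toy model of the rule as I understand it. It is only a sketch: the class, the three node states, and the method name are all hypothetical and are not taken from the Cassandra source. It assumes a seed either is down or has finished starting, since a seed does not fail its own shadow round.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of the startup rule described above. Hypothetical names; not Cassandra code. */
final class ShadowRoundSketch {

    /** Simplified node states assumed for this sketch. */
    enum State { DOWN, IN_SHADOW_ROUND, RUNNING }

    static final class Node {
        final String name;
        final boolean isSeed;
        State state;

        Node(String name, boolean isSeed, State state) {
            this.name = name;
            this.isSeed = isSeed;
            this.state = state;
        }
    }

    /**
     * Returns true if some other node can release {@code starting} from its
     * shadow round. A RUNNING node is either a seed that is up or a peer that
     * is past its own shadow round; in this model both may release.
     */
    static boolean canLeaveShadowRound(Node starting, List<Node> cluster) {
        for (Node peer : cluster) {
            if (peer == starting) {
                continue; // a node cannot release itself
            }
            if (peer.state == State.RUNNING) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Node seed = new Node("seed", true, State.DOWN);
        Node a = new Node("a", false, State.IN_SHADOW_ROUND);
        Node b = new Node("b", false, State.IN_SHADOW_ROUND);
        List<Node> cluster = Arrays.asList(seed, a, b);

        // Full cluster bounce with a slow seed: a and b can only see each other.
        System.out.println("a released while seed is down: " + canLeaveShadowRound(a, cluster));

        // Once the seed has finished starting, a can be released.
        seed.state = State.RUNNING;
        System.out.println("a released after seed is up: " + canLeaveShadowRound(a, cluster));
    }
}
```

In this model the two non-seed nodes are alive and can reach each other, yet neither can be released until the seed is up, which is the situation my ops guys hit during a full bounce.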
If anything, the error message in this version is worse, since it is now
incorrect.
{code:java}
if (!isSeed)
    throw new RuntimeException("Unable to gossip with any peers");
{code}
In reality, the node was unable to gossip with any seeds or with any peers not
themselves in the shadow round. Peers may be alive but themselves trapped in
the shadow round.
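As an illustration only, a message that matches the actual condition might look something like the sketch below. The class name, the method, its parameters, and the wording are all hypothetical; this is not a proposed patch and not what Cassandra emits.

```java
/** Hypothetical sketch of a more accurate failure message; not Cassandra code. */
final class ShadowRoundMessageSketch {

    /**
     * Fails only when a non-seed node was not released from its shadow round.
     * Both parameters are assumptions of this sketch.
     */
    static void failIfStuck(boolean isSeed, boolean released) {
        if (!released && !isSeed) {
            throw new RuntimeException(
                "Unable to gossip with any seeds, or with any peers that have "
                + "completed their own shadow round");
        }
    }

    public static void main(String[] args) {
        failIfStuck(true, false);  // a seed proceeds even if nobody answered
        failIfStuck(false, true);  // a released non-seed proceeds
        try {
            failIfStuck(false, false);
        } catch (RuntimeException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Wording along these lines would at least tell an operator that live peers stuck in their own shadow round do not count.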
> Allow existing nodes to use all peers in shadow round
> -----------------------------------------------------
>
> Key: CASSANDRA-13851
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13851
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Startup and Shutdown
> Reporter: Kurt Greaves
> Assignee: Kurt Greaves
> Priority: Normal
> Fix For: 3.11.3, 4.0-alpha1, 4.0
>
>
> In CASSANDRA-10134 we made collision checks necessary on every startup. A
> side-effect was introduced that then requires a node's seeds to be contacted
> on every startup. Prior to this change an existing node could start up
> regardless of whether it could contact a seed node or not (because
> checkForEndpointCollision() was only called for bootstrapping nodes).
> Now if a node's seeds are removed/deleted/fail it will no longer be able to
> start up until live seeds are configured (or itself is made a seed), even
> though it already knows about the rest of the ring. This is inconvenient for
> operators and has the potential to cause some nasty surprises and increase
> downtime.
> One solution would be to use all of a node's existing peers as seeds in the
> shadow round. Not a Gossip guru though so not sure of implications.