[
https://issues.apache.org/jira/browse/CASSANDRA-13851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570116#comment-17570116
]
Sam Tunnicliffe commented on CASSANDRA-13851:
---------------------------------------------
The first thing I would say is that if you routinely shut down the entire
cluster, you're going to experience unavailability. It _shouldn't_ be necessary
to periodically/preemptively bounce nodes, but I appreciate that sometimes a
restart is the simplest solution. If that's the case, is there a reason why it
couldn't be done via a rolling bounce? That's something that many ops teams do
regularly when deploying configuration changes and the like.
An alternative to `cassandra.allow_unsafe_joins` would be to bump `RING_DELAY`.
As described in [this
comment|https://issues.apache.org/jira/browse/CASSANDRA-13851?focusedCommentId=16309845&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16309845],
after that period, seed nodes will exit the SR whether they receive a response
or not and so will be able to respond to the non-seed nodes, which remain in
the SR for `RING_DELAY * 2`.
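To make the timing concrete, here is a minimal sketch of the two windows described above. It is an illustrative model, not Cassandra code: the function names are hypothetical, and the 30 second default is my understanding of the default for `RING_DELAY` (set, as far as I know, via `-Dcassandra.ring_delay_ms`), so check it against your version.

```python
# Illustrative timing model only; none of these names exist in Cassandra.
# Assumption: RING_DELAY defaults to 30 seconds (cassandra.ring_delay_ms).
DEFAULT_RING_DELAY_MS = 30_000


def shadow_round_windows(ring_delay_ms: int = DEFAULT_RING_DELAY_MS) -> dict:
    """Return the latest time (ms after startup) at which each kind of node
    leaves the shadow round (SR), per the description above:
    seeds exit after RING_DELAY, non-seeds stay for RING_DELAY * 2."""
    if ring_delay_ms <= 0:
        raise ValueError("ring_delay_ms must be positive")
    return {
        "seed_exit_ms": ring_delay_ms,
        "non_seed_exit_ms": ring_delay_ms * 2,
    }


def non_seed_grace_ms(ring_delay_ms: int = DEFAULT_RING_DELAY_MS) -> int:
    """Time between seeds leaving the SR and non-seeds giving up, assuming
    all nodes start at the same moment. During this gap the seeds are able
    to answer the non-seeds' SR requests."""
    windows = shadow_round_windows(ring_delay_ms)
    return windows["non_seed_exit_ms"] - windows["seed_exit_ms"]


if __name__ == "__main__":
    for delay in (DEFAULT_RING_DELAY_MS, 60_000):
        print(delay, shadow_round_windows(delay), non_seed_grace_ms(delay))
```

In this simplified model the gap in which seeds can answer non-seeds is equal to `RING_DELAY` itself, so bumping `RING_DELAY` widens it proportionally. Real startups are staggered, which is why a larger value gives more headroom.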
Lastly, I'd have to respectfully disagree with you about the accuracy of the
error message. Although the node was able to send and receive messages with
some peer(s), it was unable to gossip with them due to constraints inherent in
the gossip contract, i.e. nodes in SR don't respond to peers' SR requests.
Perhaps we should expand on that though to include some further info on how to
remedy the situation (or at least where to look).
> Allow existing nodes to use all peers in shadow round
> -----------------------------------------------------
>
> Key: CASSANDRA-13851
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13851
> Project: Cassandra
> Issue Type: Bug
> Components: Local/Startup and Shutdown
> Reporter: Kurt Greaves
> Assignee: Kurt Greaves
> Priority: Normal
> Fix For: 3.11.3, 4.0-alpha1, 4.0
>
>
> In CASSANDRA-10134 we made collision checks necessary on every startup. A
> side-effect was introduced that then requires a node's seeds to be contacted
> on every startup. Prior to this change an existing node could start up
> regardless of whether it could contact a seed node or not (because
> checkForEndpointCollision() was only called for bootstrapping nodes).
> Now if a node's seeds are removed/deleted/fail it will no longer be able to
> start up until live seeds are configured (or it is itself made a seed), even
> though it already knows about the rest of the ring. This is inconvenient for
> operators and has the potential to cause some nasty surprises and increase
> downtime.
> One solution would be to use all of a node's existing peers as seeds in the
> shadow round. Not a Gossip guru though so not sure of implications.