[
https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17930471#comment-17930471
]
Brandon Williams commented on CASSANDRA-19580:
----------------------------------------------
bq. The problem is you can't do the replacement if for any reason the node ends
up in hibernate state. It is forever stuck in 'Unable to contact any seeds!'
error, every attempt at replacement results in that error.
You are correct, the node being in the hibernate state is the crux of the
problem here.
bq. I do not know what the correct solution is to this, there seems to be many
possible approaches to fix.
With gossip, there are always a few possibilities. The trick is finding the
least invasive/impactful one so some other edge case isn't broken :)
bq. I don't understand why responses to SYN do not include state for nodes
that are not in the digest list
That is part of how the implementation is optimized to save bandwidth (whether
or not that is a good idea). I wonder if CASSANDRA-19983 had any effect on
this ticket? As you noted, though, it has been this way for a long time, so I
would be hesitant to change it and risk breaking some other case, or finding
out someone actually needed that bandwidth savings.
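To make the digest-driven behaviour concrete, here is a minimal sketch and not the actual Gossiper code. It assumes a toy model in which each endpoint has a single integer version (generations and the request half of the ACK are ignored), and the names DigestAckSketch, Digest and buildAck are hypothetical. It only shows that a responder which walks the digest list has nothing to say about endpoints the sender did not list.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DigestAckSketch
{
    /* Hypothetical digest: an endpoint plus the highest version the sender holds for it. */
    static final class Digest
    {
        final String endpoint;
        final int maxVersion;

        Digest(String endpoint, int maxVersion)
        {
            this.endpoint = endpoint;
            this.maxVersion = maxVersion;
        }
    }

    /*
     * Toy ACK builder: only endpoints named in the digest list are examined,
     * and state is returned only where the responder holds a newer version.
     */
    static Map<String, Integer> buildAck(Map<String, Integer> responderVersions, List<Digest> digests)
    {
        Map<String, Integer> ack = new TreeMap<>();
        for (Digest digest : digests)
        {
            Integer local = responderVersions.get(digest.endpoint);
            if (local != null && local > digest.maxVersion)
                ack.put(digest.endpoint, local);
        }
        return ack;
    }

    public static void main(String[] args)
    {
        /* What a seed knows (made-up addresses and versions). */
        Map<String, Integer> seedView = new HashMap<>();
        seedView.put("10.0.0.1", 12);
        seedView.put("10.0.0.2", 7);
        seedView.put("10.0.0.9", 3);

        /* The starting node lists only itself, so nothing about other nodes comes back. */
        List<Digest> selfOnly = Collections.singletonList(new Digest("10.0.0.9", 5));
        System.out.println(buildAck(seedView, selfOnly)); // prints {}

        /* Listing another endpoint with an older version does return its state. */
        List<Digest> withPeer = Collections.singletonList(new Digest("10.0.0.2", 0));
        System.out.println(buildAck(seedView, withPeer)); // prints {10.0.0.2=7}
    }
}
```

In this model the bandwidth saving and the failure mode are the same mechanism: the ACK is bounded by the digest list, so a sender that only knows itself learns nothing new.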
bq. Another approach would be to no longer use hibernate, ie. CASSANDRA-12344
That could be an idea worth exploring.
> Unable to contact any seeds with node in hibernate status
> ---------------------------------------------------------
>
> Key: CASSANDRA-19580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19580
> Project: Apache Cassandra
> Issue Type: Bug
> Components: Cluster/Gossip
> Reporter: Cameron Zemek
> Priority: Normal
> Fix For: 4.0.x, 4.1.x, 5.0.x, 5.x
>
>
> We have a customer running into the error 'Unable to contact any seeds!'. I
> have been able to reproduce this issue by killing Cassandra while it is
> joining, which puts the node into hibernate status. Once a node is in
> hibernate, it no longer receives any SYN messages from other nodes during
> startup, and because it sends only itself as a digest in outbound SYN
> messages, it never receives any states in the ACK replies. So once it gets to
> the `seenAnySeed` check, it fails because the endpointStateMap is empty.
>
> A workaround is to copy the system.peers table from another node, but this is
> less than ideal. I tested modifying maybeGossipToSeed as follows:
> {code:java}
> /* Possibly gossip to a seed for facilitating partition healing */
> private void maybeGossipToSeed(MessageOut<GossipDigestSyn> prod)
> {
>     int size = seeds.size();
>     if (size > 0)
>     {
>         if (size == 1 && seeds.contains(FBUtilities.getBroadcastAddress()))
>         {
>             return;
>         }
>         if (liveEndpoints.size() == 0)
>         {
>             List<GossipDigest> gDigests = prod.payload.gDigests;
>             if (gDigests.size() == 1 &&
>                 gDigests.get(0).endpoint.equals(FBUtilities.getBroadcastAddress()))
>             {
>                 gDigests = new ArrayList<GossipDigest>();
>                 GossipDigestSyn digestSynMessage =
>                     new GossipDigestSyn(DatabaseDescriptor.getClusterName(),
>                                         DatabaseDescriptor.getPartitionerName(),
>                                         gDigests);
>                 MessageOut<GossipDigestSyn> message =
>                     new MessageOut<GossipDigestSyn>(MessagingService.Verb.GOSSIP_DIGEST_SYN,
>                                                     digestSynMessage,
>                                                     GossipDigestSyn.serializer);
>                 sendGossip(message, seeds);
>             }
>             else
>             {
>                 sendGossip(prod, seeds);
>             }
>         }
>         else
>         {
>             /* Gossip with the seed with some probability. */
>             double probability = seeds.size() / (double) (liveEndpoints.size() + unreachableEndpoints.size());
>             double randDbl = random.nextDouble();
>             if (randDbl <= probability)
>                 sendGossip(prod, seeds);
>         }
>     }
> }
> {code}
> The only problem is that this is the same as the SYN from the shadow round.
> It does resolve the issue, however, as the node then receives an ACK with all
> the states.
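The `seenAnySeed` check mentioned in the description can be sketched roughly as below. This is an illustration under stated assumptions, not the code from the Cassandra source tree: the class SeenAnySeedSketch, the string-keyed map and the addresses are all hypothetical, and the real check may consider more than the map key. It only shows why an empty endpointStateMap fails the check while a map holding at least one seed passes it.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class SeenAnySeedSketch
{
    /*
     * Hypothetical stand-in for the check: true if at least one endpoint
     * we hold state for is a configured seed.
     */
    static boolean seenAnySeed(Set<String> seeds, Map<String, String> endpointStateMap)
    {
        for (String endpoint : endpointStateMap.keySet())
        {
            if (seeds.contains(endpoint))
                return true;
        }
        return false;
    }

    public static void main(String[] args)
    {
        Set<String> seeds = new HashSet<>(Arrays.asList("10.0.0.1", "10.0.0.2"));
        Map<String, String> endpointStateMap = new HashMap<>();

        /* No state was ever received, so the check fails. */
        System.out.println(seenAnySeed(seeds, endpointStateMap)); // prints false

        /* After an ACK carrying a seed's state, the check passes. */
        endpointStateMap.put("10.0.0.1", "NORMAL");
        System.out.println(seenAnySeed(seeds, endpointStateMap)); // prints true
    }
}
```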