[
https://issues.apache.org/jira/browse/CASSANDRA-19580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842799#comment-17842799
]
Cameron Zemek commented on CASSANDRA-19580:
-------------------------------------------
> Most of what you've described here are implementation details of how replace
> works, like how hibernate is handled, so I'm not sure if anything is wrong.
I do not follow what you mean by "not sure if anything is wrong". The problem
is that you cannot do the replacement if, for any reason, the node ends up in
hibernate state. It is then stuck forever on the 'Unable to contact any seeds!'
error: every attempt at replacement results in that error. This is a
long-running issue that has been seen many times over the years, but the cause
was never figured out.
I do not know what the correct solution is; there seem to be many possible
approaches to a fix. I am not aware of the reasons behind the current
implementation, so I cannot judge which approach would be preferred. For
example, I don't understand why responses to SYN do not include state for nodes
that are not in the digest list. Gossip has been like this for a long time, so
that seems a rather major thing to change. Another approach would be to no
longer use hibernate, i.e. CASSANDRA-12344.
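
To make the SYN/ACK behaviour above concrete, here is a minimal sketch of it.
This is a simplified model, not the actual Gossiper code: the class
GossipSynSketch, the method ackFor, the string-valued states and the addresses
are all hypothetical. It assumes the responder returns state only for endpoints
named in the SYN digest list, that it returns nothing for the sender's own
entry (the sender is taken to be at least as up to date about itself), and that
an empty digest list is answered with everything, as in the shadow round.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GossipSynSketch {
    /**
     * Hypothetical model of the states a seed puts in its ACK.
     * responderStates stands in for the seed's endpointStateMap.
     */
    static Map<String, String> ackFor(String sender,
                                      List<String> synDigests,
                                      Map<String, String> responderStates) {
        Map<String, String> ack = new TreeMap<>();
        if (synDigests.isEmpty()) {
            // Assumption: an empty digest list is treated like a shadow-round
            // SYN, so the responder replies with every state it knows.
            ack.putAll(responderStates);
            return ack;
        }
        for (String endpoint : synDigests) {
            if (endpoint.equals(sender)) {
                // Assumption: the sender already has the newest view of itself.
                continue;
            }
            String state = responderStates.get(endpoint);
            if (state != null) {
                ack.put(endpoint, state);
            }
        }
        return ack;
    }

    public static void main(String[] args) {
        Map<String, String> seedView = new TreeMap<>();
        seedView.put("10.0.0.1", "NORMAL");
        seedView.put("10.0.0.2", "NORMAL");
        seedView.put("10.0.0.3", "hibernate");

        // The replacing node lists only itself in its SYN.
        System.out.println(ackFor("10.0.0.3", Arrays.asList("10.0.0.3"), seedView));
        // The same node sending an empty digest list instead.
        System.out.println(ackFor("10.0.0.3", Collections.<String>emptyList(), seedView));
    }
}
```

Under these assumptions the first call yields an empty ACK, which is why the
node's endpointStateMap stays empty, while the second call returns every state
the seed holds.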
> Unable to contact any seeds with node in hibernate status
> ---------------------------------------------------------
>
> Key: CASSANDRA-19580
> URL: https://issues.apache.org/jira/browse/CASSANDRA-19580
> Project: Cassandra
> Issue Type: Bug
> Reporter: Cameron Zemek
> Priority: Normal
>
> We have a customer running into the error 'Unable to contact any seeds!'. I
> have been able to reproduce this issue by killing Cassandra as it is joining,
> which puts the node into hibernate status. Once a node is in hibernate, it
> no longer receives any SYN messages from other nodes during startup, and
> because it sends only itself as a digest in outbound SYN messages, it never
> receives any states in any of the ACK replies. So once it gets to the
> `seenAnySeed` check, it fails because the endpointStateMap is empty.
>
> A workaround is copying the system.peers table from another node, but this is
> less than ideal. I tested modifying maybeGossipToSeed as follows:
> {code:java}
> /* Possibly gossip to a seed for facilitating partition healing */
> private void maybeGossipToSeed(MessageOut<GossipDigestSyn> prod)
> {
>     int size = seeds.size();
>     if (size > 0)
>     {
>         if (size == 1 && seeds.contains(FBUtilities.getBroadcastAddress()))
>         {
>             return;
>         }
>         if (liveEndpoints.size() == 0)
>         {
>             List<GossipDigest> gDigests = prod.payload.gDigests;
>             if (gDigests.size() == 1 &&
>                 gDigests.get(0).endpoint.equals(FBUtilities.getBroadcastAddress()))
>             {
>                 gDigests = new ArrayList<GossipDigest>();
>                 GossipDigestSyn digestSynMessage =
>                     new GossipDigestSyn(DatabaseDescriptor.getClusterName(),
>                                         DatabaseDescriptor.getPartitionerName(),
>                                         gDigests);
>                 MessageOut<GossipDigestSyn> message =
>                     new MessageOut<GossipDigestSyn>(MessagingService.Verb.GOSSIP_DIGEST_SYN,
>                                                     digestSynMessage,
>                                                     GossipDigestSyn.serializer);
>                 sendGossip(message, seeds);
>             }
>             else
>             {
>                 sendGossip(prod, seeds);
>             }
>         }
>         else
>         {
>             /* Gossip with the seed with some probability. */
>             double probability = seeds.size() / (double) (liveEndpoints.size() + unreachableEndpoints.size());
>             double randDbl = random.nextDouble();
>             if (randDbl <= probability)
>                 sendGossip(prod, seeds);
>         }
>     }
> }
> {code}
> The only problem is that this is the same as the SYN from the shadow round. It
> does resolve the issue, however, as the node then receives an ACK with all the
> states.