[
https://issues.apache.org/jira/browse/CASSANDRA-20011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17891425#comment-17891425
]
Stefan Miklosovic edited comment on CASSANDRA-20011 at 10/21/24 7:26 AM:
-------------------------------------------------------------------------
[~mck] [~curlylrt] this is most probably a duplicate / variation of
CASSANDRA-18845 / CASSANDRA-18866
I welcome [~curlylrt] to create a formal patch for that on GitHub and further
verify / improve on that work there.
I think this is the culprit:
??When the new node joins the ring, it will try to wait for gossip to settle,
the issue we saw is that the gossip settled before it recognized the entire
cluster??
cc [~cam1982]
> Gossip settled too early for new joining node leading to data loss
> ------------------------------------------------------------------
>
> Key: CASSANDRA-20011
> URL: https://issues.apache.org/jira/browse/CASSANDRA-20011
> Project: Cassandra
> Issue Type: Bug
> Reporter: Runtian Liu
> Priority: Normal
>
> Recently we found an issue where gossip settles too early, leading to data
> loss on the joining node.
> The new node crashed a few times before it successfully joined the ring.
> When a new node joins the ring, it waits for gossip to settle; the issue we
> saw is that gossip settled before the node had recognized the entire
> cluster. This led to the new node requesting ranges from the wrong nodes,
> and the streaming phase ended without receiving any data because the target
> nodes were not the real owners of the requested token ranges.
> After checking the gossip settle code, I found that gossip may settle in
> 5 + 3 = 8 seconds if the size of the new node's local gossip state map is
> not changing. This may happen if the new node is busy with other gossip
> tasks and cannot populate its local gossip state map with all nodes.
> I propose fixing this by adding an environment variable for bootstrapping
> nodes so that we also check for a minimum number of known nodes before a
> node considers gossip settled. A PR will come later.
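
The settle timing described in the report, and the effect of the proposed
minimum-node check, can be illustrated with a small simulation. This is only
a sketch and not Cassandra code: the 5-second initial wait and the 3 required
stable polls come from the "5 + 3 = 8" arithmetic above, the 1-second poll
interval is an assumption inferred from it, and every name here
(seconds_until_settled, min_nodes, stalled_view) is hypothetical.

```python
from typing import Callable, Optional

MIN_WAIT_S = 5        # initial wait before polling (the "5" in the report)
POLL_INTERVAL_S = 1   # assumed gap between polls
POLLS_REQUIRED = 3    # consecutive unchanged polls needed (the "3")


def seconds_until_settled(
    size_at: Callable[[int], int],
    min_nodes: Optional[int] = None,
    max_polls: int = 60,
) -> Optional[int]:
    """Return the simulated second at which gossip is declared settled.

    size_at(t) gives the local gossip state map size at second t.
    min_nodes is the hypothetical extra check proposed in the ticket.
    Returns None if the state never settles within max_polls polls.
    """
    t = MIN_WAIT_S
    last_size = size_at(t)
    stable_polls = 0
    for _ in range(max_polls):
        t += POLL_INTERVAL_S
        size = size_at(t)
        unchanged = size == last_size
        enough = min_nodes is None or size >= min_nodes
        # A poll only counts if the map size is stable and, when the
        # extra check is enabled, large enough.
        stable_polls = stable_polls + 1 if (unchanged and enough) else 0
        last_size = size
        if stable_polls == POLLS_REQUIRED:
            return t
    return None


def stalled_view(t: int) -> int:
    """A busy node that only knows 2 endpoints until second 20, then all 10."""
    return 2 if t < 20 else 10


print(seconds_until_settled(stalled_view))                # 8
print(seconds_until_settled(stalled_view, min_nodes=10))  # 23
```

In this simulation the size-only check declares gossip settled at 8 seconds
while the node still knows 2 of 10 endpoints. With the minimum-node check it
holds off until the full view has been stable for three polls.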