[
https://issues.apache.org/jira/browse/CASSANDRA-17842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Josh McKenzie updated CASSANDRA-17842:
--------------------------------------
Status: Ready to Commit (was: Review In Progress)
> Add the ability for operators to allow intentional loosening of definition of
> "empty" in Gossip for specific edge case failure scenarios
> ----------------------------------------------------------------------------------------------------------------------------------------
>
> Key: CASSANDRA-17842
> URL: https://issues.apache.org/jira/browse/CASSANDRA-17842
> Project: Cassandra
> Issue Type: Improvement
> Components: Cluster/Gossip
> Reporter: Josh McKenzie
> Assignee: Josh McKenzie
> Priority: Normal
> Fix For: 4.x
>
>
> Right now {{empty}} is very specific to a single edge case (i.e. in
> {{isEmptyWithoutStatus()}} our usage of hbState() + applicationState), but
> there are other failure cases which block host replacements and require
> intrusive workarounds and human intervention to recover from when you have
> something in hbState() you don't expect.
> If we allow opt-in to a riskier (i.e. we don't know how we got there)
> definition of empty, then host replacements can make progress even when
> Gossip has gotten into a bad state. Which it does. All too often.
> This parameter will obviously need some NEWS.txt and other documentation
> around it to explain the context for end users.
> Now that I think of it, a general "how to troubleshoot Gossip problems" guide
> might be worth writing up for operators and users, with this parameter
> covered as part of it, specifically on our
> [Troubleshooting|https://cassandra.apache.org/doc/latest/cassandra/troubleshooting/index.html]
> page. We should probably create another ticket for that and defer the guide
> to it, relying on NEWS.txt and the parameter documentation for this one, just
> to get the functionality into the system for operators who need it.
> A touch more context:
> {code}
> // In the very specific case where hbState.isEmpty and STATUS is missing,
> // this is known to be safe to "fake" the data, as this happens when the
> // gossip state isn't coming from the node but instead from a peer who
> // restarted and is missing the node's state
> //
> // When hbState is *not* empty, then the node gossiped an empty STATUS, this
> // happens during bootstrap and it's not possible to tell if this is ok or
> // not (we can't really tell if the node is dead or having networking
> // issues); for these cases we need to allow an external actor to verify and
> // inform Cassandra that it is safe; this is done by updating the
> // LOOSE_DEF_OF_EMPTY_ENABLED field.
> {code}
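
The decision described in the comment above can be sketched roughly as follows. This is a minimal illustration and not the actual patch: the class LooseEmptySketch, the EndpointView type, the AppState enum, and the method names are all hypothetical stand-ins for Cassandra's real gossip classes. Only the idea of an operator-controlled LOOSE_DEF_OF_EMPTY_ENABLED switch comes from the ticket.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch; none of these types are Cassandra's real classes.
class LooseEmptySketch {
    // Stand-in for the application states gossiped about an endpoint.
    enum AppState { STATUS, TOKENS, HOST_ID }

    // Simplified view of what gossip knows about one endpoint.
    static final class EndpointView {
        final boolean hbStateEmpty;      // true when the heartbeat state is empty
        final Set<AppState> appStates;   // application states currently present

        EndpointView(boolean hbStateEmpty, Set<AppState> appStates) {
            this.hbStateEmpty = hbStateEmpty;
            this.appStates = appStates;
        }
    }

    // Strict definition: empty heartbeat state AND no STATUS.
    // This is the case the comment calls known to be safe to "fake".
    static boolean isEmptyWithoutStatus(EndpointView v) {
        return v.hbStateEmpty && !v.appStates.contains(AppState.STATUS);
    }

    // Loosened definition: STATUS must still be missing, but a non-empty
    // heartbeat state is tolerated only when the operator has opted in.
    static boolean treatAsEmpty(EndpointView v, boolean looseDefOfEmptyEnabled) {
        if (v.appStates.contains(AppState.STATUS)) {
            return false;                // a node with STATUS is never "empty"
        }
        if (v.hbStateEmpty) {
            return true;                 // the strict, known-safe case
        }
        return looseDefOfEmptyEnabled;   // ambiguous case: operator must vouch for it
    }

    public static void main(String[] args) {
        EndpointView peerLostState = new EndpointView(true, EnumSet.noneOf(AppState.class));
        EndpointView unexpectedHb = new EndpointView(false, EnumSet.of(AppState.TOKENS));

        System.out.println(treatAsEmpty(peerLostState, false));
        System.out.println(treatAsEmpty(unexpectedHb, false));
        System.out.println(treatAsEmpty(unexpectedHb, true));
    }
}
```

In this sketch the flag only widens the ambiguous case (STATUS missing but something unexpected in the heartbeat state). An endpoint that has gossiped a STATUS is never treated as empty, regardless of the flag.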