[ https://issues.apache.org/jira/browse/CASSANDRA-17842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh McKenzie updated CASSANDRA-17842:
--------------------------------------
    Reviewers: David Capwell
       Status: Review In Progress  (was: Patch Available)

> Add the ability for operators to allow intentional loosening of definition of
> "empty" in Gossip for specific edge case failure scenarios
> -----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-17842
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17842
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Cluster/Gossip
>            Reporter: Josh McKenzie
>            Assignee: Josh McKenzie
>            Priority: Normal
>             Fix For: 4.x
>
>
> Right now {{empty}} is very specific to a single edge case (i.e. in
> {{isEmptyWithoutStatus()}} our usage of hbState() + applicationState), but
> there are other failure cases which block host replacements and require
> intrusive workarounds and human intervention to recover from when you have
> something in hbState() you don't expect.
>
> If we allow opt-in to a more risky (i.e. we don't know how we got there)
> definition of empty, then host replacements can make progress even when
> Gossip's gotten into a bad state. Which it does. All too often.
>
> This parameter will obviously need some NEWS.txt and other documentation
> around it to explain the context for end users.
>
> Now that I think of it, general "how to troubleshoot Gossip problems" might
> be worth writing up and including this as part of it for operators and users,
> specifically on our
> [Troubleshooting|https://cassandra.apache.org/doc/latest/cassandra/troubleshooting/index.html]
> page. Probably create that as another ticket and defer that update to there
> and rely on news.txt and the param documentation for this one just to get the
> functionality into the system for operators who need it.
>
> A touch more context:
> {code}
> // In the very specific case where hbState.isEmpty and STATUS is missing, this is known to be safe to "fake"
> // the data, as this happens when the gossip state isn't coming from the node but instead from a peer who
> // restarted and is missing the node's state
> //
> // When hbState is *not* empty, then the node gossiped an empty STATUS, this happens during bootstrap and it's not
> // possible to tell if this is ok or not (we can't really tell if the node is dead or having networking issues);
> // for these cases we need to allow an external actor to verify and inform Cassandra that it is safe; this is done by
> // updating the LOOSE_DEF_OF_EMPTY_ENABLED field.
> {code}
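
For readers who want the decision described in the quoted comment spelled out, here is a minimal, self-contained Java sketch of one way the strict and loose definitions of "empty" could be combined. It is not the code from the patch: the class `LooseEmptySketch`, the type `EndpointStateSketch`, the `AppState` enum, and the `looseDefOfEmptyEnabled` parameter are all hypothetical stand-ins, and the only names taken from the ticket are `isEmptyWithoutStatus()`, `hbState`, STATUS, and `LOOSE_DEF_OF_EMPTY_ENABLED`.

```java
import java.util.EnumSet;
import java.util.Set;

public class LooseEmptySketch {
    /** Hypothetical subset of gossip application-state keys (assumption, not Cassandra's enum). */
    enum AppState { STATUS, STATUS_WITH_PORT, HOST_ID, TOKENS }

    /** Hypothetical stand-in for one endpoint's gossip state. */
    static final class EndpointStateSketch {
        private final boolean heartbeatEmpty;   // models hbState.isEmpty()
        private final Set<AppState> appStates;  // models the keys present in applicationState

        EndpointStateSketch(boolean heartbeatEmpty, Set<AppState> appStates) {
            this.heartbeatEmpty = heartbeatEmpty;
            EnumSet<AppState> copy = EnumSet.noneOf(AppState.class);
            copy.addAll(appStates);
            this.appStates = copy;
        }

        boolean hasStatus() {
            return appStates.contains(AppState.STATUS)
                || appStates.contains(AppState.STATUS_WITH_PORT);
        }

        /**
         * Strict case: heartbeat state is empty and STATUS is missing, which the
         * ticket describes as known to be safe.
         * Loose case: STATUS is missing but heartbeat state is not empty; this is
         * treated as empty only when the operator has opted in.
         */
        boolean isEmptyWithoutStatus(boolean looseDefOfEmptyEnabled) {
            if (hasStatus()) {
                return false;               // a node with a STATUS is never "empty"
            }
            if (heartbeatEmpty) {
                return true;                // the single edge case handled today
            }
            return looseDefOfEmptyEnabled;  // opt-in, riskier definition
        }
    }

    public static void main(String[] args) {
        EndpointStateSketch fromPeer =
            new EndpointStateSketch(true, EnumSet.noneOf(AppState.class));
        EndpointStateSketch unexpectedHeartbeat =
            new EndpointStateSketch(false, EnumSet.of(AppState.HOST_ID));

        System.out.println("peer-sourced state, strict: " + fromPeer.isEmptyWithoutStatus(false));
        System.out.println("unexpected hbState, strict: " + unexpectedHeartbeat.isEmptyWithoutStatus(false));
        System.out.println("unexpected hbState, loose:  " + unexpectedHeartbeat.isEmptyWithoutStatus(true));
    }
}
```

In this sketch the opt-in flag only widens the result when STATUS is absent, so a node that has gossiped a STATUS is never treated as empty regardless of the setting. Whether the real patch draws the line in exactly the same place is an assumption here.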