[ https://issues.apache.org/jira/browse/CASSANDRA-17842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh McKenzie updated CASSANDRA-17842:
--------------------------------------
    Reviewers: David Capwell
       Status: Review In Progress  (was: Patch Available)

> Add the ability for operators to allow intentional loosening of definition of 
> "empty" in Gossip for specific edge case failure scenarios
> ----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-17842
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17842
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Cluster/Gossip
>            Reporter: Josh McKenzie
>            Assignee: Josh McKenzie
>            Priority: Normal
>             Fix For: 4.x
>
>
> Right now {{empty}} is very specific to a single edge case (i.e. in 
> {{isEmptyWithoutStatus()}}, our usage of hbState() + applicationState), but 
> there are other failure cases that block host replacements and require 
> intrusive workarounds and human intervention to recover from when 
> hbState() contains something you don't expect.
> If we allow opt-in to a riskier (i.e. we don't know how we got there) 
> definition of empty, then host replacements can make progress even when 
> Gossip has gotten into a bad state, which happens all too often.
> This parameter will need NEWS.txt and other documentation around it to 
> explain the context for end users.
> A general "how to troubleshoot Gossip problems" write-up might also be worth 
> creating, with this parameter covered as part of it for operators and users, 
> specifically on our 
> [Troubleshooting|https://cassandra.apache.org/doc/latest/cassandra/troubleshooting/index.html]
>  page. That is probably best handled as a separate ticket: defer the 
> troubleshooting update to that ticket, and rely on NEWS.txt and the parameter 
> documentation for this one, just to get the functionality into the system for 
> operators who need it.
> A touch more context:
> {code}
> // In the very specific case where hbState.isEmpty and STATUS is missing,
> // this is known to be safe to "fake" the data, as this happens when the
> // gossip state isn't coming from the node but instead from a peer who
> // restarted and is missing the node's state
> //
> // When hbState is *not* empty, then the node gossiped an empty STATUS, this
> // happens during bootstrap and it's not possible to tell if this is ok or not
> // (we can't really tell if the node is dead or having networking issues);
> // for these cases we need to allow an external actor to verify and inform
> // Cassandra that it is safe; this is done by updating the
> // LOOSE_DEF_OF_EMPTY_ENABLED field.
> {code}
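
The two cases described in the comment above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual patch: the class LooseEmptySketch, the Heartbeat and AppState types, the EMPTY_VERSION sentinel, and the looseDefOfEmptyEnabled parameter are hypothetical stand-ins for Cassandra's real gossip classes and for the LOOSE_DEF_OF_EMPTY_ENABLED field. It assumes the loose definition simply drops the "heartbeat must be empty" requirement while still requiring STATUS to be missing.

```java
import java.util.EnumSet;
import java.util.Set;

public class LooseEmptySketch {
    /** Hypothetical subset of application states; only STATUS matters here. */
    enum AppState { STATUS, HOST_ID, TOKENS }

    /** Hypothetical heartbeat state, treated as empty when no version was ever recorded. */
    static final class Heartbeat {
        static final int EMPTY_VERSION = -1; // assumption made for this sketch
        final int version;

        Heartbeat(int version) { this.version = version; }

        boolean isEmpty() { return version == EMPTY_VERSION; }
    }

    /**
     * Strict rule: the heartbeat is empty AND STATUS is missing (known to be safe).
     * Loose rule (operator opt-in): a missing STATUS alone is enough, even when
     * the heartbeat carries data we did not expect.
     */
    static boolean isEmptyWithoutStatus(Heartbeat hb, Set<AppState> states,
                                        boolean looseDefOfEmptyEnabled) {
        boolean hasStatus = states.contains(AppState.STATUS);
        if (hb.isEmpty() && !hasStatus)
            return true;
        return looseDefOfEmptyEnabled && !hasStatus;
    }

    public static void main(String[] args) {
        Set<AppState> noStatus = EnumSet.of(AppState.HOST_ID);
        Heartbeat empty = new Heartbeat(Heartbeat.EMPTY_VERSION);
        Heartbeat nonEmpty = new Heartbeat(5);

        // Peer restarted and lost the node's state: safe under the strict rule.
        System.out.println(isEmptyWithoutStatus(empty, noStatus, false));    // true
        // Non-empty heartbeat without STATUS: blocked unless the operator opts in.
        System.out.println(isEmptyWithoutStatus(nonEmpty, noStatus, false)); // false
        System.out.println(isEmptyWithoutStatus(nonEmpty, noStatus, true));  // true
    }
}
```

In this sketch the flag never overrides a present STATUS; it only widens "empty" for the ambiguous case where an external actor has already verified that proceeding is safe.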



