Hi,
thanks for the replies. In the case I hit, the nodes had different cluster
names, but as this check is not done on all gossip message paths it somehow
got through.
If the consensus is to leave it as is, so be it. We can revisit if needed,
and a POC PR is already up.
Regards
On 3/4/26 15:33, Sam Tunnicliffe wrote:
As another who has worked on/with/around gossip a lot, I would echo David and
Caleb's concerns and strongly discourage making changes like this, especially
when we don't even have a reliable way to reproduce the bug report.
It is the case that only the gossip SYN currently includes the cluster name &
partitioner, but even this offers weak protection; what are the odds of having
the same name and partitioner across multiple clusters, especially once we get
into the realms of misconfiguration and operator error (or even malicious
actors)?
In the default configuration, it is entirely possible to send hand-crafted or
accidentally misaddressed messages of any kind to the cluster, not just gossip
messages [1].
IMO the right solution is to properly secure internode messaging: configure
internode encryption using per-cluster certs / keystores, and perhaps turn on
hostname verification via the server_encryption_options section of
cassandra.yaml.
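Something along these lines (an illustrative sketch only; the paths are
placeholders, and the exact options should be checked against the docs for
your version):

    server_encryption_options:
        internode_encryption: all
        keystore: /path/to/cluster-a-keystore.jks        # per-cluster keystore (placeholder path)
        keystore_password: <keystore password>
        truststore: /path/to/cluster-a-truststore.jks    # per-cluster truststore (placeholder path)
        truststore_password: <truststore password>
        require_client_auth: true
        # hostname verification of the peer's certificate
        require_endpoint_verification: true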
[1] https://cassandra.apache.org/doc/4.1/cassandra/operating/security.html
On 4 Mar 2026, at 08:24, Berenguer Blasi <[email protected]> wrote:
Hi,
thanks all for engaging.
Yes, GossipDigestSyn does check that, but the check happens once and never
again on later messages. On unintentional operator accidents: an IP was
thought to be free but another node in a different cluster is started with
that IP; pluggable storage with system tables is connected to the wrong node;
a bad config is copied over; etc. You get instances where a legit node
rightfully passed the GossipDigestSyn check but, some time later, through
operational error, a node from a different cluster impersonates that first
node and you're in big trouble. I didn't manage to reproduce this, but I know
of 2 instances where it happened.
The solution in this PR decorates more Gossip messages with DC and
partitioner so that membership can be validated more often. This is about
dealing better with 'failure' or broken scenarios; the 'happy path', as you
rightly pointed out, is already covered.
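To illustrate the shape of the idea (a simplified sketch, not the actual PR
code; the class and method names here are made up, and it uses the cluster
name and partitioner from the ticket):

    import org.apache.cassandra.config.DatabaseDescriptor;

    // Simplified sketch, not the PR itself: a gossip message decorated with
    // this metadata gets checked against local config before its state is
    // merged, instead of only once at SYN time.
    public final class ClusterMembershipCheckSketch
    {
        static boolean matchesLocalCluster(String remoteClusterName, String remotePartitioner)
        {
            return DatabaseDescriptor.getClusterName().equals(remoteClusterName)
                && DatabaseDescriptor.getPartitionerName().equals(remotePartitioner);
        }
    }

Mismatching messages would then be dropped rather than merged.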
Regards
On 3/3/26 20:08, David Capwell wrote:
I'm on leave so not really going to look too closely, but as someone who has
worked on gossip a lot, I'm hesitant to add more state to it; more chances for
hard-to-understand race bugs that brick gossip (it's taken years to get
stable-ish… there are still issues that we don't know how to repro/fix).
Just looking at the linked JIRA summary, "Instances from a 2nd ring join
another ring when running on the same nodes", it feels like internode auth
(blocking nodes from joining the wrong ring) is the best solution here? Also,
gossip does validate the cluster id / partitioner; we do this in
`GossipDigestSyn`. So it feels like there is something else going on and
modifying gossip isn't the right track? This metadata isn't part of the
application state, but it's already part of the gossip protocol, so I'm not
sure how adding a JSON payload carrying the same details solves the reported
problem?
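(For context, the shape of the existing SYN payload, paraphrased with
approximate field names rather than copied from upstream: the cluster name
and partitioner travel in the handshake message itself, outside the
per-endpoint application state.)

    import java.util.List;

    // Paraphrased sketch of the SYN message shape (field names approximate).
    public class GossipDigestSynSketch
    {
        final String clusterId;      // cluster name, validated once on SYN
        final String partitioner;    // partitioner class name, validated once on SYN
        final List<String> digests;  // endpoint digests (real type: GossipDigest)

        GossipDigestSynSketch(String clusterId, String partitioner, List<String> digests)
        {
            this.clusterId = clusterId;
            this.partitioner = partitioner;
            this.digests = digests;
        }
    }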
On Mar 3, 2026, at 10:39 AM, Caleb Rackliffe <[email protected]> wrote:
I'm a little hesitant to allow a generic JSON payload, but I still need to
think on it a bit.
On Tue, Mar 3, 2026 at 1:43 AM Berenguer Blasi <[email protected]> wrote:
Hi,
We've seen this issue in some production systems and I've been asked to
raise this to the list for visibility.
The main idea [1] is to propagate the partitioner and cluster name through
Gossip and validate them. The approach I took is to JSON-encode them in a new
generic JSON_PAYLOAD AppState (a rough sketch follows the list below), though
I lack the historical context as to why enum ordinals were used in the first
place. IMO, JSON encoding going forward:
- Prevents burning extra AppStates
- Prevents forks with custom AppStates from conflicting on the mapping during
online rolling upgrades to OSS (scary)
- Is friendlier to extension and customization, and more robust towards
modifications
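To make the encoding concrete, here is a minimal sketch of what building the
payload could look like (the class name, keys, and use of Jackson are
illustrative assumptions, not the final PR):

    import java.util.LinkedHashMap;
    import java.util.Map;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.cassandra.config.DatabaseDescriptor;

    // Illustrative sketch only: one generic JSON-encoded state instead of a
    // new enum ordinal per piece of metadata. The keys are assumptions.
    public final class JsonPayloadSketch
    {
        public static String build() throws Exception
        {
            Map<String, String> payload = new LinkedHashMap<>();
            payload.put("cluster_name", DatabaseDescriptor.getClusterName());
            payload.put("partitioner", DatabaseDescriptor.getPartitionerName());
            // Future metadata can be added here without burning another ordinal.
            return new ObjectMapper().writeValueAsString(payload);
        }
    }

The receiving side would parse the blob and compare it against its own
configuration before merging the sender's state.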
Options:
A. Introduce this new generic state (4.0 -> trunk) and use it from here
onward.
B. Drop the idea of a generic JSON AppState and just add one new AppState for
this ticket (4.0 -> 5.0), as this is not an issue in trunk due to TCM. This
option de-risks the upcoming trunk release and could be repurposed in the
future to become A if we so choose.
Thoughts welcome; thanks in advance.
[1] https://issues.apache.org/jira/browse/CASSANDRA-20910