I'm +1 to continuing work on CASSANDRA-18917 for all the reasons Jordan listed.
Sounds like the request was to hit the pause button until TCM merged rather than skipping the work entirely, so that's promising.

On Thu, May 16, 2024, at 1:43 PM, Jon Haddad wrote:
> I have also recently worked with a team who lost critical data as a result
> of gossip issues combined with collision in our token allocation. I haven’t
> filed a jira yet as it slipped my mind but I’ve seen it in my own testing as
> well. I’ll get a JIRA in describing it in detail.
>
> It’s severe enough that it should probably block 5.0.
>
> Jon
>
> On Thu, May 16, 2024 at 10:37 AM Jordan West <jw...@apache.org> wrote:
>> I’m a big +1 on 18917 or more testing of gossip. While I appreciate that it
>> makes TCM more complicated, gossip and schema propagation bugs have been the
>> source of our two worst data loss events in the last 3 years. Data loss
>> should immediately cause us to evaluate what we can do better.
>>
>> We will likely live with gossip for at least 1, maybe 2, more years.
>> Otherwise outside of bug fixes (and to some degree even still) I think the
>> only other solution is to not touch gossip *at all* until we are all
>> TCM-only which I don’t think is practical or realistic. Recent changes to
>> gossip in 4.1 introduced several subtle bugs that had serious impact (from
>> data loss to loss of ability to safely replace nodes in the cluster).
>>
>> I am happy to contribute some time to this if lack of folks is the issue.
>>
>> Jordan
>>
>> On Mon, May 13, 2024 at 17:05 David Capwell <dcapw...@apple.com> wrote:
>>> So, I created https://issues.apache.org/jira/browse/CASSANDRA-18917 which
>>> lets you do deterministic gossip simulation testing across large clusters
>>> within seconds… I stopped this work as it conflicted with TCM (they were
>>> trying to merge that week) and it hit issues where some nodes never
>>> converged… I didn’t have time to debug so I had to drop the patch…
>>>
>>> This type of change would be a good reason to resurrect that patch as
>>> testing gossip is super dangerous right now… its behavior is only in a few
>>> people’s heads and even then it’s just bits and pieces scattered across
>>> multiple people (and likely missing pieces)…
>>>
>>> My brain is far too fried right now to say your idea is safe or not, but
>>> honestly feel that we would need to improve our tests (we have 0) before
>>> making such a change…
>>>
>>> I do welcome the patch though...
>>>
>>>
>>>> On May 12, 2024, at 8:05 PM, Zemek, Cameron via dev
>>>> <dev@cassandra.apache.org> wrote:
>>>>
>>>> In looking into CASSANDRA-19580 I noticed something that raises a
>>>> question. With Gossip SYN it doesn't check for missing digests. If it's
>>>> empty for shadow round it will add everything from endpointStateMap to the
>>>> reply. But why not include missing entries in normal replies? The
>>>> branching for reply handling of SYN requests could then be merged into
>>>> a single code path (though shadow round handles empty state differently
>>>> with CASSANDRA-16213). A potential downside is the performance impact, as
>>>> this requires doing a set difference.
>>>>
>>>> For example, something along the lines of:
>>>>
>>>> ```
>>>> // Endpoints known locally that the incoming SYN's digest list omits.
>>>> Set<InetAddressAndPort> missing = new HashSet<>(endpointStateMap.keySet());
>>>> missing.removeAll(gDigestList.stream().map(GossipDigest::getEndpoint).collect(Collectors.toSet()));
>>>> // Add a zeroed digest for each omitted endpoint.
>>>> for (InetAddressAndPort endpoint : missing)
>>>> {
>>>>     gDigestList.add(new GossipDigest(endpoint, 0, 0));
>>>> }
>>>> ```
>>>>
>>>> It seems odd to me that after shadow round for a new node we have
>>>> endpointStateMap with only itself as an entry. Then the only way it gets
>>>> the gossip state is by another node choosing to send the new node a gossip
>>>> SYN. The choosing of this is random. Yeah this happens every second so
>>>> eventually it's going to receive one (outside the issue of CASSANDRA-19580
>>>> where it doesn't if it's in a dead state like hibernate), but doesn't this
>>>> open up bootstrapping to failures on very large clusters as it can take
>>>> longer before it's sent a SYN (as the odds of being chosen for SYN get
>>>> lower)? For years I've been seeing bootstrap failures with 'Unable to
>>>> contact any seeds' but they are infrequent and I've never been able to
>>>> figure out how to reproduce in order to open a ticket, but I wonder if
>>>> some of them have been due to not receiving a SYN message before it does
>>>> the seenAnySeed check.
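
The set-difference step from the quoted snippet can be tried outside Cassandra. What follows is only a minimal sketch under stated assumptions, not the Gossiper code: MissingDigestSketch, Digest and addMissingDigests are hypothetical names, endpoints are plain strings rather than InetAddressAndPort, and Digest is a simplified stand-in for GossipDigest. It copies the locally known endpoints, removes every endpoint the SYN already mentions, and appends a zeroed (0, 0) digest for whatever is left, as the snippet does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Stand-in types only (assumption): this does not use any Cassandra classes.
final class MissingDigestSketch
{
    // Hypothetical simplification of a gossip digest: endpoint, generation, max version.
    static final class Digest
    {
        final String endpoint;
        final int generation;
        final int maxVersion;

        Digest(String endpoint, int generation, int maxVersion)
        {
            this.endpoint = endpoint;
            this.generation = generation;
            this.maxVersion = maxVersion;
        }
    }

    // Appends a zeroed digest for each locally known endpoint that the SYN's
    // digest list omits, and returns the set of endpoints that were added.
    static Set<String> addMissingDigests(Set<String> knownEndpoints, List<Digest> digests)
    {
        // Copy first so the caller's set of known endpoints is not modified.
        Set<String> missing = new LinkedHashSet<>(knownEndpoints);
        for (Digest digest : digests)
            missing.remove(digest.endpoint);
        for (String endpoint : missing)
            digests.add(new Digest(endpoint, 0, 0));
        return missing;
    }

    public static void main(String[] args)
    {
        // Three endpoints known locally; the SYN only mentions the first one.
        Set<String> known = new TreeSet<>(Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3"));
        List<Digest> digests = new ArrayList<>();
        digests.add(new Digest("10.0.0.1", 5, 12));

        Set<String> added = addMissingDigests(known, digests);
        System.out.println("added zeroed digests for: " + added);
        System.out.println("digest list size is now: " + digests.size());
    }
}
```

The cost concern raised in the quoted message shows up here as one pass over the digest list plus a copy of the known-endpoint set per SYN, so it grows with cluster size. Whether that matters in practice would need measuring against the real Gossiper.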