Since I have been re-playing Ghost of Tsushima, I felt a Haiku would be appropriate.
my cluster is failing
jConsole to the rescue
now I am failing

On Mon, Nov 8, 2021 at 12:46 PM Joshua McKenzie <joshua.mcken...@gmail.com> wrote:

> First off - Congrats again to Sumanth Pasupuleti on becoming a committer on
> the project! Well deserved; looking forward to working with you further.
>
> It looks like ponymail got an upgrade; I didn't even realize that was
> possible at this point. :) So caveat emptor: the links I put in here to
> individual email threads are different than in the past but appear to be
> working.
>
> [New contributors getting started]
> There's been some discussion about whether the #cassandra-dev channel with
> 600 people in it is the best place for new contributors to get involved and
> publicly ask beginner questions or whether we should start a new channel
> with a somewhat more limited scope. Please chime in on that dev mailing
> list thread if you have an opinion:
> https://lists.apache.org/thread/x8fx9b22nfll3gd40w4o971cyznckxrz
>
> As a new contributor we recommend starting in one of two places: Failing
> tests, or starter tickets we label "lhf" (low hanging fruit).
> Query for failing tests:
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=496&quickFilter=2252
> Query for unassigned starter tickets:
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&quickFilter=2162&quickFilter=2160
>
> We're up from 18 unassigned test failures to 22 in the past couple of
> weeks. David Capwell, Berenguer Blasi, and Ekaterina Dimitrova (and
> others!) have been doing some great work both surfacing failures as well as
> fixing things - thank you!
>
> For unassigned lhf, we're up from 10 to 11 on 4.0.2 (our next minor
> release) and up from 13 to 14 on 4.1.0 (our next major release).
> Feel free to self-select from that list, hit up this email thread or list
> if you want some guidance on where to get involved, ping in the
> #cassandra-dev slack channel on the-asf.slack.com server, or email or
> message me directly if you want any help.
>
> [Dev list discussions in the past 14 days]
> https://lists.apache.org/list?dev@cassandra.apache.org:lte=2w:
>
> We have an ongoing discussion about what it means to have a releasable
> trunk and what steps, if any, it'd take to get there. Given the scale and
> complexity of this project and its testing infrastructure, I'm curious to
> hear what other experiences people have had with applying select CI and CD
> principles to an ecosystem like this:
> https://lists.apache.org/thread/kyyo5k3my2nx160mfgy0xkwo8xjh2qpv
>
> As mentioned above, there's an ongoing discussion about how to make the
> cassandra dev community more welcoming for newcomers:
> https://lists.apache.org/thread/x8fx9b22nfll3gd40w4o971cyznckxrz
>
> Andres surfaced CEP-3 for guardrails in which we all professed our
> continued love for JMX (especially you Patrick). It'd be great to see more
> operators chime in with their experience running clusters at scale and the
> type of anti-patterns of usage that destabilize clusters since guardrails
> would be a great way to expose protection against frequently occurring
> patterns that scale poorly, among other things (tombstone heavy workloads
> and thousands of tables anyone?)
>
> CEP-18: Improving Modularity is going to be deprecated in favor of
> module-specific refactors and optional implementations.
>
> CEP-17: SSTable format API is evolving nicely:
> https://lists.apache.org/thread/boqb5trkq1q38rmb50p4lsw95hyv053m
>
> And these are just the highlights!
>
> [Tickets in the past 14 days]
> On the 4.0.2 front we've closed out 5 tickets compared to 9 in the prior 2
> weeks. Looks like permissions, some timeouts during replica failure,
> website updates, etc.
>
> For 4.1.0 we've closed out 8 issues, down from 14. Some stability in schema
> pulls, commit log stability during testing, a slew of test fixes, and a new
> feature to allow denying access to configured partition keys for reads,
> writes, or range reads based on config (CQL or JMX).
>
> [Tickets that need attention]
> Needs Reviewer:
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&selectedIssue=CASSANDRA-16547&quickFilter=2259
>
> I've tidied up / created a new quick filter for tickets that are in
> progress, blocked, or patch available but lacking a reviewer. This is
> slightly opinionated of me in that it implies we should have reviewers for
> things as we work on them rather than once they're further along being
> written; I have a bias towards early inclusion of a 2nd pair of eyes and a
> sounding board. If you see anything on this list that you're qualified to
> review, or you know the area of the code-base and have a few cycles, please
> take a look and help out.
>
> Workload wise, 14 tickets on 4.0.2 need reviewers and 34 on 4.1.0 by this
> definition.
>
> I'm going to refrain from linking to stalled tickets (30d inactive) for
> now; the load of that is high (80 on 4.0.2, 422 on 4.1.0) so we probably
> should approach this a little differently if we want to tidy up or prune
> that backlog. It's as simple as a fixversion flag so doesn't really
> indicate _too_ much to worry about.
>
> [Test Failure Trendlines]
> So first off, we have a good number of tests in this project. 43,000 or so
> now. It's helpful to keep that in mind when we talk about having 5, 10, or
> even 50 test failures relative to the total corpus. Unfortunately,
> databases are like compilers in that they're rather unforgiving of even a
> .125% failure rate.
>
> So what's our test failure trend?
> We have 2 trendlines of interest:
> 1) The documented JIRA-ticket created test failures on the project:
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=496&view=reporting&chart=cumulativeFlowDiagram&swimlane=1233&swimlane=1234&column=2195&column=2196&column=2197&days=90
>
> We can see where I got feisty creating test failure tickets when trying to
> merge the Denylist patch a week ago. In general, the volume of "open
> tickets for known test failures" has been growing:
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=496&view=reporting&chart=cumulativeFlowDiagram&swimlane=1233&swimlane=1234&column=2195&column=2196&column=2197&days=90
>
> That said, this could be due to a variety of factors: more failures,
> increased discipline around tracking, or even poor hygiene closing out
> tickets when we fix the related tests.
>
> 2) The metric that I think is a bit cleaner and more informative is our
> test failure history on our jenkins build server (assuming I can ever get
> it to load /groan):
> https://ci-cassandra.apache.org/job/Cassandra-trunk/lastCompletedBuild/testReport/history/
>
> In general we've been pretty clean (meaning single digit failures) since
> the 4.0 release; as discussed in another thread, the recent spate of
> failures caused by dtest-api dependency changes is being addressed in
> CASSANDRA-17050. Silver lining: that situation has surfaced 1) a need for a
> discussion and improvement around how we work with dependent projects and
> release dependencies in Cassandra (all in one IDE as subprojects vs.
> separate projects, release dependencies, etc) and we can expect to see a
> DISCUSS thread about that soon, and 2) that there have been broader
> failures going on with some of the python dtests for a bit here that we
> need to get to the bottom of.
>
> And that's a wrap folks. I call this one "The Calm Before the Storm" if our
> CEPs are any indicator. :)
>
> As always, thanks everyone for the time, effort, and collaboration on the
> project.
>
> ~Josh
>