Since I have been re-playing Ghost of Tsushima, I felt a Haiku would be
appropriate.

my cluster is failing
jConsole to the rescue
now I am failing

On Mon, Nov 8, 2021 at 12:46 PM Joshua McKenzie <joshua.mcken...@gmail.com>
wrote:

> First off - Congrats again to Sumanth Pasupuleti on becoming a committer on
> the project! Well deserved; looking forward to working with you further.
>
> It looks like ponymail got an upgrade; I didn't even realize that was
> possible at this point. :) So caveat emptor: the links I put in here to
> individual email threads are different than in the past but appear to be
> working.
>
> [New contributors getting started]
> There's been some discussion about whether the #cassandra-dev channel with
> 600 people in it is the best place for new contributors to get involved and
> publicly ask beginner questions or whether we should start a new channel
> with a somewhat more limited scope. Please chime in on that dev mailing
> list thread if you have an opinion:
> https://lists.apache.org/thread/x8fx9b22nfll3gd40w4o971cyznckxrz
>
> If you're a new contributor, we recommend starting in one of two places:
> failing tests, or starter tickets we label "lhf" (low hanging fruit).
> Query for failing tests:
>
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=496&quickFilter=2252
> Query for unassigned starter tickets:
>
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&quickFilter=2162&quickFilter=2160
>
> We're up from 18 unassigned test failures to 22 in the past couple of
> weeks. David Capwell, Berenguer Blasi, and Ekaterina Dimitrova (and
> others!) have been doing some great work both surfacing failures and
> fixing things - thank you!
>
> For unassigned lhf, we're up from 10 to 11 on 4.0.2 (our next minor
> release) and up from 13 to 14 on 4.1.0 (our next major release). Feel free
> to self-select from that list, hit up this email thread or list if you want
> some guidance on where to get involved, ping in the #cassandra-dev slack
> channel on the-asf.slack.com server, or email or message me directly if you
> want any help.
>
> [Dev list discussions in the past 14 days]
> https://lists.apache.org/list?dev@cassandra.apache.org:lte=2w:
>
> We have an ongoing discussion about what it means to have a releasable
> trunk and what steps, if any, it'd take to get there. Given the scale and
> complexity of this project and its testing infrastructure, I'm curious to
> hear what other experiences people have had with applying select CI and CD
> principles to an ecosystem like this:
> https://lists.apache.org/thread/kyyo5k3my2nx160mfgy0xkwo8xjh2qpv
>
> As mentioned above, there's an ongoing discussion about how to make the
> cassandra dev community more welcoming for newcomers:
> https://lists.apache.org/thread/x8fx9b22nfll3gd40w4o971cyznckxrz
>
> Andres surfaced CEP-3 for guardrails in which we all professed our
> continued love for JMX (especially you Patrick). It'd be great to see more
> operators chime in with their experience running clusters at scale and the
> usage anti-patterns that destabilize clusters, since guardrails would be a
> great way to protect against frequently occurring patterns that scale
> poorly, among other things (tombstone-heavy workloads and thousands of
> tables, anyone?).
>
> CEP-18: Improving Modularity is going to be deprecated in favor of
> module-specific refactors and optional implementations.
>
> CEP-17: SSTable format API is evolving nicely:
> https://lists.apache.org/thread/boqb5trkq1q38rmb50p4lsw95hyv053m
>
> And these are just the highlights!
>
> [Tickets in the past 14 days]
> On the 4.0.2 front we've closed out 5 tickets compared to 9 in the prior 2
> weeks. Looks like permissions, some timeouts during replica failure,
> website updates, etc.
>
> For 4.1.0 we've closed out 8 issues down from 14. Some stability in schema
> pulls, commit log stability during testing, a slew of test fixes, and a new
> feature to allow denying access to configured partition keys for reads,
> writes, or range reads based on config (CQL or JMX).
>
> [Tickets that need attention]
> Needs Reviewer:
>
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&selectedIssue=CASSANDRA-16547&quickFilter=2259
>
> I've tidied up / created a new quick filter that shows tickets that are in
> progress, blocked, or patch available but lacking a reviewer. This is
> slightly opinionated of me in that it implies we should have reviewers for
> things as we work on them rather than once they're further along; I have a
> bias towards early inclusion of a 2nd pair of eyes and a sounding board. If
> you see anything on this list that you're qualified to review, or you know
> that area of the code-base and have a few cycles, please take a look and
> help out.
>
> Workload-wise, 14 tickets on 4.0.2 need reviewers and 34 on 4.1.0 by this
> definition.
>
> I'm going to refrain from linking to stalled tickets (30d inactive) for
> now; the load of that is high (80 on 4.0.2, 422 on 4.1.0) so we probably
> should approach this a little differently if we want to tidy up or prune
> that backlog. It's as simple as a fixversion flag, so it doesn't really
> indicate _too_ much to worry about.
>
> [Test Failure Trendlines]
> So first off, we have a good number of tests in this project. 43,000 or so
> now. It's helpful to keep that in mind when we talk about having 5, 10, or
> even 50 test failures relative to the total corpus. Unfortunately,
> databases are like compilers in that they're rather unforgiving of even a
> .125% failure rate.
>
> So what's our test failure trend? We have 2 trendlines of interest:
> 1) The test failures on the project that are documented in JIRA tickets:
>
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=496&view=reporting&chart=cumulativeFlowDiagram&swimlane=1233&swimlane=1234&column=2195&column=2196&column=2197&days=90
>
> We can see where I got feisty creating test failure tickets when trying to
> merge the Denylist patch a week ago. In general, the volume of "open
> tickets for known test failures" has been growing:
>
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=496&view=reporting&chart=cumulativeFlowDiagram&swimlane=1233&swimlane=1234&column=2195&column=2196&column=2197&days=90
>
> That said, this could be due to a variety of factors: more failures,
> increased discipline around tracking, or even poor hygiene closing out
> tickets when we fix the related tests.
>
> 2) The metric that I think is a bit cleaner and more informative is our
> test failure history on our jenkins build server (assuming I can ever get
> it to load /groan):
>
>
> https://ci-cassandra.apache.org/job/Cassandra-trunk/lastCompletedBuild/testReport/history/
>
> In general we've been pretty clean (meaning single digit failures) since
> the 4.0 release; as discussed in another thread, the recent spate of
> failures caused by dtest-api dependency changes is being addressed in
> CASSANDRA-17050. Silver lining: that situation has surfaced 1) a need for a
> discussion and improvement around how we work with dependent projects and
> release dependencies in Cassandra (all in one IDE as subprojects vs.
> separate projects, release dependencies, etc) and we can expect to see a
> DISCUSS thread about that soon, and 2) that there are broader failures going
> on with some of the python dtests that we need to get to the bottom of.
>
> And that's a wrap, folks. I call this one "The Calm Before the Storm" if our
> CEPs are any indicator. :)
>
> As always, thanks everyone for the time, effort, and collaboration on the
> project.
>
> ~Josh
>
