I'm interested. Better defining the components and labels we use in our
docs would be a good start and low-hanging fruit (LHF). I'd prefer if we kept
all the information within JIRA through the use of fields/labels though, and
generated reports off those tags. Keeping all the information in one place
is much better in my experience. Not applicable for CI obviously, but
ideally we can generate testing reports directly from the testing systems.

I don't see this as a huge amount of work, so I think the overall risk is
pretty small, especially considering it can easily be done in a way that
doesn't affect anyone until we get consensus on methodology.



On Sat, 22 Sep 2018 at 03:44, Scott Andreas <sc...@paradoxica.net> wrote:

> Josh, thanks for reading and sharing feedback. Agreed with starting simple
> and measuring inputs that are high-signal; that’s a good place to begin.
>
> To the challenge of building consensus, point taken + agreed. Perhaps the
> distinction is between producing something that’s “useful” vs. something
> that’s “authoritative” for decision-making purposes. My motivation is to
> work toward something “useful” (as measured by the value contributors
> find). I’d be happy to start putting some of these together as part of an
> experiment – and agreed on evaluating “value relative to cost” after we see
> how things play out.
>
> To Benedict’s point on JIRA, agreed that plotting a value from messy input
> wouldn’t produce useful output. Some questions a small working group might
> take on toward better categorization could include:
>
> –––
> – Revisiting the list of components: e.g., “Core” captures a lot right now.
> – Revisiting which fields should be required when filing a ticket – and if
> there are any that should be removed from the form.
> – Reviewing active labels: understanding what people have been trying to
> capture, and how they could be organized + documented better.
> – Documenting “priority” (e.g., a common standard we can point to, even
> if we’re pretty good now).
> – Considering adding a “severity” field to capture the distinction between
> priority and severity.
> –––
>
> If there’s appetite for spending a little time on this, I’d be glad to put
> effort toward it; is anyone else interested?
>
> Otherwise, I’m equally fine with an experiment to measure the basics via the
> current structure, as Josh mentioned.
>
> – Scott
>
>
> On September 20, 2018 at 8:22:55 AM, Benedict Elliott Smith (
> bened...@apache.org) wrote:
>
> It would be great to start getting some high-quality info out of
> JIRA, but I think we need to clean up and standardise how we use it to
> facilitate this.
>
> Take the Component field as an example. This is the current list of
> options:
>
> 4.0
> Auth
> Build
> Compaction
> Configuration
> Core
> CQL
> Distributed Metadata
> Documentation and Website
> Hints
> Libraries
> Lifecycle
> Local Write-Read Paths
> Materialized Views
> Metrics
> Observability
> Packaging
> Repair
> SASI
> Secondary Indexes
> Streaming and Messaging
> Stress
> Testing
> Tools
>
> In some cases there’s duplication (Metrics + Observability, Coordination
> (=“Storage Proxy, Hints, Batchlog, Counters…”) + Hints, Local Write-Read
> Paths + Core)
> In others, there’s a lack of granularity (Streaming + Messaging, Core,
> Coordination, Distributed Metadata)
> In others, there’s a lack of clarity (Core, Lifecycle, Coordination)
> Others are probably missing entirely (Transient Replication, …?)
>
> Labels are also used fairly haphazardly, and there’s no clear definition
> of “priority”.
>
> Perhaps we should form a working group to suggest a methodology for
> filling out JIRA, standardise the necessary components, labels, etc., and put
> together a wiki page with step-by-step instructions on how to do it?
>
>
> > On 20 Sep 2018, at 15:29, Joshua McKenzie <jmcken...@apache.org> wrote:
> >
> > I've spent a good bit of time thinking about the above, and have bounced
> > off both different ways to measure quality and progress and attempts to
> > influence community behavior on this topic. My advice: start small and
> > simple (KISS, YAGNI, all that). Get metrics for pass/fail on
> > utest/dtest/flakiness over time, perhaps also aggregate bug count by
> > component over time. After spending a predetermined time doing that (a
> > couple months?) as an experiment, we retrospect as a project and see if
> > these efforts are adding value commensurate with the time investment
> > required to perform the measurement and analysis.
> >
> > There are a lot of really good ideas in that linked wiki article / this
> > email thread. The biggest challenge, and risk of failure, is in translating
> > good ideas into action and selling project participants on the value of
> > changing their behavior. The latter is where we've fallen short over the
> > years; building consensus (especially regarding process /shudder) is Very
> > Hard.
> >
> > Also - thanks for spearheading this discussion, Scott. It's one we come
> > back to with some regularity, so there's real pain and opportunity here
> > for the project imo.
> >
> > On Wed, Sep 19, 2018 at 9:32 PM Scott Andreas <sc...@paradoxica.net> wrote:
> >
> >> Hi everyone,
> >>
> >> Now that many teams have begun testing and validating Apache Cassandra
> >> 4.0, it’s useful to think about what “progress” looks like. While metrics
> >> alone may not tell us what “done” means, they do help us answer the
> >> question, “are we getting better or worse — and how quickly?”
> >>
> >> A friend described to me a few attributes of metrics he considered useful,
> >> suggesting that good metrics are actionable, visible, predictive, and
> >> consequent:
> >>
> >> – Actionable: We know what to do based on them – where to invest, what to
> >> fix, what’s fine, etc.
> >> – Visible: Everyone who has a stake in a metric has full visibility into
> >> it and participates in its definition.
> >> – Predictive: Good metrics enable forecasting of outcomes – e.g.,
> >> “consistent performance test results against build abc predict an x%
> >> reduction in 99%ile read latency for this workload in prod”.
> >> – Consequent: We take actions based on them (e.g., not shipping if tests
> >> are failing).
> >>
> >> Here are some notes in Confluence toward metrics that may be useful to
> >> track beginning in this phase of the development + release cycle. I’m
> >> interested in your thoughts on these. They’re also copied inline for
> >> easier reading in your mail client.
> >>
> >> Link:
> >>
> >> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=93324430
> >>
> >> Cheers,
> >>
> >> – Scott
> >>
> >> ––––––
> >>
> >> Measuring Release Quality:
> >>
> >> [ This document is a draft + sketch of ideas. It is located in the
> >> "discussion" section of this wiki to indicate that it is an active draft –
> >> not a document that has been voted on, achieved consensus, or is in any
> >> way official. ]
> >>
> >> Introduction:
> >>
> >> This document outlines a series of metrics that may be useful toward
> >> measuring release quality, and quantifying progress during the testing /
> >> validation phase of the Apache Cassandra 4.0 release cycle.
> >>
> >> The goal of this document is to think through what we should consider
> >> measuring to quantify our progress testing and validating Apache Cassandra
> >> 4.0. This document explicitly does not discuss release criteria – though
> >> metrics may be a useful input to a discussion on that topic.
> >>
> >>
> >> Metric: Build / Test Health (produced via CI, recorded in Confluence):
> >>
> >> Bread-and-butter metrics intended to capture baseline build health and
> >> flakiness in the test suite, presented as a time series to understand how
> >> they’ve changed from build to build and release to release:
> >>
> >> Metrics:
> >>
> >> – Pass / fail metrics for unit tests
> >> – Pass / fail metrics for dtests
> >> – Flakiness stats for unit and dtests
> >>
> >>
> >> Metric: “Found Bug” Count by Methodology (sourced via JQL, reported in
> >> Confluence):
> >>
> >> These are intended to help us understand the efficacy of each methodology
> >> being applied. We might consider annotating bugs found in JIRA with the
> >> methodology that produced them. This could be consumed as input in a JQL
> >> query and reported on the Confluence dev wiki.
> >>
> >> As we reach a Pareto-optimal level of investment in a methodology, we’d
> >> expect to see its found-bug rate taper. As we achieve higher quality across
> >> the board, we’d expect to see a tapering in found-bug counts across all
> >> methodologies. In the event that one or two approaches are outliers, this
> >> could indicate the utility of doubling down on a particular form of
> >> testing.
> >>
> >> We might consider reporting “Found By” counts for methodologies such as:
> >>
> >> – Property-based / fuzz testing
> >> – Replay testing
> >> – Upgrade / Diff testing
> >> – Performance testing
> >> – Shadow traffic
> >> – Unit/dtest coverage of new areas
> >> – Source audit
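> >>
> >> As a rough sketch of how this could be sourced via JQL (assuming we adopt a
> >> label convention along the lines of “found-by-<methodology>” – these label
> >> names are hypothetical, not an existing convention), one saved filter per
> >> methodology could feed the Confluence report, e.g.:
> >>
> >>   project = CASSANDRA AND labels = "found-by-fuzzing"
> >>   AND created >= "2018-09-01" ORDER BY created DESC
> >>
> >> Counting the issues matched by each filter on a regular cadence would give
> >> the per-methodology found-bug series described above.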
> >>
> >>
> >> Metric: “Found Bug” Count by Subsystem/Component (sourced via JQL,
> >> reported in Confluence):
> >>
> >> Similar to “found by,” but “found where.” These metrics help us understand
> >> which components or subsystems of the database we’re finding issues in. In
> >> the event that a particular area stands out as “hot,” we’ll have the
> >> quantitative feedback we need to support investment there. Tracking these
> >> counts over time – and their first derivative – the rate – also helps us
> >> make statements regarding progress in various subsystems. Though we can’t
> >> prove a negative (“no bugs have been found, therefore there are no bugs”),
> >> we gain confidence as their rate decreases normalized to the effort we’re
> >> putting in.
> >>
> >> We might consider reporting “Found In” counts for components as enumerated
> >> in JIRA, such as:
> >> – Auth
> >> – Build
> >> – Compaction
> >> – Compression
> >> – Core
> >> – CQL
> >> – Distributed Metadata
> >> – …and so on.
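> >>
> >> As an illustrative sketch, using component names as they currently appear
> >> in JIRA, a per-component filter might look like:
> >>
> >>   project = CASSANDRA AND issuetype = Bug AND component = "Repair"
> >>   AND created >= "2018-09-01"
> >>
> >> Snapshotting the count of matches per component at a regular interval would
> >> give the “found where” counts – and their rate of change – described above.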
> >>
> >>
> >> Metric: “Found Bug” Count by Severity (sourced via JQL, reported in
> >> Confluence):
> >>
> >> Similar to “found by/where,” but “how bad”? These metrics help us
> >> understand the severity of the issues we encounter. As build quality
> >> improves, we would expect to see decreases in the severity of issues
> >> identified. A high rate of critical issues identified late in the release
> >> cycle would be cause for concern, though it may be expected at an earlier
> >> time.
> >>
> >> These could roughly be sourced from the “Priority” field in JIRA:
> >> – Trivial
> >> – Minor
> >> – Major
> >> – Critical
> >> – Blocker
> >>
> >> While “priority” doesn’t map directly to “severity,” it may be a useful
> >> proxy. Alternatively, we could introduce a label intended to represent
> >> severity if we’d like to make that distinction clear.
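> >>
> >> For instance, a rough JQL sketch for sourcing these counts from the existing
> >> “Priority” field (until/unless a dedicated severity field or label is
> >> introduced) might be:
> >>
> >>   project = CASSANDRA AND issuetype = Bug AND priority in (Critical, Blocker)
> >>   AND created >= "2018-09-01"
> >>
> >> Running one such filter per priority level, and recording the counts over
> >> time, would give the severity-proxy series described above.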
> >>
> >>
> >> Metric: Performance Tests
> >>
> >> Performance tests tell us “how fast” (and “how expensive”). There are many
> >> metrics we could capture here, and a variety of workloads they could be
> >> sourced from.
> >>
> >> I’ll refrain from proposing a particular methodology or reporting
> >> structure since many have thought about this. From a reporting perspective,
> >> I’m inspired by Mozilla’s “arewefastyet.com”, used to report the performance
> >> of their JavaScript engine relative to Chrome’s:
> >> https://arewefastyet.com/win10/overview
> >>
> >> Having this sort of feedback on a build-by-build basis would help us catch
> >> regressions, quantify improvements, and provide a baseline against 3.0 and
> >> 3.x.
> >>
> >>
> >> Metric: Code Coverage (/ other static analysis techniques)
> >>
> >> It may also be useful to publish metrics from CI on code coverage by
> >> package/class/method/branch. These might not be useful metrics for
> >> “quality” (the relationship between code coverage and quality is tenuous).
> >>
> >> However, it would be useful to quantify the trend over time between
> >> releases, and to source a “to-do” list for important but poorly-covered
> >> areas of the project.
> >>
> >>
> >> Others:
> >>
> >> There are more things we could measure. We won’t want to drown ourselves
> >> in metrics (or the work required to gather them) – but there are likely
> >> more not described here that could be useful to consider.
> >>
> >>
> >> Convergence Across Metrics:
> >>
> >> The thesis of this document is that improvements in each of these areas
> >> are correlated with increases in quality. Improvements across all areas are
> >> correlated with an increase in overall release quality. Tracking metrics
> >> like these provides the quantitative foundation for assessing progress,
> >> setting goals, and defining criteria. In that sense, they’re not an end –
> >> but a beginning.
> >>
>
>
