Hi everyone,

Now that many teams have begun testing and validating Apache Cassandra 4.0, 
it’s useful to think about what “progress” looks like. While metrics alone may 
not tell us what “done” means, they do help us answer the question, “are we 
getting better or worse – and how quickly?”

A friend described to me a few attributes of metrics he considered useful, 
suggesting that good metrics are actionable, visible, predictive, and 
consequent:

– Actionable: We know what to do based on them – where to invest, what to fix, 
what’s fine, etc.
– Visible: Everyone who has a stake in a metric has full visibility into it and 
participates in its definition.
– Predictive: Good metrics enable forecasting of outcomes – e.g., “consistent 
performance test results against build abc predict an x% reduction in 99%ile 
read latency for this workload in prod.”
– Consequent: We take actions based on them (e.g., not shipping if tests are 
failing).

Here are some notes in Confluence on metrics that may be useful to track, 
beginning in this phase of the development + release cycle. I’m interested in 
your thoughts on these. They’re also copied inline for easier reading in your 
mail client.

Link: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=93324430

Cheers,

– Scott

––––––

Measuring Release Quality:

[ This document is a draft + sketch of ideas. It is located in the “discussion” 
section of this wiki to indicate that it is an active draft – not a document 
that has been voted on, achieved consensus, or become in any way official. ]

Introduction:

This document outlines a series of metrics that may be useful for measuring 
release quality and quantifying progress during the testing / validation phase 
of the Apache Cassandra 4.0 release cycle.

The goal of this document is to think through what we should consider measuring 
to quantify our progress testing and validating Apache Cassandra 4.0. This 
document explicitly does not discuss release criteria – though metrics may be a 
useful input to a discussion on that topic.


Metric: Build / Test Health (produced via CI, recorded in Confluence):

Bread-and-butter metrics intended to capture baseline build health and 
flakiness in the test suite, presented as a time series to understand how 
they’ve changed from build to build and release to release:

Metrics:

– Pass / fail metrics for unit tests
– Pass / fail metrics for dtests
– Flakiness stats for unit and dtests
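
As a rough sketch of the flakiness stat (the input format and function below 
are hypothetical, not an existing CI artifact): if each CI run recorded a 
pass/fail outcome per test, a test could be counted as flaky when it shows 
mixed outcomes across repeated runs of the same build.

  from collections import defaultdict

  def flaky_tests(runs):
      """runs: one dict per CI run of the same build, mapping
      test name -> True (pass) / False (fail)."""
      outcomes = defaultdict(list)
      for run in runs:
          for name, passed in run.items():
              outcomes[name].append(passed)
      # A test is flaky if it both passed and failed across the runs; its
      # flakiness rate is the fraction of runs in which it failed.
      return {name: results.count(False) / len(results)
              for name, results in outcomes.items()
              if any(results) and not all(results)}

  # Example: three runs of the same build.
  runs = [{"testCompaction": True,  "testRepair": True},
          {"testCompaction": False, "testRepair": True},
          {"testCompaction": True,  "testRepair": True}]
  print(flaky_tests(runs))  # {'testCompaction': 0.333...}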


Metric: “Found Bug” Count by Methodology (sourced via JQL, reported in 
Confluence):

These are intended to help us understand the efficacy of each methodology being 
applied. We might consider annotating bugs found in JIRA with the methodology 
that produced them. This could be consumed as input to a JQL query (a rough 
sketch follows the list below) and reported on the Confluence dev wiki.

As we reach a Pareto-optimal level of investment in a methodology, we’d expect 
to see its found-bug rate taper. As we achieve higher quality across the board, 
we’d expect to see a tapering in found-bug counts across all methodologies. If 
one or two approaches are outliers, that could indicate the utility of doubling 
down on a particular form of testing.

We might consider reporting “Found By” counts for methodologies such as:

– Property-based / fuzz testing
– Replay testing
– Upgrade / Diff testing
– Performance testing
– Shadow traffic
– Unit/dtest coverage of new areas
– Source audit
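
As an illustration only (neither the labels nor the exact JQL are settled), if 
found bugs carried a label per methodology – hypothetical names like 
“found-by-fuzzing” below – the per-methodology counts could be pulled from 
JIRA’s REST search endpoint with maxResults=0, which returns just the total:

  import requests

  SEARCH = "https://issues.apache.org/jira/rest/api/2/search"
  LABELS = ["found-by-fuzzing", "found-by-replay",      # hypothetical label
            "found-by-upgrade-diff", "found-by-perf"]   # names, for illustration

  def found_bug_count(label):
      # Illustrative JQL; the fixVersion and label conventions would need to
      # be agreed on before reporting anything like this on the wiki.
      jql = ('project = CASSANDRA AND issuetype = Bug '
             'AND fixVersion = "4.0" AND labels = "{}"'.format(label))
      resp = requests.get(SEARCH, params={"jql": jql, "maxResults": 0})
      resp.raise_for_status()
      return resp.json()["total"]

  for label in LABELS:
      print(label, found_bug_count(label))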


Metric: “Found Bug” Count by Subsystem/Component (sourced via JQL, reported in 
Confluence):

Similar to “found by,” but “found where.” These metrics help us understand 
which components or subsystems of the database we’re finding issues in. In the 
event that a particular area stands out as “hot,” we’ll have the quantitative 
feedback we need to support investment there. Tracking these counts over time – 
and their first derivative, the rate – also helps us make statements regarding 
progress in various subsystems. Though we can’t prove a negative (“no bugs have 
been found, therefore there are no bugs”), we gain confidence as the rate of 
found bugs decreases relative to the effort we’re putting in.

We might consider reporting “Found In” counts for components as enumerated in 
JIRA, such as:
– Auth
– Build
– Compaction
– Compression
– Core
– CQL
– Distributed Metadata
– …and so on.
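
The same kind of query sketched above for methodologies would cover this by 
swapping the label clause for a component clause (and, for the severity metric 
below, a priority clause). For illustration only:

  # Illustrative JQL strings; these would feed the same maxResults=0 search
  # call shown in the methodology sketch above.
  components = ["Auth", "Build", "Compaction", "Compression", "Core"]
  jql_by_component = {
      c: ('project = CASSANDRA AND issuetype = Bug '
          'AND fixVersion = "4.0" AND component = "{}"'.format(c))
      for c in components
  }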


Metric: “Found Bug” Count by Severity (sourced via JQL, reported in Confluence)

Similar to “found by/where,” but “how bad”? These metrics help us understand 
the severity of the issues we encounter. As build quality improves, we would 
expect to see decreases in the severity of issues identified. A high rate of 
critical issues identified late in the release cycle would be cause for 
concern, though it may be expected earlier in the cycle.

These could roughly be sourced from the “Priority” field in JIRA:
– Trivial
– Minor
– Major
– Critical
– Blocker

While “priority” doesn’t map directly to “severity,” it may be a useful proxy. 
Alternatively, we could introduce a label intended to represent severity if 
we’d like to make that distinction explicit.


Metric: Performance Tests

Performance tests tell us “how fast” (and “how expensive”). There are many 
metrics we could capture here, and a variety of workloads they could be sourced 
from.

I’ll refrain from proposing a particular methodology or reporting structure 
since many have thought about this. From a reporting perspective, I’m inspired 
by Mozilla’s “arewefastyet.com”, used to report the performance of their 
JavaScript engine relative to Chrome’s: 
https://arewefastyet.com/win10/overview

Having this sort of feedback on a build-by-build basis would help us catch 
regressions, quantify improvements, and provide a comparison against 3.0 and 
3.x baselines.
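
Without proposing a methodology, here is a minimal sketch of what build-by-build 
regression checking could look like, assuming (purely hypothetically) that each 
performance run leaves behind a JSON file of latency percentiles per workload, 
e.g. {"read": {"p50": ..., "p99": ...}}:

  import json

  def p99_regressions(baseline_path, candidate_path, threshold=0.05):
      """Return workloads whose p99 latency worsened by more than `threshold`
      (default 5%) between a baseline build and a candidate build."""
      with open(baseline_path) as f:
          baseline = json.load(f)
      with open(candidate_path) as f:
          candidate = json.load(f)
      flagged = {}
      for workload, base in baseline.items():
          cand = candidate.get(workload)
          if cand is None:
              continue
          change = (cand["p99"] - base["p99"]) / base["p99"]
          if change > threshold:
              flagged[workload] = change
      return flagged

  # e.g. p99_regressions("results/3.11.4.json", "results/4.0-alpha1.json")
  # (paths are placeholders)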


Metric: Code Coverage (/ other static analysis techniques)

It may also be useful to publish code coverage metrics from CI by 
package/class/method/branch. On their own, these may not be useful measures of 
“quality” (the relationship between code coverage and quality is tenuous).

However, it would be useful to quantify the trend over time between releases, 
and to source a “to-do” list for important but poorly-covered areas of the 
project.
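
As one possible way to turn coverage output into that “to-do” list (assuming a 
JaCoCo-style XML report; the report path and threshold are placeholders):

  import xml.etree.ElementTree as ET

  def poorly_covered_packages(report_path, threshold=0.5):
      """List packages whose line coverage falls below `threshold`,
      sorted worst-first, from a JaCoCo-style XML report."""
      tree = ET.parse(report_path)
      low = []
      for package in tree.getroot().iter("package"):
          # Package-level counters are direct children of <package>.
          for counter in package.findall("counter"):
              if counter.get("type") == "LINE":
                  covered = int(counter.get("covered"))
                  total = covered + int(counter.get("missed"))
                  if total and covered / total < threshold:
                      low.append((package.get("name"), covered / total))
      return sorted(low, key=lambda item: item[1])

  # e.g. poorly_covered_packages("build/jacoco/jacoco.xml")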


Others:

There are more things we could measure. We won’t want to drown ourselves in 
metrics (or the work required to gather them) – but there are likely others not 
described here that would be worth considering.


Convergence Across Metrics:

The thesis of this document is that improvements in each of these areas are 
correlated with increases in quality. Improvements across all areas are 
correlated with an increase in overall release quality. Tracking metrics like 
these provides the quantitative foundation for assessing progress, setting 
goals, and defining criteria. In that sense, they’re not an end – but a 
beginning.
