On Wednesday, August 21, 2019 at 3:43:21 PM UTC-4, Ryan Sleevi wrote:
> (Apologies if this triple or quadruple posts. There appear to be some
> hiccups somewhere along the line between my mail server and the m.d.s.p.
> mail server and the Google Groups reflector)
>
> I've recently shared some choice words with several CAs over their Incident
> Reporting process, highlighting to them how their approach is seriously
> undermining trust in their CA and its operations.
>
> While https://wiki.mozilla.org/CA/Responding_To_An_Incident provides
> guidance on the minimum expectations for Incident Reports, and while it
> includes several examples of reports that are considered great responses,
> it seems there's still some confusion about the underlying principles of
> what makes a good incident report.
>
> These principles are touched on in "Follow-up Actions", which was
> excellently drafted by Wayne and Kathleen, but I thought it might help to
> capture some of the defining characteristics of a good incident report.
>
> 1) A good incident report will acknowledge that there's an issue
>
> While I originally thought about calling this "blameless", I think that
> might still trip some folks up. If an Incident happens, it means
> something's gone wrong. The Incident Report is not about trying to figure
> out who to blame for the incident.
>
> For example, when the Incident is expressed as, say, "The Validation
> Specialist didn't validate the fields correctly", that's a form of blame.
> It makes it seem that the CA is trying to deflect the issue, pretending it
> was just a one-off human error rather than trying to understand why the
> system placed such a huge dependency on humans.
>
> However, this can also manifest in other ways. "The BRs weren't clear about
> this" is, in some ways, a form of blame and an attempt to deflect. I am
> certainly not trying to suggest the BRs are perfect, but when something
> like that comes up, it's worth asking what steps the CA has in place to
> review the BRs, to check interpretations, and to solicit feedback. It's
> also an area where, for example, better or more comprehensive documentation
> about what the CA does (e.g. in its CP/CPS) could have allowed the
> community to recognize, during the CP/CPS review or other engagement, that
> the BRs weren't clear and that the implemented result wasn't the intended
> result.
>
> In essence, every Incident Report should help us learn how to make the Web
> PKI better. Dismissing things as one-offs, such as human error or
> confusion, gets in the way of understanding the systemic issues at play.
>
> 2) A good incident report will demonstrate that the CA understands the
> issue, while also providing sufficient information so that anyone else can
> understand the issue
>
> A good incident report is going to be like a story. It's going to start
> with an opening, introducing the characters and their circumstances. There
> will be some conflict they encounter along the way, and hopefully, by the
> end of the story, the conflict will have been resolved and everyone lives
> happily ever after. But there's a big difference between reading a book
> jacket or review and reading the actual book - and a good incident report
> is going to read like a book, investing in the characters and their story.
>
> Which is to say, a good incident report is going to have a lot more detail
> than just laying out who the actors are. This plays very closely with the
> previous principle; a CA that blames it on human error is not one that
> seems like they're acknowledging or understanding the incident, while a CA
> that shares the story of what a day in the life of a validation agent looks
> like, and all the places where things could go wrong or could have been
> automated or improved, really shows they "get" it: that being a validation
> agent is hard, and we should all do everything we can to make it easier for
> them to do their jobs.
>
> This is the principle behind the template's questions about timelines and
> details: trying to express that the CA needs to share the story about what
> happened, when it happened, where things went wrong, and why, at many
> layers. A timeline that only captures when the failure happened is a bit
> like saying that the only thing that happens in "Lord of the Rings" is
> "Frodo gets rid of some old jewelry"
>
> 3) A good incident report will identify solutions that generalize across CAs
>
> The point of incident reports is not to drag CAs or to call them out. It's
> to identify opportunities where we, as an industry, can and should be
> improving. A good incident report is going to look for and identify
> solutions that can and do generalize. While it's absolutely expected that
> the CA will fix the issue for itself, a good report also asks what can or
> should be done systemically.
>
> This is the difference between saying "We'll be evaluating whether we use
> [data source X]" and saying "We'll be publishing our allowlist of data
> sources".