(Apologies if this double posts; my e-mail gateway seems to be having 
some trouble, so I'm trying this through the Google Groups interface.)

I've recently shared some choice words with several CAs over their Incident 
Reporting process, highlighting to them how their approach is seriously 
undermining trust in their CA and its operations. 

While https://wiki.mozilla.org/CA/Responding_To_An_Incident provides guidance 
on the minimum expectations for Incident Reports, and while it includes several 
examples of reports that are considered great responses, it seems there's 
still some confusion about the underlying principles of what makes a good 
incident report.

These principles are touched on in "Follow-up Actions", which was excellently 
drafted by Wayne and Kathleen, but I thought it might help to capture some of 
the defining characteristics of a good incident report.


1) A good incident report will acknowledge that there's an issue

While I originally thought about calling this "blameless", I think that might 
still trip some folks up. If an Incident happens, it means something's gone 
wrong. The Incident Report is not about trying to figure out who to blame for 
the incident.

For example, when the Incident is expressed as, say, "The Validation Specialist 
didn't validate the fields correctly", that's a form of blame. It makes it seem 
that the CA is trying to deflect the issue, pretending it was just a one-off 
human error, rather than trying to understand why the system placed such a 
huge dependency on humans.

However, this can also manifest in other ways. "The BRs weren't clear about 
this" is, in some ways, a form of blame and an attempt to deflect. I am 
certainly not trying to suggest the BRs are perfect, but when something like 
that comes up, it's worth asking what steps the CA has in place to review the 
BRs, to check interpretations, and to solicit feedback. It's also an area 
where, for example, better or more comprehensive documentation about what the 
CA does (e.g. in its CP/CPS) could have allowed the community, during the 
CP/CPS review or other engagement, to recognize that the BRs weren't clear and 
that the implemented result wasn't the intended result.

In essence, every Incident Report should help us learn how to make the Web PKI 
better. Dismissing things as one-offs, such as human error or confusion, gets 
in the way of understanding the systemic issues at play.


2) A good incident report will demonstrate that the CA understands the issue, 
while also providing sufficient information so that anyone else can understand 
the issue

A good incident report is going to be like a story. It's going to start with an 
opening, introducing the characters and their circumstances. There will be some 
conflict they encounter along the way, and hopefully, by the end of the story, 
the conflict will have been resolved and everyone lives happily ever after. But 
there's a big difference between reading a book jacket or review and reading 
the actual book - and a good incident report is going to read like a book, 
investing in the characters and their story.

Which is to say, a good incident report is going to have a lot more detail than 
just laying out who the actors are. This ties in very closely with the previous 
principle: a CA that blames it on human error doesn't seem to be acknowledging 
or understanding the incident, while a CA that shares the story of what a day 
in the life of a validation agent looks like, and all the places where things 
could go wrong or could have been automated or improved, really shows they 
"get" it: that being a validation agent is hard, and we should all do 
everything we can to make it easier for them to do their jobs.

This is the principle behind the template's questions about timelines and 
details: trying to express that the CA needs to share the story about what 
happened, when it happened, where things went wrong, and why, at many layers. A 
timeline that only captures when the failure happened is a bit like saying that 
the only thing that happens in "Lord of the Rings" is "Frodo gets rid of some 
old jewelry".


3) A good incident report will identify solutions that generalize for CAs

The point of incident reports is not to drag CAs or to call them out. It's to 
identify opportunities where, as an industry, we can and should be improving. A 
good incident report is going to look for and identify solutions that can and 
do generalize. While it's absolutely expected that the CA will fix the issue 
for themselves, it should also ask what can or should systemically be done.

This is the difference between saying "We'll be evaluating whether we use [data 
source X]" and saying "We'll be publishing our allowlist of data sources that 
we use". It's implementing linters, if that's something that can be done. It's 
about sharing the full details of what you're doing, so if other CAs wanted to 
(or were required to!) implement something similar, they could learn from the 
CA and the incident report about what works and what doesn't work.
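
To make the linting point a bit more concrete, here's a rough sketch of the 
kind of automated pre-issuance check a linter performs. It's purely 
illustrative - the file name, the checks, and the validity threshold are 
hypothetical stand-ins for this sketch, and a real CA would run something like 
zlint or certlint against the to-be-signed certificate - but it shows the 
shape of a systemic fix versus a one-off correction:

  # Illustrative sketch only: a toy pre-issuance lint, not a real linter.
  # A production CA would run zlint/certlint over every certificate it signs.
  from datetime import timedelta

  from cryptography import x509
  from cryptography.hazmat.primitives import hashes

  MAX_VALIDITY = timedelta(days=398)  # hypothetical policy limit for the sketch

  def lint(cert: x509.Certificate) -> list[str]:
      """Return a list of problems; an empty list means the toy checks passed."""
      problems = []
      if cert.not_valid_after - cert.not_valid_before > MAX_VALIDITY:
          problems.append("validity period exceeds policy maximum")
      try:
          cert.extensions.get_extension_for_class(x509.SubjectAlternativeName)
      except x509.ExtensionNotFound:
          problems.append("missing subjectAltName extension")
      if isinstance(cert.signature_hash_algorithm, (hashes.SHA1, hashes.MD5)):
          problems.append("weak signature hash algorithm")
      return problems

  with open("candidate.pem", "rb") as f:  # hypothetical file name
      findings = lint(x509.load_pem_x509_certificate(f.read()))
  for finding in findings:
      print("LINT FAILURE:", finding)

The specific checks don't matter; the point is that they run on every 
certificate, automatically, rather than depending on a person remembering to 
look for them.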


4) A *great* incident report will actually take the steps to generalize it for 
all CAs

This might mean starting discussions on m.d.s.p. about how to solve it via 
policy. It might mean proposing actual changes to the BRs or EVGs - as in, 
writing ballots, not just suggesting "someone" should do it. It's about 
investing the time and energy to make the ecosystem better, more transparent, 
more accountable, and more secure. It might even mean looking through CT for 
other CAs that have the issue, and reporting that as well!
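
As a rough illustration of that last point, the sketch below shows the kind of 
quick query against crt.sh's JSON interface that could be used to see which 
CAs have issued certificates matching a given identity. The domain and the 
grouping are hypothetical stand-ins - a real sweep would be scoped to the 
actual misissuance pattern - but it shows how little tooling it takes to check 
whether an issue is yours alone:

  # Illustrative sketch only: ask crt.sh for certificates matching an identity,
  # then group the results by issuer to see which CAs have issued them.
  import json
  import urllib.parse
  import urllib.request
  from collections import Counter

  query = "%.example.com"  # hypothetical identity related to the incident
  url = "https://crt.sh/?" + urllib.parse.urlencode({"q": query, "output": "json"})

  with urllib.request.urlopen(url) as response:
      entries = json.load(response)

  issuers = Counter(entry["issuer_name"] for entry in entries)
  for issuer, count in issuers.most_common():
      print(f"{count:6d}  {issuer}")

From there, pulling the certificates from other issuers and checking them for 
the same defect is exactly the sort of "generalize it for all CAs" work 
described above.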


The primary goal of Incident Reports is not score keeping. It's not about 
saying who has the most incidents. It's about understanding the challenges and 
actually working to improve them. All CAs are accountable for their actions, 
and so yes, it does mean that there may be multiple simultaneous incidents, 
from separate CAs, for the same issue. That's why understanding these 
principles is so important: we should be collaborating to build and systematize 
this knowledge.

If a CA keeps having issues, that's going to be a huge red flag. The best thing 
that CA can do, when finding they're repeatedly having issues, is to try to 
push the boundaries forward on Incident Reporting and the ecosystem. If their 
failures help make the Web better, that's a huge benefit to the ecosystem, and 
can significantly factor in how the incidents are evaluated. Yet if their 
incident reports do just the bare minimum - delayed, lacking 
information, argumentative, dismissive of issues, not building solutions but 
instead layering on workarounds for JUST the issue noticed - then they're 
dragging the whole ecosystem down, and creating a Wiki to track those sorts of 
systemic failures may be the right thing to do.

That's not meant as a threat; it's meant to make very clear that the most 
important thing about an Incident Report is not who had it, but how the Web PKI 
ecosystem improved as a result of it. If a CA learns from its mistakes, and 
helps us all improve - with concrete changes, reusable technology, clearer 
requirements - then I'd much rather have that Incident Report than none at all. 
As strange as it sounds, a good incident 
report _should_ be a competitive advantage, because it's a chance to show the 
CA can learn from, improve, and lead the Web PKI ecosystem.

If you read the example Incident Reports, you'll see they are great examples of 
just that. Sometimes they didn't hit everything right out of the gate, but by 
the end of the report, you'll find these principles are all at play. More great 
examples like that make a huge difference.