On Wednesday, August 21, 2019 at 3:43:21 PM UTC-4, Ryan Sleevi wrote:
> (Apologies if this triple or quadruple posts. There appears to be some
> hiccups somewhere along the line between my mail server and the m.d.s.p.
> mail server and the Google Groups reflector)
> 
> I've recently shared some choice words with several CAs over their Incident
> Reporting process, highlighting to them how their approach is seriously
> undermining trust in their CA and its operations.
> 
> While https://wiki.mozilla.org/CA/Responding_To_An_Incident provides
> guidance on the minimum expectations for Incident Reports, and includes
> several examples of reports that are considered great responses, it seems
> there's still some confusion about the underlying principles of what makes
> a good incident report.
> 
> These principles are touched on in "Follow-up Actions", which was
> excellently drafted by Wayne and Kathleen, but I thought it might help to
> capture some of the defining characteristics of a good incident report.
> 
> 1) A good incident report will acknowledge that there's an issue
> 
> While I originally thought about calling this "blameless", I think that
> might still trip some folks up. If an Incident happens, it means
> something's gone wrong. The Incident Report is not about trying to figure
> out who to blame for the incident.
> 
> For example, when the Incident is expressed as, say, "The Validation
> Specialist didn't validate the fields correctly", that's a form of blame.
> It makes it seem that the CA is trying to deflect the issue, pretending it
> was just a one-off human error rather than trying to understand why the
> system placed such a heavy dependency on humans.
> 
> However, this can also manifest in other ways. "The BRs weren't clear about
> this" is, in some ways, a form of blame and an attempt to deflect. I am
> certainly not trying to suggest the BRs are perfect, but when something
> like that comes up, it's worth asking what steps the CA has in place to
> review the BRs, to check interpretations, and to solicit feedback. It's also
> an area where, for example, better or more comprehensive documentation about
> what the CA does (e.g. in its CP/CPS) could have allowed the community,
> during the CP/CPS review or other engagement, to recognize that the BRs
> weren't clear and that the implemented result wasn't the intended result.
> 
> In essence, every Incident Report should help us learn how to make the Web
> PKI better. Dismissing things as one-offs, such as human error or
> confusion, gets in the way of understanding the systemic issues at play.
> 
> 2) A good incident report will demonstrate that the CA understands the
> issue, while also providing sufficient information so that anyone else can
> understand the issue
> 
> A good incident report is going to be like a story. It's going to start
> with an opening, introducing the characters and their circumstances. There
> will be some conflict they encounter along the way, and hopefully, by the
> end of the story, the conflict will have been resolved and everyone lives
> happily ever after. But there's a big difference between reading a book
> jacket or review and reading the actual book - and a good incident report
> is going to read like a book, investing in the characters and their story.
> 
> Which is to say, a good incident report is going to have a lot more detail
> than just identifying who the actors are. This ties very closely to the
> previous principle: a CA that blames an incident on human error does not
> seem to be acknowledging or understanding the incident, while a CA that
> shares the story of what a day in the life of a validation agent looks
> like, and all the places where things could go wrong or could have been
> automated or improved, really shows they "get" it: that being a validation
> agent is hard, and we should all do everything we can to make it easier for
> them to do their jobs.
> 
> This is the principle behind the template's questions about timelines and
> details: trying to express that the CA needs to share the story about what
> happened, when it happened, where things went wrong, and why, at many
> layers. A timeline that only captures when the failure happened is a bit
> like saying that the only thing that happens in "Lord of the Rings" is
> "Frodo gets rid of some old jewelry"
> 
> 3) A good incident report will identify solutions that generalize for CAs
> 
> The point of incident reports is not to drag CAs or to call them out. It's
> to identify opportunities, as an industry, that we can and should be
> improving. A good incident report is going to look for and identify
> solutions that can and do generalize. While it's absolutely expected that
> the CA will fix the issue for themselves, a good report also asks what can
> or should be done systemically.
> 
> This is the difference between saying "We'll be evaluating whether we use
> [data source X]" and saying "We'll be publishing our allowlist of data
> sources that we use". It's implementing linters, if that's something that
> can be done. It's about sharing the full details of what you're doing, so
> if other CAs wanted to (or were required to!) implement something similar,
> they could learn from the CA and the incident report about what works and
> what doesn't work.
> 
> 4) A *great* incident report will actually take the steps to generalize it
> for all CAs.
> 
> This might mean starting discussions on m.d.s.p. about how to solve it via
> policy. It might mean proposing actual changes to the BRs or EVGs - as in,
> writing ballots, not just suggesting "someone" should do it. It's about
> investing the time and energy to make the ecosystem better, more
> transparent, more accountable, and more secure. It might even mean looking
> through CT for other CAs that have the issue, and reporting that as well!
> 
> 
> The primary goal of Incident Reports is not scorekeeping. It's not
> about saying who has the most incidents. It's about understanding the
> challenges and actually working to improve them. All CAs are accountable
> for their actions, and so yes, it does mean that there may be multiple
> simultaneous incidents, from separate CAs, for the same issue. That's why
> understanding these principles is so important: we should be collaborating
> to build and systematize this knowledge.
> 
> If a CA keeps having issues, that's going to be a huge red flag. The best
> thing that CA can do, when finding they're repeatedly having issues, is to
> try to push the boundaries forward on Incident Reporting and the ecosystem.
> If their failures help make the Web better, that's a huge benefit to the
> ecosystem, and can significantly factor into how the incidents are evaluated.
> Yet if their incident reports are just scraping the bare minimum -
> delayed, lacking information, argumentative, dismissive of issues, not
> building solutions but instead layering on workarounds for JUST the issue
> noticed - then they're dragging the whole ecosystem down, and creating a
> Wiki to track those sorts of systemic failures may be the right thing to do.
> 
> That's not meant to be a threat, but to make it very clear
> that the most important thing about the Incident Report is not who had it,
> but how the Web PKI ecosystem improved as a result of it. If a CA learns
> from its mistakes, and helps us all improve - with concrete changes,
> reusable technology, clearer requirements - then I'd much rather have that
> as an Incident Report than have none at all. As strange as it sounds, a
> good incident report _should_ be a competitive advantage, because it's a
> chance to show the CA can learn from, improve, and lead the Web PKI
> ecosystem.
> 
> If you read the example Incident Reports, you'll see they are great examples
> of just that. Sometimes they didn't hit everything right out of the gate, but
> by the end of the report, you'll find these principles are all at play. More great
> examples like that make a huge difference.

Thank you for posting this, as it re-emphasizes the need for CAs to take ALL 
incident reports seriously, no matter what they believe to be the severity of 
the case. Even a typing error can be indicative of something more systemic, 
hence it is incumbent upon the CA to do a full investigation and post the 
report here. 

Blame isn't the reason for the report; rather, it's to assist the community in 
determining whether they may have similar issues that they should look into 
before they also become the victim of a similar incident. These reports serve 
to assist everyone, and we have seen Mozilla adopt a "continuous improvement" 
process by using this feedback to update their own policies. 

I love the "story" analogy. It doesn't have to make the NY Times Best Seller 
list, but let's strive to make each report as thorough and informative as 
possible.