Re: Request for Input: CA Incident Reporting

'Aaron Gable' via CCADB Public Thu, 03 Aug 2023 16:49:30 -0700

Hi Clint,

I'm speaking here both as a member of the Let's Encrypt team (and I think
we write pretty good incident reports
<https://wiki.mozilla.org/CA/Responding_To_An_Incident#Examples_of_Good_Practice>),
and as someone with a decade of experience in incident-response roles,
including learning from the people who developed and refined Site
Reliability Engineering at Google.

Fundamentally, the "Incident Reports" that CAs file in Bugzilla are the
same as what might be called "Incident Postmortems" elsewhere. Quoting from The
SRE Book <https://sre.google/sre-book/postmortem-culture/>, postmortems are
"a written record of an incident, its impact, the actions taken to mitigate
or resolve it, the root cause(s), and the follow-up actions to prevent the
incident from recurring". And honestly, I think that the current set of
questions and requirements <https://www.ccadb.org/cas/incident-report> gets
pretty close to that mark: CAs are explicitly required to address the root
cause, provide a timeline, and commit to follow-up actions.

There are many resources available regarding how to write good incident
reports, and how to promote good "blameless postmortem" culture within an
organization. For example, the Google SRE Workbook has an example
<https://sre.google/workbook/postmortem-culture/#good-postmortem> of a
well-written postmortem; PagerDuty
<https://response.pagerduty.com/after/post_mortem_template/> and Google
<https://sre.google/sre-book/example-postmortem/> provide mostly-empty
postmortem templates; the Building Secure and Reliable Systems book has a
chapter on postmortems
<https://google.github.io/building-secure-and-reliable-systems/raw/ch18.html#postmortems>;
and much more. I particularly love this checklist
<https://docs.google.com/document/d/1iaEgF0ICSmKKLG3_BT5VnK80gfOenhhmxVnnUcNSQBE/edit>
of questions to make sure you address in a postmortem.

Refreshing my memory of all of these, I find two big differences between
the incident reports that we have here in the Web PKI, and those promoted
by all of these other resources.

1. Our incident reports do not have a "lessons learned" section. We focus
on what actions were taken, and what actions will be taken, but not on the
*why*s behind those actions. Sure, maybe a follow-up item is "add an alert
for circumstance X" but why is that the appropriate action to take in this
circumstance? I believe that this is an easy deficiency to remedy, and
have suggestions for doing so below.

2. We do not have a culture of blameless postmortems. This is much harder
to resolve. Even though a given report may avoid laying the blame at the
feet of any individual CA employee, it is difficult to remove the feeling
that *the CA* is to blame for the incident as a whole, and punishment
(removal from a trust store) may be meted out for too many incidents. I
cannot speak for other CAs, but here at Let's Encrypt we have carefully
cultivated a culture of blamelessness when it comes to our
incidents and postmortems... and even here, the act of writing a report to
be publicly posted on Bugzilla is nerve-wracking due to fear of criticism
and censure. I honestly don't know if there's anything we can do about
this. The nature of the WebPKI ecosystem and the asymmetric roles of root
programs and CAs are facts we just have to deal with. But at the very least
I think it is important to keep this dynamic in mind.

So with all that said, I do have a few concrete suggestions for how to
improve the incident report requirements.

1. Provide a template. Not just a list of (currently very verbose)
questions, but a verbatim template, with markdown formatting characters
(e.g. for headings) already included. This will make both writing and
reading incident reports significantly easier, and remove much ambiguity.
It will also have nice minor effects like establishing a standard format
for the timeline. I'm more than happy to contribute the template that Let's
Encrypt uses internally, and make changes / improvements to it based on my
other feedback here.

2. Require an executive summary. Many of the best-written incident reports
already include a summary at the top, because it provides just enough
context for the rest of the report to make sense to a new reader.

3. Remove the "how you first became aware" question. This should be built
into the timeline, not a question of its own. In my experience, this
question leads to the most repetition of content in the report.

4. Require that the timeline specifically call out the following events:
- Any policy, process, or software changes that contributed to the root
cause or the trigger
- The time at which the incident began
- The time at which the CA became aware of the incident
- The time at which the incident ended
- The time(s) at which issuance ceased and resumed, if relevant

5. Questions 4 (summary of affected certs) and 5 (full details) should be
revamped. Having these as separate questions back-to-back places undue
emphasis on the *external* impact of the incident, when what we care about
much more is the *internal* impact on the CA going forward (i.e. what
changes they're making to learn from and prevent similar incidents). The
summary should be moved to directly below the Executive Summary, and turned
into a more general "Impact" section -- how many certs, how many OCSP
responses, how many days, whatever statistic is relevant can be provided
here. The full details should be moved to the very bottom: the list of all
affected certificates is usually an attachment, so this section should be
an appendix.

6. Change question 6 to explicitly call for a root cause analysis. The
current phrasing ("how and why the mistakes were made") lends itself to a
blameful-postmortem culture. Instead, we should ask CAs to interrogate what
set of circumstances combined to allow the incident to arise, and then what
final trigger caused it to actually occur. This root cause / trigger
approach is espoused by most of the postmortem guides I linked above.

7. There should be one additional question for "lessons learned". The most
common three sub-headings here are "What went well", "What didn't go well",
and "Where we got lucky". The first is very valuable in a blameless
postmortem culture, because it allows the team to toot its own horn: be
proud of the fact that the impact was smaller than it would have been if
this other mitigation hadn't been in place, celebrate the fact that an
early warning detection system caught it, etc. The second and third
strongly inform the set of follow-up action items: everything that went
wrong should have an action item designed to make it go right next time,
and every lucky break should have an action item designed to turn that
luck into a guarantee.

8. The action items question should also ask for what *kind* of action each
is: does it *detect* a future occurrence, does it *prevent* a future
occurrence, or does it *mitigate* the effects of a future occurrence? CAs
should be encouraged (but not required) to include action items of all
three types, with an emphasis on prevention and mitigation.
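
To make suggestions 4 and 8 a bit more concrete, here is a minimal sketch of
how a report could be checked mechanically for the timeline events and the
action item kinds described above. All of the names here (EventKind,
ActionKind, missing_events, and so on) are hypothetical and invented for
illustration; nothing like this exists in the CCADB or Bugzilla today. It
also assumes that only "began", "became aware", and "ended" are strictly
required, since the other events are conditional.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class EventKind(Enum):
    # Hypothetical labels for the timeline events listed in suggestion 4.
    CONTRIBUTING_CHANGE = "contributing change"
    INCIDENT_BEGAN = "incident began"
    CA_AWARE = "CA became aware"
    INCIDENT_ENDED = "incident ended"
    ISSUANCE_CEASED = "issuance ceased"
    ISSUANCE_RESUMED = "issuance resumed"


class ActionKind(Enum):
    # The three kinds of action item from suggestion 8.
    DETECT = "detect"
    PREVENT = "prevent"
    MITIGATE = "mitigate"


# Assumption: these three apply to every incident; the rest are "if relevant".
REQUIRED_EVENTS = {
    EventKind.INCIDENT_BEGAN,
    EventKind.CA_AWARE,
    EventKind.INCIDENT_ENDED,
}


@dataclass(frozen=True)
class TimelineEvent:
    when: datetime
    kind: EventKind
    description: str


@dataclass(frozen=True)
class ActionItem:
    description: str
    kind: ActionKind


def missing_events(timeline):
    """Return the required event kinds that the timeline never calls out."""
    present = {event.kind for event in timeline}
    return REQUIRED_EVENTS - present


def missing_action_kinds(action_items):
    """Return the kinds (detect/prevent/mitigate) that have no action item."""
    present = {item.kind for item in action_items}
    return set(ActionKind) - present
```

Since action items of all three kinds are encouraged but not required, a
non-empty result from missing_action_kinds would be a prompt for the author
rather than grounds for rejecting the report.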

Okay, that ended up being more than a few. I also put together a rough draft
of my suggested template
<https://gist.github.com/aarongable/78167fc1464b6a8a0a7065112ac195e9> for
people to look at and critique and improve.

Finally, I have one last suggestion for how the incident reporting process
could be improved outside of the contents of the report itself.

1. It would be great to automate the process of setting "Next-Update" dates
on tickets. I feel like I've had several instances where I requested a
Next-Update date four or five weeks in the future, but then didn't get
confirmation that this would be okay until just hours before I would have
needed to post a weekly update. If this process could be flipped -- the
Next-Update date gets set automatically based on the Action Items, and
weekly updates are only necessary if a root program manager explicitly
unsets it and requests more frequent updates -- that would certainly
streamline the process a bit.
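
As a rough sketch of what "set automatically based on the Action Items"
could mean: the rule below is only one possible reading, and it is my
assumption rather than anything root programs have agreed to. It picks the
earliest due date among the still-open action items, and falls back to a
weekly cadence when a root program manager has asked for more frequent
updates or when no open items remain. The names (ActionItem, next_update,
weekly_requested) are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

WEEKLY = timedelta(days=7)


@dataclass(frozen=True)
class ActionItem:
    description: str
    due: date
    done: bool = False


def next_update(action_items, today, weekly_requested=False):
    """Suggest a Next-Update date for an incident ticket.

    Assumed rule: the earliest due date among open action items, unless a
    root program manager has explicitly requested weekly updates.
    """
    if weekly_requested:
        return today + WEEKLY
    open_due = [item.due for item in action_items if not item.done]
    if not open_due:
        # Assumption: with nothing outstanding, keep the weekly default.
        return today + WEEKLY
    # An overdue item means an update is owed now, not in the past.
    return max(min(open_due), today)
```

With open items due on 2023-09-15 and 2023-10-01, a ticket evaluated on
2023-08-03 would get a Next-Update of 2023-09-15 without anyone having to
request or confirm it.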

Apologies for the length of this email. I hope that this is helpful, and
gives people a good jumping-off point for further discussion of how these
incident reports should be formatted and what information they should
contain to be maximally useful to the community.

Thanks,
Aaron

On Tue, Aug 1, 2023 at 7:23 AM 'Clint Wilson' via CCADB Public <
[email protected]> wrote:

> Hi all,
>
> If you have feedback on this topic, we would love to hear your thoughts.
>
> Thank you!
> -Clint
>
> On Jul 20, 2023, at 8:19 AM, 'Clint Wilson' via CCADB Public <
> [email protected]> wrote:
>
> All,
>
> During the CA/Browser Forum Face-to-Face 59 meeting, several Root Store
> Programs expressed an interest in improving Web PKI incident reporting.
>
> The CCADB Steering Committee is interested in this community’s
> recommendations on improving the standards applicable to and the overall
> quality of incident reports submitted by Certification Authority (CA)
> Owners. We aim to facilitate effective collaboration, foster transparency,
> and promote the sharing of best practices and lessons learned among CAs and
> the broader community.
>
> Currently, some Root Store Programs require incident reports from CA
> Owners to address a list of items in a format detailed on ccadb.org [1].
> While the CCADB format provides a framework for reporting, we would like to
> discuss ideas on how to improve the quality and usefulness of these reports.
>
> We would like to make incident reports more useful and effective where
> they:
>
>
>    - Are consistent in quality, transparency, and format.
>    - Demonstrate thoroughness and depth of investigation and incident
>    analysis, including for variants.
>    - Clearly identify the true root cause(s) while avoiding restating the
>    issue.
>    - Provide sufficient detail that enables other CA Owners or members of
>    the public to comprehend and, where relevant, implement an equivalent
>    solution.
>    - Present a complete timeline of the incident, including the
>    introduction of the root cause(s).
>    - Include specific, actionable, and timebound steps for resolving the
>    issue(s) that contributed to the root cause(s).
>    - Are frequently updated when new information is found and steps for
>    resolution are completed, delayed, or changed.
>    - Allow a reader to quickly understand what happened, the scope of the
>    impact, and how the remediation will sufficiently prevent the root cause of
>    the incident from reoccurring.
>
>
> We appreciate, to state it lightly, members of this community and the
> general public who generate and review reports, offer their understanding
> of the situation and impact, and ask clarifying questions.
>
> Call to action: In the spirit of continuous improvement, we are
> requesting (and very much appreciate) this community’s suggestions for how
> CA incident reporting can be improved.
>
> Not every suggestion will be implemented, but we will commit to reviewing
> all suggestions and collectively working towards an improved standard.
>
> Thank you
> -Clint, on behalf of the CCADB Steering Committee
>
> [1] https://www.ccadb.org/cas/incident-report
>
> --
> You received this message because you are subscribed to the Google Groups
> "CCADB Public" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/a/ccadb.org/d/msgid/public/3B253FFF-4070-4F0E-95D2-166FAC01C5A7%40apple.com
> <https://groups.google.com/a/ccadb.org/d/msgid/public/3B253FFF-4070-4F0E-95D2-166FAC01C5A7%40apple.com?utm_medium=email&utm_source=footer>
> .
>

-- 
You received this message because you are subscribed to the Google Groups 
"CCADB Public" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/a/ccadb.org/d/msgid/public/CAEmnErfG_7keQfLRxrWXiiBd%3DvswgFBGpuG5ep_fm8rbfPFrWg%40mail.gmail.com.
