Re: [EXTERNAL] Re: Request for Input: CA Incident Reporting

'Paul van Brouwershaven' via CCADB Public Mon, 07 Aug 2023 00:36:51 -0700

Thanks for your contributions to this, Aaron; this is very valuable input!

Should we consider adding a topic in the action items regarding the 
effectiveness of requirements (such as those of the root programs, the 
CA/Browser Forum, and ETSI) in averting incidents like this? Are there 
oversights, potential areas for improved clarity in language, or additional 
requirements that warrant consideration?


While CAs are required to monitor Bugzilla incidents, ensuring that the 
requirements are clear and conclusive could help prevent similar incidents 
within the ecosystem, given that past incidents may not be readily apparent to 
new CAs or staff and it is hard to look back over all historic incidents.

Paul


________________________________
From: 'Aaron Gable' via CCADB Public <[email protected]>
Sent: Friday, August 4, 2023 20:33
To: [email protected] <[email protected]>
Subject: [EXTERNAL] Re: Request for Input: CA Incident Reporting

________________________________
Apologies for double-posting, but I just wanted to let folks know that I've 
updated my gist 
<https://gist.github.com/aarongable/78167fc1464b6a8a0a7065112ac195e9> to be a 
full rewrite of the incident reporting requirements page 
<https://www.ccadb.org/cas/incident-report>. 
It includes most of the existing verbiage about the purpose, filing timeline, 
and update requirements of reports, and preserves the Audit Incident Report 
section as-is. It then overhauls the Incident Report section to include a 
template and explicit instructions for filling out that template. I don't know 
if this is actually useful to the CCADB Steering Committee, but it seemed like 
the most succinct way to get all my thoughts in one place.

Thanks again,
Aaron

On Thu, Aug 3, 2023 at 4:49 PM Aaron Gable <[email protected]> wrote:
Hi Clint,

I'm speaking here both as a member of the Let's Encrypt team (and I think we 
write pretty good incident reports 
<https://wiki.mozilla.org/CA/Responding_To_An_Incident#Examples_of_Good_Practice>), 
and as someone with a decade of experience in incident-response roles, 
including learning from the people who developed and refined Site Reliability 
Engineering at Google.

Fundamentally, the "Incident Reports" that CAs file in Bugzilla are the same as 
what might be called "Incident Postmortems" elsewhere. Quoting from The SRE 
Book <https://sre.google/sre-book/postmortem-culture/>, postmortems are "a 
written record of an incident, its impact, the actions taken to mitigate or 
resolve it, the root cause(s), and the follow-up actions to prevent the 
incident from recurring". And honestly, I think that the current set of 
questions and requirements <https://www.ccadb.org/cas/incident-report> gets 
pretty close to that mark: CAs are explicitly required to address the root 
cause, provide a timeline, and commit to follow-up actions.

There are many resources available regarding how to write good incident 
reports, and how to promote good "blameless postmortem" culture within an 
organization. For example, the Google SRE Workbook has an example 
<https://sre.google/workbook/postmortem-culture/#good-postmortem> of a 
well-written postmortem; PagerDuty 
<https://response.pagerduty.com/after/post_mortem_template/> and Google 
<https://sre.google/sre-book/example-postmortem/> provide mostly-empty 
postmortem templates; the Building Secure and Reliable Systems book has a 
chapter on postmortems 
<https://google.github.io/building-secure-and-reliable-systems/raw/ch18.html#postmortems>; 
and much more. I particularly love this checklist 
<https://docs.google.com/document/d/1iaEgF0ICSmKKLG3_BT5VnK80gfOenhhmxVnnUcNSQBE/edit> 
of questions to make sure you address in a postmortem.

Refreshing my memory of all of these, I find two big differences between the 
incident reports that we have here in the Web PKI, and those promoted by all of 
these other resources.

1. Our incident reports do not have a "lessons learned" section. We focus on 
what actions were taken, and what actions will be taken, but not on the whys 
behind those actions. Sure, maybe a follow-up item is "add an alert for 
circumstance X" but why is that the appropriate action to take in this 
circumstance? I believe that this is an easy deficiency to remedy, and have 
suggestions for doing so below.

2. We do not have a culture of blameless postmortems. This is much harder to 
resolve. Even though a given report may avoid laying the blame at the feet of 
any individual CA employee, it is difficult to remove the feeling that the CA 
is to blame for the incident as a whole, and punishment (removal from a trust 
store) may be meted out for too many incidents. I cannot speak 
for other CAs, but here at Let's Encrypt we have carefully cultivated a culture 
of blamelessness when it comes to our incidents and postmortems... and even 
here, the act of writing a report to be publicly posted on Bugzilla is 
nerve-wracking due to fear of criticism and censure. I honestly don't know if 
there's anything we can do about this. The nature of the WebPKI ecosystem and 
the asymmetric roles of root programs and CAs are facts we just have to deal 
with. But at the very least I think it is important to keep this dynamic in 
mind.

So with all that said, I do have a few concrete suggestions for how to improve 
the incident report requirements.

1. Provide a template. Not just a list of (currently very verbose) questions, 
but a verbatim template, with markdown formatting characters (e.g. for 
headings) already included. This will make both writing and reading incident 
reports significantly easier, and remove much ambiguity. It will also have nice 
minor effects like establishing a standard format for the timeline. I'm more 
than happy to contribute the template that Let's Encrypt uses internally, and 
make changes / improvements to it based on my other feedback here.

2. Require an executive summary. Many of the best-written incident reports 
already include a summary at the top, because it provides just enough context 
for the rest of the report to make sense to a new reader.

3. Remove the "how you first became aware" question. This should be built into 
the timeline, not a question of its own. In my experience, this question leads 
to the most repetition of content in the report.

4. Require that the timeline specifically call out the following events:
- Any policy, process, or software changes that contributed to the root cause 
or the trigger
- The time at which the incident began
- The time at which the CA became aware of the incident
- The time at which the incident ended
- The time(s) at which issuance ceased and resumed, if relevant

5. Questions 4 (summary of affected certs) and 5 (full details) should be 
revamped. Having these as separate questions back-to-back places undue emphasis 
on the external impact of the incident, when what we care about much more is 
the internal impact on the CA going forward (i.e. what changes they're making 
to learn from and prevent similar incidents). The summary should be moved to 
directly below the Executive Summary, and turned into a more general "Impact" 
section -- how many certs, how many OCSP responses, how many days, whatever 
statistic is relevant can be provided here. The full details should be moved to 
the very bottom: the list of all affected certificates is usually an 
attachment, so this section should be an appendix.

6. Change question 6 to explicitly call for a root cause analysis. The current 
phrasing ("how and why the mistakes were made") lends itself to a 
blameful-postmortem culture. Instead, we should ask CAs to interrogate what set 
of circumstances combined to allow the incident to arise, and then what final 
trigger caused it to actually occur. This root cause / trigger approach is 
espoused by most of the postmortem guides I linked above.

7. There should be one additional question for "lessons learned". The most 
common three sub-headings here are "What went well", "What didn't go well", and 
"Where we got lucky". The first is very valuable in a blameless postmortem 
culture, because it allows the team to toot its own horn: be proud of the fact 
that the impact was smaller than it would have been if this other mitigation 
hadn't been in place, celebrate the fact that an early warning detection system 
caught it, etc. The second and third strongly inform the set of follow-up 
action items: everything that went wrong should have an action item designed to 
make it go right next time, and every lucky break should have an action item 
designed to turn that luck into a guarantee.

8. The action items question should also ask for what kind of action each is: 
does it detect a future occurrence, does it prevent a future occurrence, or 
does it mitigate the effects of a future occurrence? CAs should be encouraged 
(but not required) to include action items of all three types, with an emphasis 
on prevention and mitigation.
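
As a rough illustration of suggestion 8, the following Python sketch tags each 
action item with one of the three kinds and reports which kinds a report does 
not yet cover. The names ActionKind, ActionItem, and missing_kinds are 
hypothetical, and the "pre-issuance lint check" item is an invented example; 
none of this comes from the CCADB requirements or from any existing tooling.

```python
from dataclasses import dataclass
from enum import Enum


class ActionKind(Enum):
    """The three kinds of action item named in suggestion 8."""
    DETECT = "detect"
    PREVENT = "prevent"
    MITIGATE = "mitigate"


@dataclass(frozen=True)
class ActionItem:
    # Hypothetical shape: a free-text description plus its kind.
    description: str
    kind: ActionKind


def missing_kinds(items):
    """Return the action kinds that no item in the report covers."""
    covered = {item.kind for item in items}
    return [kind for kind in ActionKind if kind not in covered]


items = [
    ActionItem("Add an alert for circumstance X", ActionKind.DETECT),
    # Invented example item, for illustration only.
    ActionItem("Add a pre-issuance lint check", ActionKind.PREVENT),
]
print([kind.value for kind in missing_kinds(items)])  # prints ['mitigate']
```

Because covering all three kinds is encouraged rather than required, a check 
like this would only be a prompt for the author, not grounds for rejecting a 
report.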

Okay, that ended up being more than a few. I also put together a rough draft of 
my suggested template 
<https://gist.github.com/aarongable/78167fc1464b6a8a0a7065112ac195e9> for 
people to look at and critique and improve.

Finally, I have one last suggestion for how the incident reporting process 
could be improved outside of the contents of the report itself.

1. It would be great to automate the process of setting "Next-Update" dates on 
tickets. I feel like I've had several instances where I requested a Next-Update 
date four or five weeks in the future, but then didn't get confirmation that 
this would be okay until just hours before I would have needed to post a weekly 
update. If this process could be flipped -- the Next-Update date gets set 
automatically based on the Action Items, and weekly updates are only necessary 
if a root program manager explicitly unsets it and requests more frequent 
updates -- that would certainly streamline the process a bit.
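
One way the flipped process could work is sketched below in Python. It rests 
on assumptions that the message above does not spell out: that "based on the 
Action Items" means the earliest upcoming due date among open action items, 
and that a ticket falls back to a weekly cadence when there are no upcoming 
due dates. The function name next_update and its parameters are hypothetical 
and do not correspond to any Bugzilla or CCADB feature.

```python
from datetime import date, timedelta


def next_update(today, open_item_due_dates, weekly_requested=False):
    """Pick a Next-Update date for an incident ticket.

    Assumptions (not from any root program policy):
    - by default, use the earliest upcoming due date among open action items;
    - if a root program manager has asked for more frequent updates, or no
      upcoming due dates exist, fall back to one week from today.
    """
    weekly = today + timedelta(days=7)
    upcoming = [d for d in open_item_due_dates if d > today]
    if weekly_requested or not upcoming:
        return weekly
    return min(upcoming)


today = date(2023, 8, 3)
due_dates = [date(2023, 9, 7), date(2023, 10, 1)]
print(next_update(today, due_dates))                         # prints 2023-09-07
print(next_update(today, due_dates, weekly_requested=True))  # prints 2023-08-10
```

The default path needs no confirmation from a root program manager, which is 
the streamlining the suggestion is after; only the weekly_requested override 
requires someone to act.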

Apologies for the length of this email. I hope that this is helpful, and gives 
people a good jumping-off point for further discussion of how these incident 
reports should be formatted and what information they should contain to be 
maximally useful to the community.

Thanks,
Aaron

On Tue, Aug 1, 2023 at 7:23 AM 'Clint Wilson' via CCADB Public 
<[email protected]> wrote:
Hi all,

If you have feedback on this topic, we would love to hear your thoughts.

Thank you!
-Clint

On Jul 20, 2023, at 8:19 AM, 'Clint Wilson' via CCADB Public 
<[email protected]> wrote:

All,

During the CA/Browser Forum Face-to-Face 59 meeting, several Root Store 
Programs expressed an interest in improving Web PKI incident reporting.

The CCADB Steering Committee is interested in this community’s recommendations 
on improving the standards applicable to and the overall quality of incident 
reports submitted by Certification Authority (CA) Owners. We aim to facilitate 
effective collaboration, foster transparency, and promote the sharing of best 
practices and lessons learned among CAs and the broader community.

Currently, some Root Store Programs require incident reports from CA Owners to 
address a list of items in a format detailed on ccadb.org [1]. While the CCADB 
format provides a framework for reporting, we would like to discuss ideas on 
how to improve the quality and usefulness of these reports.

We would like to make incident reports more useful and effective where they:


  * Are consistent in quality, transparency, and format.
  * Demonstrate thoroughness and depth of investigation and incident analysis, 
    including for variants.
  * Clearly identify the true root cause(s) while avoiding restating the issue.
  * Provide sufficient detail that enables other CA Owners or members of the 
    public to comprehend and, where relevant, implement an equivalent solution.
  * Present a complete timeline of the incident, including the introduction of 
    the root cause(s).
  * Include specific, actionable, and timebound steps for resolving the 
    issue(s) that contributed to the root cause(s).
  * Are frequently updated when new information is found and steps for 
    resolution are completed, delayed, or changed.
  * Allow a reader to quickly understand what happened, the scope of the 
    impact, and how the remediation will sufficiently prevent the root cause of 
    the incident from recurring.

We appreciate, to state it lightly, members of this community and the general 
public who generate and review reports, offer their understanding of the 
situation and impact, and ask clarifying questions.

Call to action: In the spirit of continuous improvement, we are requesting (and 
very much appreciate) this community’s suggestions for how CA incident 
reporting can be improved.

Not every suggestion will be implemented, but we will commit to reviewing all 
suggestions and collectively working towards an improved standard.

Thank you
-Clint, on behalf of the CCADB Steering Committee

[1] https://www.ccadb.org/cas/incident-report

--
You received this message because you are subscribed to the Google Groups 
"CCADB Public" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/a/ccadb.org/d/msgid/public/3B253FFF-4070-4F0E-95D2-166FAC01C5A7%40apple.com.

