Thanks for the great content, Aaron! I agree on every point, and thank you for
making such detailed suggestions.
I’d like to expand a little on the incident reporting part, as I think this is
potentially the greatest blind spot: the question of when an incident report
should be filed and when it shouldn’t.
There is legislation in the EU, as well as in other countries, that mandates
reporting of security incidents. In reality, however, most companies choose not
to comply. If, for example, the Data Protection Authorities received an e-mail
every time a data breach happened, processing them would be a massive
undertaking. From talking to these authorities, as well as to various companies
that are subject to these reporting requirements, it’s clear that what you say,
Aaron, is true.
If you are the “good person” and you report everything while your competition
keeps everything hidden, you are rarely rewarded and usually punished. And
we’re talking about the law here, which carries harsher punishments than
removal from a Root Program.
With that in mind, I think a culture of fear won’t help. CAs shouldn’t be
afraid to file incident reports, but at the same time they must own up to their
mistakes. It’s a difficult balancing act, and every member of this community
contributes to it constantly. At the end of the day, though, the Root Programs
are responsible for protecting their users, so we will inevitably arrive at a
point where a CA must be removed. When that happens, it should be fine, fair,
and well understood why.
Personally, I believe that a CA that keeps incidents hidden, downplays their
severity, or in general isn’t being honest is far more dangerous to users than
one that keeps making mistakes but shows clear signs of owning them, learning
from them, truly fixing them, and investing in a culture of improvement and
engineering excellence. As a Root Program Manager, I would take a CA I know any
day over one I know nothing about and have no real visibility into, other than
a couple of JPEGs once a year. Root Programs have only a tiny amount of signal
to make decisions with, and CAs have the power to deprive them of a lot of it
if they want to.
I wouldn’t ever rely on the metric “X incidents reported in QY” to make an
inclusion or removal decision. I would focus on qualitative metrics across the
entire operation: How many incidents are self-reported vs. externally forced?
How open was the CA? Was the issue really fixed? Were they aware of their role
and requirements? Was the report factual and accurate? Did they respond
quickly? Was it a common mistake anyone could make? Did they take ownership and
clean up and mitigate the fallout in a timely manner? And so on. Eventually it
comes down to the question of whether I trust CA X to have the best interests
of my users in mind and to keep them secure, and whether I think they’re
competent enough to do so. Obviously, you also need to factor in the value each
one adds, without ever allowing any CA to become Too Big To Fail. After all,
there have been zero-strike removals in the past.
If the Root Programs agree on that, and this is made clear, and their actions
are governed by such “principles” where transparency is clearly valued more
than X, Y, or Z, then I think we can make more progress. Otherwise it will be a
game of whack-a-mole and word-twisting, where the responsible CAs keep filing
incident reports even when they could have gotten away without doing so, and
the others do not.
Regarding your point about careful wording, this is true, but it’s again part
of the balancing act. If a CA changes its story three times in a report, and
it’s a new story every time they get backed into a corner with their claims,
then no matter how much goodwill you want to show, you can’t help but entertain
the thought that something else is going on there. And there’s no real way of
figuring out which version is true without upsetting this balance. I understand
that Root Programs have to apply more pressure in certain situations, but I’d
like to think it’s done to extract more signal for their decision makers, and
not to make anyone’s life miserable or day worse. In my mind, CAs should focus
on producing high-quality, factual, timely, and transparent reports. If that’s
the case, things will move forward productively, and we won’t fall into a
cat-and-mouse game. It is difficult, I know, and a few mistakes are okay; we’re
all human, after all.
As a side note, I also understand that, especially in the US for example,
companies will need to vet every communication, possibly having it reviewed by
legal counsel, which only adds delays and more “censorship” layers. And I guess
very few lawyers would advise their clients to publicly admit guilt and fault
during a commercial activity that could impact their customers. I think that,
most of the time, this is simply not explained to them properly:
I view Root Programs as companies and CAs as vendors. Every company has to do
vendor assessments during onboarding (inclusion) as well as periodically. They
also need to set a number of requirements in their RFP (the Root Program
Requirements), and every vendor that wants to sell their product to this
company has to comply with them. And of course, it’s up to the company to
ensure that all its vendors still check all the boxes, including any new ones
that are added: sometimes by sending a questionnaire, other times by looking at
an audit result or certification, and other times by observing the system
directly. Otherwise, if they can no longer trust a vendor with their business,
they have to shop for alternatives. If a lawyer is made aware of this
relationship, I think they can just go to the right playbook and figure it out.
It’s always important to explain what you want to do, why, and why it matters;
otherwise the answers will be wrong, or just “no”. If I ask someone whether X
adds risk, they’ll say “yes”, regardless of X. If I ask someone whether we can
take the risk of X in order to unlock Y, then it’s a completely different
answer.[1]
To summarize some of my points, I think it’s important to ensure that there is
a relationship based on trust from both sides, which I understand takes time to
build; until then, we’ll need to come up with a better-defined list of what is
and what isn’t report-worthy. Let’s work on this in an evolutionary rather than
a revolutionary manner, with small iterations until we fine-tune it. I’d
personally err on the side of more reports rather than fewer for now; we can
then analyze the data and figure out what the right next step is.
We should, of course, always be mindful of the load on CAs, as most of them
aren’t entities with unlimited money and resources. There’s even a non-profit
one! ;) We need to set requirements that provide the Root Programs with as much
signal as possible, without making the trust store an exclusive pool of
companies with deep pockets and dedicated “sales engineers”. There is, however,
a definite minimum cost if you want to “sell to {Mozilla, Apple, Google,
Microsoft}”.
Thanks,
Antonios
- - - - -
Footnote
- - - - -
1: There is a corner case here: Mozilla. My take on this is the following:
Apple, Google, Microsoft, etc. are for-profit companies that can afford to pay
FTEs to work on vendor assessments. Mozilla is an open-source, community-driven
organization which can afford *some* FTEs but also accepts value-adding
contributions from anyone who follows its guidelines and rules.
It’s the same with code: I can’t contribute patches to iOS or Gmail or Windows,
but I can contribute code to Firefox. However, that means that Firefox must be
open source, people must have access to the bug tracker, ... It’s still a
vendor assessment, however: one done collaboratively by Mozilla staff and
external contributors, and, due to its nature, done publicly.
> On 4 Aug 2023, at 21:33, 'Aaron Gable' via CCADB Public <[email protected]>
> wrote:
>
> Apologies for double-posting, but I just wanted to let folks know that I've
> updated my gist
> <https://gist.github.com/aarongable/78167fc1464b6a8a0a7065112ac195e9> to be a
> full rewrite of the incident reporting requirements page
> <https://www.ccadb.org/cas/incident-report>. It includes most of the existing
> verbiage about the purpose, filing timeline, and update requirements of
> reports, and preserves the Audit Incident Report section as-is. It then
> overhauls the Incident Report section to include a template and explicit
> instructions for filling out that template. I don't know if this is actually
> useful to the CCADB Steering Committee, but it seemed like the most succinct
> way to get all my thoughts in one place.
>
> Thanks again,
> Aaron
>
> On Thu, Aug 3, 2023 at 4:49 PM Aaron Gable <[email protected]
> <mailto:[email protected]>> wrote:
>> Hi Clint,
>>
>> I'm speaking here both as a member of the Let's Encrypt team (and I think we
>> write pretty good incident reports
>> <https://wiki.mozilla.org/CA/Responding_To_An_Incident#Examples_of_Good_Practice>),
>> and as someone with a decade of experience in incident-response roles,
>> including learning from the people who developed and refined Site
>> Reliability Engineering at Google.
>>
>> Fundamentally, the "Incident Reports" that CAs file in Bugzilla are the same
>> as what might be called "Incident Postmortems" elsewhere. Quoting from The
>> SRE Book <https://sre.google/sre-book/postmortem-culture/>, postmortems are
>> "a written record of an incident, its impact, the actions taken to mitigate
>> or resolve it, the root cause(s), and the follow-up actions to prevent the
>> incident from recurring". And honestly, I think that the current set of
>> questions and requirements <https://www.ccadb.org/cas/incident-report> gets
>> pretty close to that mark: CAs are explicitly required to address the root
>> cause, provide a timeline, and commit to follow-up actions.
>>
>> There are many resources available regarding how to write good incident
>> reports, and how to promote good "blameless postmortem" culture within an
>> organization. For example, the Google SRE Workbook has an example
>> <https://sre.google/workbook/postmortem-culture/#good-postmortem> of a
>> well-written postmortem; PagerDuty
>> <https://response.pagerduty.com/after/post_mortem_template/> and Google
>> <https://sre.google/sre-book/example-postmortem/> provide mostly-empty
>> postmortem templates; the Building Secure and Reliable Systems book has a
>> chapter on postmortems
>> <https://google.github.io/building-secure-and-reliable-systems/raw/ch18.html#postmortems>;
>> and much more. I particularly love this checklist
>> <https://docs.google.com/document/d/1iaEgF0ICSmKKLG3_BT5VnK80gfOenhhmxVnnUcNSQBE/edit>
>> of questions to make sure you address in a postmortem.
>>
>> Refreshing my memory of all of these, I find two big differences between the
>> incident reports that we have here in the Web PKI, and those promoted by all
>> of these other resources.
>>
>> 1. Our incident reports do not have a "lessons learned" section. We focus on
>> what actions were taken, and what actions will be taken, but not on the whys
>> behind those actions. Sure, maybe a follow-up item is "add an alert for
>> circumstance X" but why is that the appropriate action to take in this
>> circumstance? I believe that this is an easy deficiency to remedy, and have
>> suggestions for doing so below.
>>
>> 2. We do not have a culture of blameless postmortems. This is much harder to
>> resolve. Even though a given report may avoid laying the blame at the feet
>> of any individual CA employee, it is difficult to remove the feeling that
>> the CA is to blame for the incident as a whole, and punishment (removal
>> from a trust store) may be meted out for too many incidents. I cannot speak
>> for other CAs, but here at Let's Encrypt we have carefully cultivated a
>> culture of blamelessness when it comes to our incidents and
>> postmortems... and even here, the act of writing a report to be publicly
>> posted on Bugzilla is nerve-wracking due to fear of criticism and censure. I
>> honestly don't know if there's anything we can do about this. The nature of
>> the WebPKI ecosystem and the asymmetric roles of root programs and CAs are
>> facts we just have to deal with. But at the very least I think it is
>> important to keep this dynamic in mind.
>>
>> So with all that said, I do have a few concrete suggestions for how to
>> improve the incident report requirements.
>>
>> 1. Provide a template. Not just a list of (currently very verbose)
>> questions, but a verbatim template, with markdown formatting characters
>> (e.g. for headings) already included. This will make both writing and
>> reading incident reports significantly easier, and remove much ambiguity. It
>> will also have nice minor effects like establishing a standard format for
>> the timeline. I'm more than happy to contribute the template that Let's
>> Encrypt uses internally, and make changes / improvements to it based on my
>> other feedback here.
>>
>> 2. Require an executive summary. Many of the best-written incident reports
>> already include a summary at the top, because it provides just enough
>> context for the rest of the report to make sense to a new reader.
>>
>> 3. Remove the "how you first became aware" question. This should be built
>> into the timeline, not a question of its own. In my experience, this
>> question leads to the most repetition of content in the report.
>>
>> 4. Require that the timeline specifically call out the following events:
>> - Any policy, process, or software changes that contributed to the root
>> cause or the trigger
>> - The time at which the incident began
>> - The time at which the CA became aware of the incident
>> - The time at which the incident ended
>> - The time(s) at which issuance ceased and resumed, if relevant
>>
>> 5. Questions 4 (summary of affected certs) and 5 (full details) should be
>> revamped. Having these as separate questions back-to-back places undue
>> emphasis on the external impact of the incident, when what we care about
>> much more is the internal impact on the CA going forward (i.e. what changes
>> they're making to learn from and prevent similar incidents). The summary
>> should be moved to directly below the Executive Summary, and turned into a
>> more general "Impact" section -- how many certs, how many OCSP responses,
>> how many days, whatever statistic is relevant can be provided here. The full
>> details should be moved to the very bottom: the list of all affected
>> certificates is usually an attachment, so this section should be an appendix.
>>
>> 6. Change question 6 to explicitly call for a root cause analysis. The
>> current phrasing ("how and why the mistakes were made") lends itself to a
>> blameful-postmortem culture. Instead, we should ask CAs to interrogate what
>> set of circumstances combined to allow the incident to arise, and then what
>> final trigger caused it to actually occur. This root cause / trigger
>> approach is espoused by most of the postmortem guides I linked above.
>>
>> 7. There should be one additional question for "lessons learned". The most
>> common three sub-headings here are "What went well", "What didn't go well",
>> and "Where we got lucky". The first is very valuable in a blameless
>> postmortem culture, because it allows the team to toot its own horn: be
>> proud of the fact that the impact was smaller than it would have been if
>> this other mitigation hadn't been in place, celebrate the fact that an early
>> warning detection system caught it, etc. The second and third strongly
>> inform the set of follow-up action items: everything that went wrong should
>> have an action item designed to make it go right next time, and every lucky
>> break should have an action item designed to turn that luck into a guarantee.
>>
>> 8. The action items question should also ask for what kind of action each
>> is: does it detect a future occurrence, does it prevent a future occurrence,
>> or does it mitigate the effects of a future occurrence? CAs should be
>> encouraged (but not required) to include action items of all three types,
>> with an emphasis on prevention and mitigation.
>>
>> Okay, that ended up being more than a few. I also put together a rough draft
>> of my suggested template
>> <https://gist.github.com/aarongable/78167fc1464b6a8a0a7065112ac195e9> for
>> people to look at and critique and improve.
>>
>> Finally, I have one last suggestion for how the incident reporting process
>> could be improved outside of the contents of the report itself.
>>
>> 1. It would be great to automate the process of setting "Next-Update" dates
>> on tickets. I feel like I've had several instances where I requested a
>> Next-Update date four or five weeks in the future, but then didn't get
>> confirmation that this would be okay until just hours before I would have
>> needed to post a weekly update. If this process could be flipped -- the
>> Next-Update date gets set automatically based on the Action Items, and
>> weekly updates are only necessary if a root program manager explicitly
>> unsets it and requests more frequent updates -- that would certainly
>> streamline the process a bit.
>>
>> Apologies for the length of this email. I hope that this is helpful, and
>> gives people a good jumping-off point for further discussion of how these
>> incident reports should be formatted and what information they should
>> contain to be maximally useful to the community.
>>
>> Thanks,
>> Aaron
>>
>> On Tue, Aug 1, 2023 at 7:23 AM 'Clint Wilson' via CCADB Public
>> <[email protected] <mailto:[email protected]>> wrote:
>>> Hi all,
>>>
>>> If you have feedback on this topic, we would love to hear your thoughts.
>>>
>>> Thank you!
>>> -Clint
>>>
>>>> On Jul 20, 2023, at 8:19 AM, 'Clint Wilson' via CCADB Public
>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>>
>>>> All,
>>>>
>>>> During the CA/Browser Forum Face-to-Face 59 meeting, several Root Store
>>>> Programs expressed an interest in improving Web PKI incident reporting.
>>>>
>>>> The CCADB Steering Committee is interested in this community’s
>>>> recommendations on improving the standards applicable to and the overall
>>>> quality of incident reports submitted by Certification Authority (CA)
>>>> Owners. We aim to facilitate effective collaboration, foster transparency,
>>>> and promote the sharing of best practices and lessons learned among CAs
>>>> and the broader community.
>>>>
>>>> Currently, some Root Store Programs require incident reports from CA
>>>> Owners to address a list of items in a format detailed on ccadb.org
>>>> <http://ccadb.org/> [1]. While the CCADB format provides a framework for
>>>> reporting, we would like to discuss ideas on how to improve the quality
>>>> and usefulness of these reports.
>>>>
>>>> We would like to make incident reports more useful and effective where
>>>> they:
>>>>
>>>> - Are consistent in quality, transparency, and format.
>>>> - Demonstrate thoroughness and depth of investigation and incident
>>>>   analysis, including for variants.
>>>> - Clearly identify the true root cause(s) while avoiding restating the
>>>>   issue.
>>>> - Provide sufficient detail that enables other CA Owners or members of
>>>>   the public to comprehend and, where relevant, implement an equivalent
>>>>   solution.
>>>> - Present a complete timeline of the incident, including the introduction
>>>>   of the root cause(s).
>>>> - Include specific, actionable, and timebound steps for resolving the
>>>>   issue(s) that contributed to the root cause(s).
>>>> - Are frequently updated when new information is found and steps for
>>>>   resolution are completed, delayed, or changed.
>>>> - Allow a reader to quickly understand what happened, the scope of the
>>>>   impact, and how the remediation will sufficiently prevent the root
>>>>   cause of the incident from recurring.
>>>>
>>>> We appreciate, to put it lightly, members of this community and the
>>>> general public who generate and review reports, offer their understanding
>>>> of the situation and impact, and ask clarifying questions.
>>>>
>>>> Call to action: In the spirit of continuous improvement, we are requesting
>>>> (and very much appreciate) this community’s suggestions for how CA
>>>> incident reporting can be improved.
>>>>
>>>> Not every suggestion will be implemented, but we will commit to reviewing
>>>> all suggestions and collectively working towards an improved standard.
>>>>
>>>> Thank you
>>>> -Clint, on behalf of the CCADB Steering Committee
>>>>
>>>> [1] https://www.ccadb.org/cas/incident-report
>>>>