Speaking from a personal perspective - this all makes sense, and, to be honest, the spectrum/grade idea isn't good or robust. Implementing something like that requires too many judgment calls about whether a CA belongs in box x vs. box y and about what the difference between those two boxes even is. I also get the frustration with certain issues, especially when they keep popping up across CAs and the rule is well established.
I've been looking at the root causes of mis-issuance in detail (starting with DigiCert), and so far I've found they divide into a few buckets:

1) The CA relied on a third party for something and probably shouldn't have,
2) there was an internal engineering issue,
3) a manual process went bad,
4) software the CA relied on had an issue, or
5) the CA simply couldn't or didn't act in time.

From the incidents I've categorized so far (still working through all the incidents for all CAs), the biggest bucket appears to be engineering issues, followed by manual process issues. For example, at DigiCert proper, engineering issues represent about 35% of the incidents. (By DigiCert proper, I mean excluding the Sub CAs and QuoVadis systems; this lets me look exclusively at our internal operations compared to the operations of somewhat separate systems.) The next biggest bucket is our failure to move fast enough (30%), followed by manual process problems (24%). DigiCert proper doesn't use much third-party software in its CA, so that tends to be our smallest bucket.

The division between these categories is interesting because some are less in the CA's control than others. For example, if PrimeKey has an issue, pretty much everyone has an issue, since so many CAs use PrimeKey at some level (DigiCert via QuoVadis). The division is also somewhat arbitrary and based solely on the filed incident reports. What I'm looking for is whether the issues result from human error, insufficient implementation timelines, engineering issues, or software issues. I'm not ready to make a conclusion industry-wide.

The trend I've noticed at DigiCert is that the percentage of issues related to DigiCert manual processes is decreasing while the percentage of engineering blips is increasing. This is a good trend, as it means we are moving away from manual processes and toward better automation. What's also interesting is that the number of times we've had issues with moving too slowly has dropped significantly over the last two years, which means we've seen substantial improvement in communication and handling of changes in industry standards. The total number of issues increased, but I chalk that up to more transparency and scrutiny by the public (a good thing) rather than worse systems.

The net result is a nice report that we're using internally (and will share externally) that shows where the biggest improvements have been made. We're also hoping this data shows where we need to concentrate more. Right now, the data is pointing toward more focus on engineering and unit tests to ensure all systems are updated when a guideline changes.

So why do I share this data now, before it's ready? Well, I think looking at this information can maybe help define possible solutions. Long and windy, but... one resulting idea is that maybe you could require a report on improvements from each CA based on their issues. The annual audit could include a report similar to the above, where the CA looks at the past year of its own mistakes and the other industry issues and evaluates how well it did compared to previous years. This report could also describe how the CA changed its systems to comply with any new Mozilla or CAB Forum requirements. What automated process did they put in place to guarantee compliance? This part of the audit report can be used to reflect on the CA's operations, make suggestions to the browsers on where the CA needs to improve and where it needs to automate, and document one area of improvement the CA needs to focus on.
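To make the "unit tests / automated process" piece a bit more concrete, here's the kind of check I have in mind. This is just a rough sketch, not our actual tooling: it uses Python's cryptography library, an illustrative effective date, and a simplified version of the intermediate EKU rule.

# Rough sketch only (not DigiCert tooling): a lint-style unit test that a
# to-be-signed intermediate carries an EKU extension and keeps serverAuth
# and emailProtection separate. Uses Python's cryptography package; the
# policy date here is illustrative, not authoritative.
from datetime import datetime

from cryptography import x509
from cryptography.x509.oid import ExtendedKeyUsageOID, ExtensionOID

POLICY_EFFECTIVE = datetime(2019, 1, 1)  # illustrative effective date


def lint_intermediate(pem_bytes: bytes) -> list:
    """Return a list of policy problems found in a CA certificate."""
    cert = x509.load_pem_x509_certificate(pem_bytes)
    problems = []

    # Only CA certificates issued after the policy took effect are in scope.
    try:
        bc = cert.extensions.get_extension_for_oid(ExtensionOID.BASIC_CONSTRAINTS).value
    except x509.ExtensionNotFound:
        return problems
    if not bc.ca or cert.not_valid_before < POLICY_EFFECTIVE:
        return problems

    try:
        eku = cert.extensions.get_extension_for_oid(ExtensionOID.EXTENDED_KEY_USAGE).value
    except x509.ExtensionNotFound:
        problems.append("intermediate issued after the policy date has no EKU extension")
        return problems

    # Intermediates should be scoped to one use; don't mix TLS and S/MIME.
    usages = set(eku)
    if {ExtendedKeyUsageOID.SERVER_AUTH, ExtendedKeyUsageOID.EMAIL_PROTECTION} <= usages:
        problems.append("EKU combines serverAuth and emailProtection")
    return problems

Something like this, run as a unit test whenever a certificate profile changes or wired into the signing pipeline, is what I mean by an automated process that guarantees compliance when a guideline changes.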
Although a report like this doesn't cure immediate mis-issuances, it does give better transparency into what CAs are doing to improve and exactly how they implemented the changes made to the Mozilla policy. It also shifts some of the burden of dealing with issues to the community instead of the module owners and puts the emphasis on the CA fixing its systems and learning from mistakes. With the change to WebTrust audits, there's an opportunity for more free-form reporting that can include this information. And this information has to be far more interesting than reading about yet another individual who forgot to check a box in CCADB.

This is still more reactive than I'd like, and it sometimes means a whole year passes before a CA gives information about the changes made to its systems to reflect changes in policy. But the report does get people thinking proactively about what they need to do to improve, which may, by itself, be a force for improvement. It also allows the community to evaluate a CA's issues over the past year, how the CA addressed what went wrong compared to previous years, and what the CA is doing to make the next year even better.

Jeremy

From: Ryan Sleevi <[email protected]>
Sent: Monday, October 7, 2019 6:45 PM
To: Jeremy Rowley <[email protected]>
Cc: mozilla-dev-security-policy <[email protected]>; [email protected]
Subject: Re: Mozilla Policy Requirements CA Incidents

On Mon, Oct 7, 2019 at 7:06 PM Jeremy Rowley <[email protected]> wrote:

> Interesting. I can't tell with the Netlock certificate, but the other three non-EKU intermediates look like replacements for intermediates that were issued before the policy date and then reissued after the compliance date. The industry has established that renewal and new issuance are identical (source?), but we know some CAs treat these as different instances.

Source: Literally every time a CA tries to use it as an excuse? :)

My question is how we move past "CAs provide excuses", and at what point the same excuses fall flat?

> While that's not an excuse, I can see why a CA could have issues with a renewal compared to new issuance as changing the profile may break the underlying CA.

That was QuoVadis's explanation, although with no detail to support that it would break something, simply that they don't review the things they sign.

Yes, I'm frustrated that CAs continue to struggle with anything that is not entirely supervised. What's the point of trusting a CA then?

> However, there's probably something better than "trust" vs. "distrust" or "revoke" vs. "non-revoke", especially when it comes to an intermediate. I guess the question is what is the primary goal for Mozilla? Protect users? Enforce compliance? They are not mutually exclusive objectives of course, but the primary drive may influence how to treat issuing CA non-compliance vs. end-entity compliance.

I think a minimum goal is to ensure the CAs they trust are competent and take their job seriously, fully aware of the risk they pose. I am more concerned about issues like this, which CAs like QuoVadis acknowledge they would not cause. The suggestion of a spectrum of responses fundamentally suggests root stores should eat the risk caused by CAs' flagrant violations. I want to understand why browsers should continue to be left holding the bag, and why every effort at compliance seems to fall on how much the browsers push.
> Of the four, only QuoVadis has responded to the incident with real information, and none of them have filed a report in the required format or given sufficient information. Is it too early to say what happens before there is more information about what went wrong? Key ceremonies are, unfortunately, very manual beasts. You can automate a lot with scripting tools, but the process of taking a key out, performing a ceremony, and putting things away is not automated, due to the offline root and FIPS 140-3 requirements.

Yes, I think it's appropriate to defer discussing what should happen to these specific CAs. However, I don't think it's too early to begin to try to understand why it continues to be so easy to find massive amounts of mis-issuance, and why policies that are clearly communicated and require affirmative consent are something CAs are still messing up. It suggests that trying to improve things by strengthening requirements isn't helping as much as needed, and that perhaps more consistent distrusting is a better solution. In any event, having CAs share the challenges is how we do better. Understanding how the CAs that were not affected prevent these issues is equally important. We NEED CAs to be better here, so what's the missing piece that explains why it's working for some and failing for others?

I know it seems extreme to suggest starting to distrust CAs over this, but every single time, it seems there's a CA communication, affirmative consent, and then failure. The most recent failure to disclose CAs is equally disappointing and frustrating, and it's not clear we have CAs adequately prepared to comply with 2.7, no matter how much we try.

_______________________________________________
dev-security-policy mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-security-policy

