On the topic of root causes, there's also a recently published study: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3425554 . I'm not sure whether it was peer reviewed, but it does provide an analysis of m.d.s.p and Bugzilla. I have some concerns about the study's methodology (for example, when incident reports became normalized is relevant, as is incident reporting where security researchers first went to the CA), but I think it looks at root causes a bit holistically.

I recently shared another example of a "routine" violation on the CA/B Forum's mailing list: https://cabforum.org/pipermail/servercert-wg/2019-October/001154.html

My concern is that, 7 years later, while I think compliance has marginally improved (largely due to efforts led from outside the CA ecosystem, like CT and ZLint/Certlint), the answers/responses/explanations we get are still falling into the same predictable buckets. That concerns me, because it's neither sustainable nor healthy for the ecosystem.

- We misinterpreted the requirements. It said X, but we thought it meant Y. (Often: even though there's nothing in the text to support Y, that's just how we used to do business, and we're CAs, so we know more than browsers about what browsers expect from us.)
- We weren't paying attention to the updates. We've now assigned people to follow updates.
- We do X by saying our staff should do X. In this case, they forgot. We've retrained our staff / replaced our staff / added more staff to correct this.
- We had a bug. We did not detect the bug because we did not have tests for this. We've added tests.
- We weren't sure if X was wrong, but since no one complained, we assumed it was OK.
- Our auditor said it was OK.
- Our vendor said it was OK.

and so forth. And then, in the responses, we generally see:

- These certificates are used in Very Important Systems, so even though we said we'd comply, we cannot comply.
- We don't think X is actually bad. We think X should be OK, and it should be Browsers that reject X if they don't like X. (Implicit: but they should still trust our CA, even though we aren't doing what they want.)
- Our vendor is not able to develop a fix in time, so we need more time.
- We agree that X is bad, and has always been prohibited, but we need more time to actually implement a fix (because we did not plan/budget/staff to actually handle issues of non-compliance).

and so forth.

It's tiring and exhausting because we're hearing the same stuff.
These are the same patterns CAs were using when they'd issue MITM certs to companies: "Oh, wait, you meant DON'T issue MITM certs? We didn't realize THAT'S what you meant" (recall, this was at least one CA's response when caught issuing MITM certs).

I'm exasperated because we're seeing CAs do things like not audit sub-CAs, while leaving all the risk to be accepted by browsers, because it's too hard/complex to migrate. We're seeing CAs not follow policy requirements, and then correcting those issues is risky, because now they've issued a bunch of certs and it's painful to have to replace them all.

If we go back to that classic Dan Geer talk, https://cseweb.ucsd.edu/~goguen/courses/275f00/geer.html , every time a CA issues a certificate, they've externalized the risk onto browsers/root stores for that certificate's lifetime. It's left to the ecosystem to detect and clean up the mess, while the CA/subscriber gets the full benefits of the issuance.

It's a system of incentives that is completely misaligned, and we've seen it now for the past decade: the CA benefits from the (mis)issuance and extracts value until it's detected, and then the cost of cleanup is placed on the browser/Root Program that expects CAs to actually conform. If the Browser doesn't enforce, or doesn't enforce consistently, then we get back to the "Race to the bottom" that plagued the CA industry, as "Requirements" become "Suggestions" or "Nice ideas". Yet if the Browser does enforce, it takes the blame from the Subscriber, who is unhappy that the thing they bought no longer works.

In all of this time, it doesn't seem like we're making much progress on systemic understanding and prevention. If that's an unfair statement, then it means that some CAs are progressing and some aren't, so how do we help the ones that aren't? At what point do we go from education to removal of trust? Where is the line when the same set of responses has been used so much that it's no longer reasonable?

When this ecosystem moves at a snail's pace, due to CAs' challenges in updating systems and the long lifetime of certificates, the feedback loop is long, and CAs can exploit that asymmetry until they're detected. That may sound like I'm ascribing intentional malice, when I'm mainly just talking about the perverse incentives that are hindering meaningful improvement.

While I appreciate your suggestion of more transparency, and I'm all for it, this wouldn't help with, for example, QuoVadis' response to the issue. To borrow from Donald Rumsfeld, the set of issues with any single CA is, from the browser perspective, the "unknown unknowns". Such a report would not tell us, for example, that QuoVadis viewed renewal and issuance as separate and independent from requirements. Unless we had all of their processes and procedures in front of us, to review the diff, we wouldn't spot that there was an "issuance playbook" and a "renewal playbook". Of course, there might not have even been a "renewal" playbook until that matter came up, so if they created it new, we also wouldn't have detected it.

In theory, the incident reports are meant to help the ecosystem improve. But if we see egregiously bad incident reports, as I think we have, or incident reports that amount to stonewalling by giving the shortest answers with the least possible information, and we move to sanction those CAs, we only discourage future incident reporting.

To bring this back, now, to the original topic at hand: What should we be doing when requirements are phased in, with years of notice and advance communication, and they're still violated? What should we be doing when clear-cut requirements are violated?

I see a few options:

(a) Accept that what we're doing is not enough, and do something different. If so, what would be different, compared to everything that's been tried? That was the original gist of the first message.
(b) Accept that what we're doing is enough, and the CAs that are failing are simply not up to the task expected of them, and removing them is the only way to correct this. This was the gist of the second message.
(c) Accept that this system is inherently flawed, and the incentive structures so misaligned that this is the natural outcome of any complex system. If that's the case, perhaps we should more holistically look to replace the system?

This is relevant with the Policy 2.7 update. With all of the effort to provide added clarity and improved requirements, do we have reason to believe that CAs will adopt and follow it? The past approach has been to send a CA communication and require affirmative consent. That clearly is not working (for some CAs). Suggestions of doing it in the Forum are sometimes raised, but that clearly (per the related message) is also failing.

So, is there something different to try? I like the suggestion of listing everything that the CA is changing as part of their operation, although I don't think it will prevent these issues (back to "unknown unknowns"). I don't have much faith that the auditors will catch these issues, BR or otherwise.

So... what do we have to make sure Policy 2.7 goes off smoothly?