On the topic of root causes, there's also
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3425554 that was
recently published. I'm not sure if that was peer reviewed, but it does
provide an analysis of m.d.s.p and Bugzilla. I have some concerns about the
study methodology (for example, it matters when incident reporting became
normalized, as well as how incidents where security researchers first went
to the CA are captured), but I think it looks at root causes a bit
holistically.

I recently shared on the CA/B Forum's mailing list another example of
"routine" violation:
https://cabforum.org/pipermail/servercert-wg/2019-October/001154.html

My concern is that, seven years later, while compliance has marginally
improved (largely due to efforts driven from outside the CA ecosystem, like
CT and ZLint/Certlint), the answers/responses/explanations we get are still
falling into the same predictable buckets, and that's neither sustainable
nor healthy for the ecosystem.
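
To make the ZLint part of that concrete: pre-issuance linting is the kind
of mechanical check that catches whole classes of misissuance without
relying on a CA's own processes or attestations. A minimal sketch, assuming
ZLint's v3 Go API (the file path is hypothetical, and the print step stands
in for whatever a real pipeline would do to block issuance):

  package main

  import (
      "fmt"
      "log"
      "os"

      "github.com/zmap/zcrypto/x509"
      "github.com/zmap/zlint/v3"
      "github.com/zmap/zlint/v3/lint"
  )

  func main() {
      // Hypothetical path to a DER-encoded (pre-)certificate.
      der, err := os.ReadFile("candidate-cert.der")
      if err != nil {
          log.Fatal(err)
      }
      cert, err := x509.ParseCertificate(der)
      if err != nil {
          log.Fatal(err)
      }

      // Run the full lint corpus and report anything at Warn or above;
      // a real issuance pipeline would refuse to sign at this point.
      results := zlint.LintCertificate(cert)
      for name, r := range results.Results {
          if r.Status == lint.Warn || r.Status == lint.Error || r.Status == lint.Fatal {
              fmt.Printf("%s: %v %s\n", name, r.Status, r.Details)
          }
      }
  }

The point isn't this particular tool; it's that checks like this are cheap,
mechanical, and verifiable from outside the CA, which is a big part of why
they've moved the needle where CA self-attestation hasn't.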


   - We misinterpreted the requirements. It said X, but we thought it meant
   Y. (Often: there's nothing in the text to support Y; that's just how we
   used to do business, and we're CAs, so we know more than browsers about
   what browsers expect from us.)
   - We weren't paying attention to the updates. We've now assigned people
   to follow updates.
   - We do X by saying our staff should do X. In this case, they forgot.
   We've retrained our staff / replaced our staff / added more staff to
   correct this.
   - We had a bug. We did not detect the bug because we did not have tests
   for this. We've added tests.
   - We weren't sure if X was wrong, but since no one complained, we
   assumed it was OK.
   - Our auditor said it was OK.
   - Our vendor said it was OK.

and so forth.

And then, in the responses, we generally see:

   - These certificates are used in Very Important Systems, so even though
   we said we'd comply, we cannot comply.
   - We don't think X is actually bad. We think X should be OK, and it
   should be Browsers that reject X if they don't like X (implicit: but they
   should still trust our CA, even though we aren't doing what they want).
   - Our vendor is not able to develop a fix in time, so we need more time.
   - We agree that X is bad, and has always been prohibited, but we need
   more time to actually implement a fix (because we did not plan/budget/staff
   to actually handle issues of non-compliance)

and so forth.

It's tiring and exhausting because we're hearing the same stuff. These are
the same patterns CAs used when they'd issue MITM certs to companies: "Oh,
wait, you meant DON'T issue MITM certs? We didn't realize THAT'S what you
meant" (recall, this was at least one CA's response when caught issuing
MITM certs).

I'm exasperated because we're seeing CAs do things like not audit sub-CAs,
leaving all the risk to be accepted by browsers, because it's too
hard/complex to migrate. We're seeing CAs not follow policy requirements,
but then treat correcting those issues as risky, because now they've issued
a bunch of certs and it's painful to have to replace them all.

If we go back to that classic Dan Geer talk,
https://cseweb.ucsd.edu/~goguen/courses/275f00/geer.html , every time a CA
issues a certificate, it externalizes the risk onto browsers/root stores
for that certificate's lifetime. It's left to the ecosystem to detect and
clean up the mess, while the CA/subscriber gets the full benefit of the
issuance. It's a system of completely misaligned incentives, and we've seen
it for the past decade: the CA benefits from the (mis)issuance and extracts
value until it's detected, and then the cost of cleanup is placed on the
browser/Root Program that expects CAs to actually conform. If the Browser
doesn't enforce, or doesn't enforce consistently, then we get back to the
"race to the bottom" that plagued the CA industry, as "Requirements" become
"Suggestions" or "Nice ideas". Yet if the Browser does enforce, it suffers
the blame from the Subscriber, who is unhappy that the thing they bought no
longer works.

In all of this time, it doesn't seem like we're making much progress on
systemic understanding and prevention. If that's an unfair statement, then
it means that some CAs are progressing, and some aren't, so how do we help
the ones that aren't? At what point do we go from education to removal of
trust? Where is the line when the same set of responses have been used so
much that it's no longer reasonable? When this ecosystem moves at a snail's
pace, due to CAs' challenges in updating systems and the long lifetime of
certificates, the feedback loop is long, and CAs can exploit that
asymmetry until they're detected. That may sound like I'm ascribing
intentional malice, when I'm mainly just talking about the perverse
incentives here that are hindering meaningful improvement.

While I appreciate your suggestion of more transparency, and I'm very much
for it, this wouldn't help with, for example, QuoVadis' response to the
issue. To borrow from Donald Rumsfeld, the set of issues with any single CA
is, from the browser perspective, the "unknown unknowns". Such a report
would not tell us, for example, that QuoVadis viewed renewal and issuance
as separate and independent of the requirements. Unless we had all of their
processes and procedures in front of us to review the diff, we wouldn't
spot that there was an "issuance playbook" and a "renewal playbook". Of
course, there might not even have been a "renewal" playbook until that
matter came up, so if they had created it fresh, we also wouldn't have
detected it.

In theory, the incident reports are meant to help the ecosystem improve.
But if we see egregiously bad incident reports, as I think we have, or
incident reports that amount to stonewalling by giving the shortest, least
informative answers possible, and we move to sanction those CAs, we only
discourage future incident reporting.

To bring this back, now, to the original topic at hand: What should we be
doing when requirements are phased in with years of notice and advance
communication, and they're still violated? What should we be doing when
clear-cut requirements are violated?

I see a few options:
(a) Accept that what we're doing is not enough, and do something different.
If so, what would be different, compared to everything that's been tried?
That was the original gist of the first message.
(b) Accept that what we're doing is enough, and the CAs that are failing
are simply not up to the task expected of them, and removing them is the
only way to correct this. This was the gist of the second message.
(c) Accept that this system is inherently flawed, and the incentive
structures so misaligned that outcomes like this are the natural
expectation for any complex system. If that's the case, perhaps we should
look more holistically at replacing the system?

This is relevant to the Policy 2.7 update. With all of the effort to
provide added clarity and improved requirements, do we have reason to
believe that CAs will adopt and follow it? The past approach has been to
send a CA communication and require affirmative consent. That clearly is
not working (for some CAs). Suggestions of doing it in the Forum are
sometimes raised, but that clearly (per the related message) is also
failing. So, is there something different to try? I like the suggestion of
having each CA list everything it is changing as part of its operations,
although I don't think that will prevent these issues (back to "unknown
unknowns"). I don't have much faith that the auditors will catch these
issues, BR or otherwise. So... what do we have, then, to make sure Policy
2.7 goes off smoothly?
