I think the question posed is a good one and, as a relying party, I think
the answer should be "whatever it takes."
More specifically, I think CAs often misinterpret what their response to a
delayed revocation event should be. Rather than generalities
or abstract strategies ("we will encourage adoption of automation!" or "we
will discourage certificate pinning" or...), the actions should be specific
and concrete, and tied to each subscriber for whom there was a delay. Some
examples, inspired by recent delayed revocation events, of the kinds of
actions I would expect from CAs:
"We didn't know subscriber X was using certificates to manage their fleet
of nuclear reactors. Since that is against our terms of service, subscriber
X is no longer a customer of ours."
OR
"We didn't know that subscriber Y was contractually required to notify
others 90 days in advance of a certificate change. Since they cannot
possibly abide by the 1 day/5 day revocation timelines, we have moved them
to a private PKI [or they are no longer our customer]."
OR
"Subscriber Z needed a delay because they had one person responsible for
manually requesting and replacing 5000 certificates. We have worked with
that subscriber so that certificate replacement is now automated, and can
be completed in the timelines required by the BRs."
OR
"We did not grant a revocation delay to subscriber Q because they said we
cannot communicate the reason for the delay to the root programs and the
community."
The fact that we are not seeing responses like these speaks volumes, I
think, about the current priorities within a majority of CAs. The question,
of course, is what to do so that these kinds of responses become the norm.
Personally, I think the idea of having 1 day / 5 day revocation technically
enforced by the root programs (rather than by the CAs) has a lot of merit.
In particular, it would change the conversation that CAs have with
subscribers from "maybe we will delay, maybe not, convince us" to "these
certificates are being revoked in 5 days. What can we do to help it not be
an interruption to your line of business?". I think this is actually a
really strong benefit of the approach. The biggest weakness I can see is
that it might promote "intentional ignorance" within CAs, at least for a
period of time, where "oops, our first list didn't have all the mis-issued
certificates, sorry about that" becomes a refrain. This could be nipped in
the bud if the root store policy were "if we find that is happening, then
at the next revocation event we will add all affected issuing
intermediates as revoked." That would be a strong incentive to do it right
the first time [and also to use more intermediates to limit the blast
radius, which is also a good thing in my opinion].
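To make the enforcement idea above concrete, here is a rough sketch of the
bookkeeping a root program might do. It is only an illustration under my own
assumptions: the class, field, and function names are hypothetical, the clock
is assumed to start when the CA becomes aware of the problem, and nothing here
reflects how any actual root program works or plans to work.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Set

# The 1 day / 5 day revocation windows discussed above.
REVOCATION_WINDOW = {"1-day": timedelta(days=1), "5-day": timedelta(days=5)}


@dataclass(frozen=True)
class AffectedCert:
    serial: str
    intermediate: str                      # issuing intermediate (hypothetical label)
    ca_aware_at: datetime                  # assumed start of the revocation clock
    window: str                            # "1-day" or "5-day"
    revoked_at: Optional[datetime] = None  # None if the CA has not revoked yet


def deadline(cert: AffectedCert) -> datetime:
    """Latest time by which the certificate must be revoked."""
    return cert.ca_aware_at + REVOCATION_WINDOW[cert.window]


def serials_to_block(certs: Iterable[AffectedCert], now: datetime) -> List[str]:
    """Serials the root program would block itself: past deadline, still unrevoked."""
    return [c.serial for c in certs if c.revoked_at is None and now >= deadline(c)]


def missing_from_ca_list(ca_list: Set[str], watchdog: Set[str]) -> Set[str]:
    """Serials an independent watchdog found that the CA's first list left out."""
    return watchdog - ca_list


def intermediates_to_revoke(
    certs: Iterable[AffectedCert], prior_list_was_incomplete: bool
) -> Set[str]:
    """Escalation: after an incomplete list, the next event costs whole intermediates."""
    if not prior_list_was_incomplete:
        return set()
    return {c.intermediate for c in certs}
```

The point of the last function is the incentive: a complete first list costs
the CA only the affected certificates, while an incomplete one puts every
affected issuing intermediate at risk at the next event.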
I am less sold on the idea of capping the validity time of certificates of
CAs with more than their share of problems. While this does push towards
greater agility, I doubt it will have the intended effect, instead pushing
most subscribers to other CAs who still issue yearly.
Whether or not a technical solution like these is adopted, it is clear to
me that something must be done here. Perhaps root programs should also be
much more willing to revoke roots over delay events ("three times in a year
and you're out."?). But I think the intermediate technical solutions have a
better chance of driving the ecosystem to where it needs to go.
On Monday, May 20, 2024 at 8:46:00 PM UTC-4 Mike Shaver wrote:
> DELAYED REVOCATION IS TOO COMMON
>
> This is long enough, so I’ll spare readers dozens of links to
> delayed-revocation incidents collected in Bugzilla; we all know that pretty
> much any other incident that involves misissuance will come with a
> delayed-revocation chaser these days.
>
> In *many* of those delrev (?) incidents, we see a phrase like “we requested
> that our subscribers revoke and reissue”. They are not informing their
> subscribers as to a fixed revocation timeline, but rather simply asking
> those subscribers if they would please do the revocation process when
> they’re able. In one case, I heard of a revocation request from a major CA
> that didn’t even have a timeline *suggested*. Of course, the subscriber
> gets no value out of replacing their certs: it’s pure overhead, and if
> WebPKI were operated perfectly, it would never be necessary. This is an
> externality of, most often, a CA’s failure to sufficiently invest in
> understanding, implementing, and verifying the processes that they use to
> twirl the keys to the entire web’s security.
>
> Indeed in a number of cases the CAs didn’t even stop issuing once they
> realized that they were misissuing certs! Intentionally issuing certs that
> are known to be bad, what a world.
>
> While CAs generally claim that they would be able to handle a mass
> revocation incident (such as due to leaked key material), the evidence we
> have for CAs aggressively revoking as called for by the BRs and the root
> programs is…scant. We’ve seen “it was a long weekend” as a reason for
> delaying revocation for certs—including some used by a different part of
> the CA’s company! One CA has proposed a “global fire drill” to stress-test
> revocation procedures, but we’re seeing revocation timelines reaching
> multiple months right now, so…a lot of stuff would end up burning in that
> fire.
>
> CAs also tell us that they advocate and recommend for their subscribers to
> implement automation for cert management, but we never see any concrete
> targets or success criteria for those efforts, so they certainly seem to me
> to just be more “asking nicely”. (I’m not sure that all of the CAs claiming
> to be pushing for subscriber automation actually have robust ACME or
> similar support yet, in fact.)
>
> (Some of the CAs made explicit promises years ago to not delay revocation,
> some of them issued even though they knew that zlint showed an error—there
> are lots of additional twists on simply “issuing bad certs and not cleaning
> them up as agreed”.)
>
> Now, in the wake of these *many* delrev incidents, over years of history,
> the root programs have responded with pretty much no consequences
> whatsoever as far as I can tell. There’s one case open about Entrust’s
> overall behaviour, who are certainly over-achieving when it comes to ways
> to get location fields wrong, but they are definitely not the only ones who
> treat the BRs’ 1/5-day revocation instruction as instead meaning “when it’s
> convenient for the customer”.
>
> THE QUESTION
>
> So: what should be done to make revocations of misissued
> certificates—sometimes *intentionally* misissued certificates—as prompt as
> the BRs require?
>
> The cost equation for CAs is obviously skewed against the health of the web
> PKI, if we are to believe that the BRs are important. Once a CA has
> violated the BRs and misissued, it is *in their commercial interest* to not
> revoke promptly: it causes embarrassment and subscriber frustration, or
> even disruption to subscriber services. At the limit it might even lead a
> subscriber to change CAs if the reissuance events are frequent and
> disruptive enough.
>
> On the other hand, the more bad certs there are floating around, even if
> it’s “only” a matter of a case mismatch, the less interoperable the web PKI
> is, and the harder it is for a relying party to make effective use of
> WebPKI’s guarantees. Let’s please not end up with a “quirks mode” for TLS
> certificate handling!
>
> SOME OPTIONS
>
> One option: decide that there really are some BR violations that “don’t
> matter”, such that revocation can happen on a more relaxed, accommodating
> timeline—or perhaps not at all, just letting them expire as has been seen
> in some delrev incidents already. This would mean that we would still see
> incident reports that in theory help other CAs learn to put the postal code
> in the right field or similar, but subscribers and CAs and root programs
> would have to do less work.
>
> Another option: have affected certificates added to OneCRL after 72 hours.
> It would benefit from some automation, but it’s probably feasible to make
> relatively smooth. It is sometimes the case, worryingly, that it takes CAs
> a fair bit of time and multiple attempts to find all the affected
> certificates, so this might require some linter running off CT logs or
> similar as a watchdog.
>
> Another another option: forbid CAs from selling WebPKI certificates into
> environments where a) revocation within a 5-day limit is operationally
> infeasible, and b) disruption of the related services would cause risk to
> human health and safety or similar. There are apparently many organizations
> out there which are critical to national economies or whatever, but need
> literal Earth months to replace a certificate. These are clearly cases
> where the requirements of WebPKI are incompatible with the operational
> constraints of the subscriber, so it’s not a good idea to mix them. (I’m
> sure some CAs could offer help with private PKI systems, probably with
> compelling margins.)
>
> Yet another, this time somewhat more preventative: if a CA repeatedly
> demonstrates that they are unable or (always the case?) unwilling to honour
> their commitments to the BRs, impose validity length restrictions on certs
> that they issue. At least in that case future misissued certificates would
> not be in the wild for as long, and it would also show nicely that CAs’ advocacy
> for certificate automation was fruitful. Ignoring Entrust’s diatribe
> against 90-day validity periods in that weird blog post, I don’t think that
> any CA has made a credible case that their customers would not be able to
> handle rotating certificates every 90 days, even if they have to carve the
> new fingerprint into a mountain using a toothbrush or whatever. They’d even
> know it’s coming.
>
> One more: make delayed revocation incidents, specifically, more visible to
> subscribers and potential subscribers, and see if business pressure does
> what merely “agreeing legally to follow the BRs” (and optionally making
> empty “it’ll never happen again” promises) has been unable to do in too
> many cases.
>
> THANKS FOR READING
>
> I think the WebPKI is being poorly served by the *realities* of
> certificate integrity and misissuance responses. If nothing else, it’s
> causing a ton of delrev incidents for Ben to have to shepherd, without even
> module peers to assist him.
>
> Something needs to change.
>
To view this discussion on the web visit
https://groups.google.com/a/mozilla.org/d/msgid/dev-security-policy/f04a0a55-39b1-4127-a69d-14923271cb39n%40mozilla.org.