Re: [Acme] Practical concerns of draft-ietf-acme-ari

Rob Stradling Thu, 20 Jul 2023 04:14:19 -0700

> Totally agreed, we don't love the heavy-polling nature of ARI as it stands 
> either. It's a lot of requests...
> ...
> This leads to the question of: what should we use to uniquely identify the 
> certificate instead? Certainly we could go with the "fingerprint" or 
> "thumbprint" (a sha256 hash of DER bytes or PEM encoding, depending on who 
> you ask, of the certificate) if people think that is sufficiently simple, 
> easy to specify, unique, and future-proof. We could also go with "just the 
> Serial", and force existing ACME servers to choose between either keeping 
> serials unique across all issuers they represent, or splitting the server 
> into multiple servers which each represent just a single issuer. Or we could 
> return to the "url in the Order object" approach we started with. I'm curious 
> what path forward people think is best.


For reasons I outlined in 
https://mailarchive.ietf.org/arch/msg/acme/aoiW7X3lPYoQ6X8hhRGEvG3HDmo/, I have 
a strong preference for sticking with CertID and an equally strong preference 
against a return to the "url in the Order object" approach.

________________________________
From: Acme <[email protected]> on behalf of Aaron Gable 
<[email protected]>
Sent: 19 July 2023 23:05
To: Tim Hollebeek <[email protected]>
Cc: Matthew Holt <[email protected]>; [email protected] <[email protected]>
Subject: Re: [Acme] Practical concerns of draft-ietf-acme-ari


Hi Matt,

Agreed with Tim, receiving practical feedback from implementers of the draft 
standard is very useful. I'll put my thoughts, comments, and questions in-line.

On Fri, Jun 23, 2023 at 9:21 AM Matthew Holt 
<[email protected]> wrote:

With respect to ARI, ACME servers and clients have conflicts of interest. The 
ACME client's goal is to keep the site up (with renewed and unrevoked 
certificates); the optimal way to do this is to start renewing early and retry 
often. The ACME server's goal is to keep the service up; the optimal way to do 
this is to suppress clients that overload your capacity. Obviously, these two 
goals are in opposition with each other. Proactive clients can spike demand, 
which can cause service interruptions. But service interruptions make clients 
more paranoid, retrying even more often until it works, and so on. ARI narrows 
the timeframe in which a conforming client can retry failed renewals, which 
reduces reliability more as time goes on. Without ARI, this window is a 
reasonable ~60 days. With ARI, however, the window is reduced to just a few 
minutes, hours, or days. The less time until expiration, the less hope there is 
to renew the cert in time. As the draft currently stands, this is in the 
server's interest, but not the client's.

I'm confused by the statement that "with ARI the window is reduced to just a 
few minutes, hours, or days". The draft spec clearly states that the client 
should renew during the window if it can, but that any time after the window is 
also acceptable: "if the selected time is in the past, attempt renewal 
immediately". The renewal window only becomes reduced to a few minutes, hours, 
or days if the ACME server shifts the suggested renewal window that far. Which, 
sure, is possible, but is clearly against the server's best interest as well: 
if the ACME server can't provide continuity of business to their Subscribers, 
then their Subscribers will go elsewhere for certificates.
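
The client-side rule quoted above can be sketched as follows. This is a minimal illustration rather than text from the draft: the function name pick_renewal_time is hypothetical, and choosing the time uniformly at random inside the suggested window is an assumption about how a client would select it.

```python
import random
from datetime import datetime, timedelta, timezone


def pick_renewal_time(window_start, window_end, now, rng=random):
    """Pick a random instant inside the suggested window.

    If the selected instant is already in the past, return `now`,
    meaning "attempt renewal immediately".
    """
    span = (window_end - window_start).total_seconds()
    if span < 0:
        raise ValueError("window_end is before window_start")
    selected = window_start + timedelta(seconds=rng.uniform(0, span))
    return max(selected, now)


now = datetime(2023, 7, 20, tzinfo=timezone.utc)
# A window that closed a week ago: the selected time is in the past,
# so the client renews right away instead of giving up.
print(pick_renewal_time(now - timedelta(days=9), now - timedelta(days=7), now))
```

Under this reading, a missed window never blocks renewal; it only moves the attempt to "now".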

Can this be improved? Absolutely, I'm certain of it. I'd love to hear 
suggestions for ways that the server could suggest a renewal time that doesn't 
run up against this push/pull between wanting to smooth traffic without making 
clients nervous. Unfortunately, I don't believe either of the suggestions at 
the bottom of the message actually addresses this point (more on that below).

1) It is optional. No one will implement this. OK, some clients will -- but I 
can say with authority from years of experience that optional restrictions are 
not typically favored. Very little mainstream software follows best practices to 
a tee.

Yep, optional features are difficult to incentivize. I think there's one 
obvious carrot to incentivize client adoption: "if you implement ARI, your 
certs will be renewed *before* they're revoked in the next mass revocation 
incident". Continuity of business can be a powerful motivator. Frankly, Let's 
Encrypt is even considering bigger carrots, such as "your subscriber account 
can only get short-lived certs if we've seen it request ARI endpoints", or 
"your renewal requests bypass all rate limits if they're made within the ARI 
suggested window". We don't know if we'll dangle either of those carrots, but 
it's clear that there are ways to incentivize adoption.

2) A narrower renewal timeframe makes clients less reliable. In theory it 
should make them *more* reliable since it smooths out traffic, thus improving 
CA availability. But this assumes that most clients actually implement and 
follow ARI. Since it's optional, I don't see that happening. Especially since 
most ACME clients are still running as static cron jobs like it's 2015...

I'm sure ARI doesn't really change in the nominal case, which is 99.9..9% of 
the time. In fact, Let's Encrypt's ARI seems to correspond with when my clients 
attempt renewals on their own anyway. (So in that sense, ARI is actually 
useless 99.9..9% of the time?)

But when a renewal window does change, what does that mean? Well, something is 
wrong. Either the certificate is being revoked, or the CA anticipates downtime 
or availability issues.

This is not true. Explicitly, by the spec, the renewal window changing means 
nothing. The situations you list are the motivations for writing the spec in 
the first place, but they are not the only motivations for changing the window 
in any given case. In fact, Let's Encrypt is currently considering adding 
random jitter to the renewal window every time it is requested, specifically to 
prevent interpretations like this, and to naturally even-out renewal spikes 
through Brownian motion.
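
A rough sketch of what per-request jitter could look like on the server side. Everything here is an assumption for illustration: the function name jitter_window, the max_shift parameter, and the choice of a uniform offset that is equally likely to move the window earlier or later. Nothing in this thread specifies how Let's Encrypt would implement it.

```python
import random
from datetime import datetime, timedelta, timezone


def jitter_window(start, end, max_shift, rng=random):
    """Shift the whole window by a random offset in [-max_shift, +max_shift].

    The window keeps its length; only its position moves.
    """
    bound = max_shift.total_seconds()
    offset = timedelta(seconds=rng.uniform(-bound, bound))
    return start + offset, end + offset


start = datetime(2023, 8, 20, tzinfo=timezone.utc)
end = start + timedelta(days=2)
# Each request sees a slightly different window, so a change in the
# window carries no signal by itself.
print(jitter_window(start, end, timedelta(hours=6)))
```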

If we wait until the (adjusted) window to start renewing, we run ourselves 
closer to the imminently-impending revocation or the expiration of the 
certificate, lowering our chances of a successful renewal.

This assumes that the adjusted window will always be later in the lifetime of 
the certificate than before. There is no reason to make this assumption. A CA 
adjusting suggested windows in order to smooth out a load spike would be wise 
to shift 50% of renewal windows earlier. Waiting to renew until a time that is 
earlier than when you would have renewed anyway does not make things riskier.

1) Many CAs enforce rate limits. If clients are to honor ARI windows, we would 
need a guarantee that the first successful cert within the ARI window will be 
allowed regardless of relevant rate limits. Because ARI restricts a client's 
ability to spread out renewals when managing certificates in bulk with respect 
to rate limits, the rate limits must NOT be a blocker when honoring ARI.

I like this idea. We hope and plan to implement this regardless, as I suggested 
above with regards to it being a carrot that we can dangle to incentivize 
client adoption. However, I don't believe it is something that can be 
reasonably specified in an IETF RFC: rate limits are not part of the ACME 
protocol, they're an internal detail of ACME server implementations. Happy to 
be proven wrong.

2) If ARI were actually enforced, some concerns would be resolved... for 
example, we can have assurances that other ACME clients are doing the same, 
thus improving CA availability. It would essentially be the CA scheduling each 
individual certificate for each ACME client instance -- that's quite a powerful 
idea, as long as availability is guaranteed (which it's not).

What do you mean by "enforced"? Deny newOrder requests that appear to be 
renewals but fall outside the suggested window?

3) ARI does not scale well. Some ACME clients manage 10K+ certificates, and in 
that case the client would have to check the ARI for at least 24 certificates 
per hour to get through them in a month. Deferring to the Retry-After header 
may result in insufficient throughput. The current expectation or convention is 
to check every certificate every 6-12 hours, or tens of thousands of checks per 
day. One endpoint per certificate multiple times per day is quite saturating. 
This is a considerable burden for both ACME clients and servers. I would like 
to explore options that do not involve 2+ HTTP requests per certificate.
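
The request volume under discussion can be estimated with a few lines of arithmetic. This is a back-of-the-envelope sketch: the fleet size of exactly 10,000 certificates and the 30-day month are assumptions, and the helper names are hypothetical.

```python
def ari_checks_per_hour(cert_count, sweep_days):
    """Checks per hour needed to visit every certificate once in sweep_days."""
    return cert_count / (sweep_days * 24)


def ari_checks_per_day(cert_count, poll_interval_hours):
    """Checks per day when every certificate is polled every poll_interval_hours."""
    return cert_count * 24 / poll_interval_hours


# One pass over 10,000 certificates in a 30-day month.
print(ari_checks_per_hour(10_000, 30))
# Polling each of 10,000 certificates every 12 hours and every 6 hours.
print(ari_checks_per_day(10_000, 12), ari_checks_per_day(10_000, 6))
```

With these inputs the 6-12 hour polling convention works out to 20,000 to 40,000 requests per day for a single client, which is the "tens of thousands" figure mentioned above.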

Totally agreed, we don't love the heavy-polling nature of ARI as it stands 
either. It's a lot of requests, and that's a large part of why we've striven to 
keep the response size so small. The original version of this was just a single 
timestamp. It's grown to two timestamps and an optional URL thanks to community 
feedback, but I'd be happy to reduce the response size again if we decide that 
prioritizing efficiency is more important than prioritizing third-party 
certificate monitoring tools.

Unfortunately, I don't currently have a different approach that I love. The 
24-hour revocation timeline enforced by the BRs for certain kinds of 
revocations means that clients should be checking at least once every 24 hours, 
regardless of mechanism. I'll comment more on your specific proposals to 
address this below.

4) Crafting the URL is convoluted. As Peter Cooper described it, "The core 
issue is that the URL you need to construct is based on an OCSP structure 
identifying the certificate, which requires taking one's existing certificate 
and parsing out the serial number and issuer, and also taking the intermediate 
certificate that signed it and getting its public key too. So rather than just, 
like, using the fingerprint of the existing leaf or something similarly simple 
that a lot of tooling can already give you, one needs to really dig into both 
the leaf, and the intermediate, and hash various pieces thereof, and then take 
all that to build a new ASN.1 structure." Why are we striving for near-parity 
with an OCSP request?? This should be orthogonal to OCSP, right?

This is great feedback. We picked this request format specifically because we 
thought it would be easy. It's good to know that we were wrong, and to 
investigate what other request formats would work better.

Allow me to provide a little bit of context for how we arrived at using the 
OCSP CertID structure:

We need a way to uniquely identify the certificate in question. ACME has one 
mechanism for doing so already: the URL provided by a finalized Order. 
Personally, my ideal would be to say "the ARI url is the Certificate URL 
concatenated with /ari". Unfortunately we can't do that, because there's 
nothing to prevent the URL provided by an Order from having query parameters, 
in which case appending a new path component would be incorrect. So, we could 
follow ACME's example, and provide a second "renewalInfo" URL in finalized 
Orders as well. Unfortunately, this means that a) clients have to persist this 
URL in order to use it, and b) clients which did not persist the URL (either 
ephemeral clients, or third-party certificate monitoring clients) cannot 
construct the URL at all.

So we need a way to uniquely identify a certificate which can be constructed 
from the certificate itself. The serial seems like an obvious candidate. 
However, serials are only required to be unique on a per-issuer basis, and a 
single ACME server may issue from multiple issuer certificates. It turns out 
that OCSP already has a solution for this: combine the serial with a unique 
identifier of the issuer. And OCSP's solution even comes with algorithm agility 
for how the unique identifier of the issuer is computed! That's nice. So we 
took OCSP's request format, stripped away the pieces not pertaining to 
identifying a single certificate, et voila, the CertID.
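
To make the moving parts concrete, here is a minimal sketch of building a CertID by hand, without an OCSP library. It follows the RFC 6960 CertID layout (hash algorithm, hash of the issuer's distinguished name, hash of the issuer's public key, serial number). Several things are assumptions for illustration: SHA-256 as the hash, base64url without padding as the encoding used in the URL, the function names, and the placeholder input bytes. A real client would extract the issuer name and key from the certificates with an X.509 library.

```python
import base64
import hashlib

# DER for AlgorithmIdentifier { OID 2.16.840.1.101.3.4.2.1 (sha256), NULL }.
SHA256_ALG_ID = bytes.fromhex("300d06096086480165030402010500")


def der_tlv(tag: int, body: bytes) -> bytes:
    """Encode one DER element; only short-form lengths (< 128 bytes) are handled."""
    if len(body) >= 128:
        raise ValueError("body too long for this sketch")
    return bytes([tag, len(body)]) + body


def der_integer(value: int) -> bytes:
    """Encode a positive INTEGER, with a leading 0x00 when the high bit is set."""
    if value <= 0:
        raise ValueError("serial numbers are positive")
    return der_tlv(0x02, value.to_bytes(value.bit_length() // 8 + 1, "big"))


def cert_id_path(issuer_name_der: bytes, issuer_key_bits: bytes, serial: int) -> str:
    """Return the base64url (unpadded) DER encoding of a SHA-256 CertID."""
    body = (
        SHA256_ALG_ID
        + der_tlv(0x04, hashlib.sha256(issuer_name_der).digest())  # issuerNameHash
        + der_tlv(0x04, hashlib.sha256(issuer_key_bits).digest())  # issuerKeyHash
        + der_integer(serial)                                      # serialNumber
    )
    cert_id = der_tlv(0x30, body)  # outer SEQUENCE
    return base64.urlsafe_b64encode(cert_id).rstrip(b"=").decode("ascii")


# Placeholder inputs, not taken from a real certificate.
print(cert_id_path(b"issuer-name-der", b"issuer-public-key-bits", 0x1234))
```

Even in this stripped-down form the client has to touch both the leaf (serial) and the intermediate (name and key), which is the complexity the feedback above is pointing at.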

We believed this would be easy because many ACME clients are written in 
languages or running in environments that already have access to robust OCSP 
libraries. I wrote the first version of 
this<https://github.com/letsencrypt/boulder/blob/73b72e8fa2d852a40753926c34f38313a7db083d/wfe2/wfe_test.go#L3517-L3538>
 (constructing an OCSP request, parsing it, extracting the relevant parameters, 
and serializing them into a CertID) in a few minutes. Again, it's useful to 
know that we were wrong.

This leads to the question of: what should we use to uniquely identify the 
certificate instead? Certainly we could go with the "fingerprint" or 
"thumbprint" (a sha256 hash of DER bytes or PEM encoding, depending on who you 
ask, of the certificate) if people think that is sufficiently simple, easy to 
specify, unique, and future-proof. We could also go with "just the Serial", and 
force existing ACME servers to choose between either keeping serials unique 
across all issuers they represent, or splitting the server into multiple 
servers which each represent just a single issuer. Or we could return to the 
"url in the Order object" approach we started with. I'm curious what path 
forward people think is best.
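
For comparison, the fingerprint option is short to compute, but the "DER or PEM" ambiguity matters. The sketch below uses only the Python standard library; the function names are hypothetical and the input is placeholder bytes rather than a real certificate.

```python
import hashlib
import ssl


def fingerprint_der(pem_cert: str) -> str:
    """SHA-256 over the DER bytes recovered from a PEM certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem_cert)
    return hashlib.sha256(der).hexdigest()


def fingerprint_pem_text(pem_cert: str) -> str:
    """SHA-256 over the PEM text itself; sensitive to line endings and wrapping."""
    return hashlib.sha256(pem_cert.encode("ascii")).hexdigest()


# Placeholder bytes wrapped in PEM armor, standing in for a certificate.
pem = ssl.DER_cert_to_PEM_cert(b"\x30\x03\x02\x01\x01")
print(fingerprint_der(pem))
print(fingerprint_pem_text(pem))
print(fingerprint_pem_text(pem.replace("\n", "\r\n")))
```

The two definitions give different values for the same certificate, and the PEM-text hash changes with line endings, so a spec would need to pin the input to the DER bytes for the identifier to be unique and stable.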

5) Web browsers / HTTP clients are bound to "abuse" ARI because the GET request 
is not authenticated. Even if the information is not strictly sensitive, I can 
totally see some browsers or tools using ARI as a signal that a certificate is 
being revoked, and thus can no longer be trusted, and thus block a site before 
a server even sees that it needs to renew its cert. I could be incorrect, but 
can't the information needed to obtain ARI be scraped from CT logs? If so, 
I think a global ARI monitor/database is inevitable, and that has interesting 
implications that I don't know have been fully realized.

Yes, as mentioned above, this was a design goal as a result of community 
feedback. See this early 
discussion<https://mailarchive.ietf.org/arch/msg/acme/szDHa5z6qRiAtmeC2ohrePPoBjU/>
 for context. Again, this is a design goal that I'd be willing to compromise if 
there are sufficient reasons to do so, but I don't think that argument has been 
fully articulated as of yet.

All in all, the current ARI spec feels a little rushed. I'm hoping Let's 
Encrypt's production deployment is meant to help gather feedback about ARI 
before finalizing it, rather than to solidify it. Can we revisit both its 
fundamentals and practical implications too?

Yes, the IETF process is about "rough consensus and running code". We can't 
finalize the spec until something is running. Let's Encrypt's deployment, and 
our encouragement of client adoption, is so that we can receive precisely this 
kind of feedback before the draft becomes an RFC.

I would like to explore some alternatives to the current draft. I can think of 
two approaches that might address these concerns:

A) Instead of a totally separate flow to obtain ARI, simply utilize a 
Retry-After header in the flow of existing ACME responses. Upon finalizing an 
order, the ACME server can respond with a Retry-After header which acts as the 
current-draft Retry-After header for ARI responses. The client then attempts 
renewal at/after the Retry-After time, but with the OCSP CertID added to the 
NewOrder object; this indicates to the ACME server that the client is asking if 
now is a good time to renew the certificate indicated by the CertID. If it's 
not a good time, the ACME server can reply as such, with another Retry-After, 
and the client then waits and repeats, until the server actually issues the 
certificate. If the client needs the certificate immediately, simply omit the 
CertID from the NewOrder and the normal, "non-ARI" flow is assumed. This is 
backwards-compatible and requires no additional infrastructure or endpoints.
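
The client loop proposed in (A) might look roughly like this. It is a sketch under stated assumptions: new_order is a hypothetical callable standing in for the ACME newOrder request, returning whether a certificate was issued plus a Retry-After value in seconds, and the fallback after max_attempts is a design choice made for the example, not part of the proposal.

```python
def renew_with_retry_after(new_order, cert_id, sleep, max_attempts=10):
    """Ask the server to renew `cert_id`, honoring Retry-After until it issues.

    Returns the number of newOrder requests that were made.
    """
    for attempt in range(1, max_attempts + 1):
        issued, retry_after = new_order(cert_id)
        if issued:
            return attempt
        sleep(retry_after)  # server said "not yet": wait and ask again
    # Out of patience: omit the CertID to request the normal, non-ARI flow.
    issued, _ = new_order(None)
    if not issued:
        raise RuntimeError("renewal failed even without a CertID")
    return max_attempts + 1


# A fake server that defers twice before issuing.
calls = []

def fake_new_order(cert_id):
    calls.append(cert_id)
    if cert_id is None or len(calls) >= 3:
        return True, 0
    return False, 60

waits = []
print(renew_with_retry_after(fake_new_order, "example-cert-id", waits.append))
print(waits)
```

Note that each deferral still costs one request, so the request count is the same as polling a separate endpoint.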

I don't understand how this approach helps solve the issues you identified 
above. In order to get up-to-date information, the same number of requests 
still need to be made, it's just that now they're newOrder requests instead of 
renewalInfo requests. The unique identifier included in the request is no 
easier to construct. The Retry-After timestamp changing might still cause 
selfish clients to stop providing the CertID and renew right now.

Now, I am a fan of adding a field to newOrder requests which uniquely 
identifies the cert being replaced. If such a field is populated, the CA would 
treat it the same as if the client had made a POST request to mark the 
certificate as replaced (Section 4.2 of the current draft). This has many nice 
effects, like letting the CA track renewals explicitly (instead of attempting 
to identify them with heuristics), letting renewal requests bypass rate limits, 
and more. I just don't think it elegantly replaces the renewalInfo endpoint 
itself.

B) If we do need a separate flow for some reason, I would like to see a single 
endpoint containing a static JSON resource that describes all the active 
certificates that need early renewal, rather than one tediously-crafted URL per 
certificate. Certificates can be described by their NotBefore or NotAfter 
dates, serial numbers, or other relevant attributes. For example, if just a few 
certs with certain serials were misissued, those serials could be enumerated at 
this endpoint. Or if a mass revocation is happening, the timeframe of NotBefore 
dates could be listed, and ACME clients can simply check against the certs they 
manage with those dates, and replace them. You can represent millions of 
certificates in, like, 85 bytes this way. And it's way less work for clients 
and servers. And lastly, drop the "window" idea -- certificates described by 
this endpoint should be renewed ASAP: try to renew immediately, then back off 
and retry, for reasons described above (once we know the future is uncertain 
and/or revocation is imminent, current certs can't be trusted and/or clients 
must try to preserve their sites' uptime).
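
A sketch of how a client could consume the single static resource proposed in (B). The JSON layout is entirely hypothetical (the keys "serials" and "notBeforeRanges" and the ISO 8601 timestamps are invented for illustration), as are the function name and the sample certificates.

```python
import json
from datetime import datetime

# Hypothetical notice: two individually listed serials plus one NotBefore range.
NOTICE_JSON = """
{
  "serials": ["03a1", "7ff2"],
  "notBeforeRanges": [
    {"start": "2023-06-01T00:00:00+00:00", "end": "2023-06-03T00:00:00+00:00"}
  ]
}
"""


def needs_early_renewal(serial, not_before, notice):
    """True if the certificate is listed by serial or falls in a NotBefore range."""
    if serial in notice.get("serials", []):
        return True
    for rng in notice.get("notBeforeRanges", []):
        start = datetime.fromisoformat(rng["start"])
        end = datetime.fromisoformat(rng["end"])
        if start <= not_before <= end:
            return True
    return False


notice = json.loads(NOTICE_JSON)
managed = [
    ("03a1", datetime.fromisoformat("2023-05-01T00:00:00+00:00")),
    ("9c00", datetime.fromisoformat("2023-06-02T12:00:00+00:00")),
    ("9c01", datetime.fromisoformat("2023-07-01T00:00:00+00:00")),
]
# One fetch of the notice covers every managed certificate.
print([s for s, nb in managed if needs_early_renewal(s, nb, notice)])
```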

On the one hand, I'm in complete agreement, it would be great to have a "batch" 
endpoint that returns suggested windows for all certificates associated with a 
given account, or matching some other criteria. On the other hand, there's a 
reason that Let's Encrypt diverges from RFC8555 and does not implement the 
"orders" field on account objects: endpoints which serve unboundedly-large 
documents and require paging are difficult to implement correctly on both the 
server and client side, and can quickly lead to disruptive database queries.

And finally, I want to bring attention to the longer-term prospects for ARI: 
it's quite possible that ARI will become irrelevant before it is widely adopted 
by most clients. This itself may discourage adoption. As stated above, ARI has 
two primary use cases: revocation and traffic smoothing. As we push for shorter 
certificate lifetimes, revocation should become irrelevant. And traffic 
smoothing will perhaps become a natural consequence as clients are renewing 
more frequently anyway. We all know revocation and long-lived certificates are 
broken, so I'd rather WebPKI developers focus our energy on the ACTUAL goal: 
short-lived certificates. We should not be focusing our ecosystem resources on 
infrastructure that acts as a band-aid for a broken leg.

This is an interesting point. ARI was first 
conceived<https://bugzilla.mozilla.org/show_bug.cgi?id=1619179#c7> as a way to 
improve business continuity across mass revocation events, and grew from there. 
The idea that 10-day certs might be a reality, and that revocation would be 
wholly optional for them, was almost unimaginable at that time. But even today, 
the reality is that CAs such as Let's Encrypt will likely have to support 
revocation for a very long time to come: migrating the whole world to 10-day 
certs will not happen overnight. So I think that this work is worthwhile, even 
if other solutions are also on the horizon.

Thanks,
Aaron
