Re: [Acme] Practical concerns of draft-ietf-acme-ari

Michael Sweet Fri, 23 Jun 2023 10:43:56 -0700

FWIW, I agree with Matthew's comments and conclusions.

In a somewhat-related situation for printing, we have an event notification 
interface (RFC 3996) where the printer can report back a time interval (in 
seconds) when the client should re-contact the printer to get more events.  
This is flexible enough to handle both printer/server load and to let the 
client know when it should anticipate more events, i.e., the printer is printing 
something, the event subscription is for 'job-completed', and the printer can 
estimate when the print job will complete - this is analogous to an ACME 
certificate's expiration/renewal date/time.


Personally, the servers I maintain use Let's Encrypt and have a weekly cron job 
that checks whether the server's certificate needs to be renewed.  If the ACME 
server could provide a "retry after" response then my servers (ACME clients) 
could do a better job of scheduling the next update and not bug the ACME server 
so often...
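
For illustration, here is a minimal sketch of how a cron-style client might turn a "retry after" value into its next check time. It assumes the server sends a standard HTTP Retry-After value (either a number of seconds or an HTTP-date); the function name next_check_time is made up for this example and does not come from any ACME client or draft.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def next_check_time(retry_after, now):
    """Return when to contact the server again, given a Retry-After value.

    The value is either a delay in seconds or an HTTP-date.
    `now` must be a timezone-aware datetime.
    """
    value = retry_after.strip()
    if value.isdigit():
        # Delay form: a non-negative number of seconds.
        return now + timedelta(seconds=int(value))
    # Date form, e.g. "Fri, 30 Jun 2023 00:00:00 GMT".
    when = parsedate_to_datetime(value)
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    # Never schedule the next check in the past.
    return max(when, now)
```

A weekly cron job could store the returned time and skip its run until that time has passed.
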


> On Jun 23, 2023, at 12:20 PM, Matthew Holt <[email protected]> wrote:
> 
> Hi all,
> 
> I don't normally participate in these mailing lists, and last time I did I 
> feel like the lack of discussion was discouraging, as what little discussion 
> did occur wasn't taken seriously and was laced with complacency. Just stating 
> up front that I don't have much hope for this message to be acted upon. That 
> said, multiple people have strongly encouraged _someone_ to write to the mailing 
> list and bring the concerns of multiple ACME client developers to your 
> attention.
> 
> I speak for myself, but my views have been formed from a combination of 
> personal experience developing ACME clients and discussion with other ACME 
> client developers. So when I say "we" I do so loosely; sometimes it might 
> just be me.
> 
> First, I want to say: overall we like the idea of proactive ACME clients 
> being able to know whether a certificate needs to be replaced sooner than 
> expected, and we're glad to see an attempt at a solution drafted for 
> standardization. But some of us do not think (current draft) ARI is The Way.
> 
> Now that several ACME client authors have had the opportunity to implement 
> the spec, we've noticed some issues, both with fundamental flaws in the 
> concept of ARI and some in implementation. Initially these concerns were 
> raised at the Let's Encrypt forums:
> 
> - 
> https://community.letsencrypt.org/t/can-ari-conforming-clients-be-granted-exemptions-to-relevant-rate-limits/195600?u=mholt
> - 
> https://community.letsencrypt.org/t/thoughts-from-starting-to-play-with-ari/200276?u=mholt
> - https://community.letsencrypt.org/t/ari-rate-limits/198720?u=mholt
> - https://community.letsencrypt.org/t/ari-retry-after-header/195471?u=mholt
> 
> And the overwhelming response seems to be, "Meh, take it to the mailing 
> list." (Except for one response by LE staff about rate limits, which was 
> appreciated, at least.) So here we are.
> 
> Cutting to the chase:
> 
> With respect to ARI, ACME servers and clients have conflicts of interest. The 
> ACME client's goal is to keep the site up (with renewed and unrevoked 
> certificates); the optimal way to do this is to start renewing early and 
> retry often. The ACME server's goal is to keep the service up; the optimal 
> way to do this is to suppress clients that overload your capacity. Obviously, 
> these two goals are in opposition with each other. Proactive clients can 
> spike demand, which can cause service interruptions. But service 
> interruptions make clients more paranoid to retry even more often until it 
> works, and so on. ARI narrows the timeframe in which a conforming client can 
> retry failed renewals, which reduces reliability more as time goes on. 
> Without ARI, this window is a reasonable ~60 days. With ARI, however, the 
> window is reduced to just a few minutes, hours, or days. The less time until 
> expiration, the less hope there is to renew the cert in time. As the draft 
> currently stands, this is in the server's interest, but not the client's.
> 
> I can tell you, with the current draft, my ACME clients will use ARI as a 
> signal to immediately try renewing a certificate, not for scheduling a 
> renewal in the future.
> 
> Here's why.
> 
> The ACME client's goal is to keep the site up (with renewed and unrevoked 
> certificates). If everything always worked, we'd simply renew after about 99% 
> of the certificate's lifetime.
> 
> But obviously, that's not reality. In the presence of failures/uncertainty, 
> the optimal way to maximize uptime is to start renewing early and retry 
> often. In fact, just constantly be renewing. This offers the maximum possible 
> chances to successfully get a certificate.
> 
> But obviously, that's not reality. CAs rightly enforce rate limits, and 
> service uptime is actually Pretty Good most of the time, so we can reduce 
> network traffic, load on the CA, and pressure on CT infrastructure by waiting 
> until about 2/3 into a certificate's lifespan before trying to renew. (With 
> Let's Encrypt certificates this gives 30 days of runway.) This is a fair 
> balance and works well in practice.
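
The 2/3 rule described above is simple arithmetic. A small sketch, assuming a 90-day certificate (the lifetime Let's Encrypt uses); the function name renewal_time is hypothetical:

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before, not_after, numerator=2, denominator=3):
    """Return the point numerator/denominator of the way through the lifetime."""
    lifetime = not_after - not_before
    # Integer arithmetic on the timedelta avoids float rounding.
    return not_before + (lifetime * numerator) // denominator


not_before = datetime(2023, 1, 1, tzinfo=timezone.utc)
not_after = not_before + timedelta(days=90)
start = renewal_time(not_before, not_after)
print(start - not_before)  # 60 days, 0:00:00
print(not_after - start)   # 30 days, 0:00:00 of runway
```
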
> 
> But unfortunately, reality's not that simple. There are two off-nominal 
> events that are often mentioned as the motivation for ARI:
> 
> 1) Revocation
> 2) Traffic smoothing around expected maintenance or heavy load
> 
> Both of these can interfere with our happy little status-quo. Revocation 
> means we need to replace the certificate sooner than expected, and 
> maintenance or congestion means we may need to renew the certificate later 
> than expected.
> 
> Enter ARI. ARI is the CA saying, "We suggest -- but do not require -- this 
> specific timeframe within which to renew your certificate."
> 
> There are some problems with this:
> 
> 1) It is optional. No one will implement this. OK, some clients will -- but I 
> can say with authority from years of experience that optional restrictions 
> are not typically favored. Very little mainstream software follows best 
> practices to a tee.
> 
> 2) A narrower renewal timeframe makes clients less reliable. In theory it 
> should make them *more* reliable since it smooths out traffic, thus improving 
> CA availability. But this assumes that most clients actually implement and 
> follow ARI. Since it's optional, I don't see that happening. Especially since 
> most ACME clients are still running as static cron jobs like it's 2015...
> 
> I'm sure ARI doesn't really change in the nominal case, which is 99.9..9% of 
> the time. In fact, Let's Encrypt's ARI seems to correspond with when my 
> clients attempt renewals on their own anyway. (So in that sense, ARI is 
> actually useless 99.9..9% of the time?)
> 
> But when a renewal window does change, what does that mean? Well, something 
> is wrong. Either the certificate is being revoked, or the CA anticipates 
> downtime or availability issues.
> 
> Uh oh. That's bad news for a good little client which is trying its best to 
> keep its sites (potentially tens of thousands of them) online.
> 
> If we wait until the (adjusted) window to start renewing, we run ourselves 
> closer to the imminently-impending revocation or the expiration of the 
> certificate, lowering our chances of a successful renewal. If this is a mass 
> or CA-wide event, other clients have surely noticed too. Best to renew ASAP 
> and give ourselves more chances for success. Worst-case scenario, we'll retry 
> all the way into the designated window in which we expect to be able to get a 
> certificate anyway. And we might have to do this for tens of thousands of 
> certificates.
> 
> Because ARI is optional, it only acts as an early warning for clients that 
> wish for an advantage over other clients with the same goal when resources 
> are scarce. In these conditions, it's first come, first served, and clients 
> compete to preserve uptime for all their sites. (I think clients can still do 
> this respectfully with backoff and jitter.)
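
One common way to do "backoff and jitter" is exponential backoff with full jitter. The sketch below is only an illustration of that general technique; the base delay, the cap, and the name backoff_delays are assumptions, not values from the draft or from any particular client.

```python
import random


def backoff_delays(attempts, base=60.0, cap=86400.0, rng=random):
    """Yield one randomized delay in seconds per retry attempt.

    The upper bound doubles with each attempt (base, 2*base, 4*base, ...)
    up to `cap`; the actual delay is drawn uniformly from [0, bound] so
    that many clients retrying at once do not synchronize.
    """
    for attempt in range(attempts):
        bound = min(cap, base * 2 ** attempt)
        yield rng.uniform(0, bound)


for delay in backoff_delays(5):
    print(f"retry in {delay:.0f}s")
```
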
> 
> Note that this behavior is still in compliance with the draft ARI spec, which 
> says:
> 
>     Conforming clients MUST attempt renewal at a time of their choosing
>     based on the suggested renewal window.
> 
> It doesn't say the renewal MUST be attempted "within" the window, just "based 
> on" the window. (A minor language change to the spec, by the way, will not 
> change client behaviors. I think we need to take a different approach to ARI, 
> read on.)
> 
> Anyway, a few more practical issues/questions:
> 
> 1) Many CAs enforce rate limits. If clients are to honor ARI windows, we 
> would need a guarantee that the first successful cert within the ARI window 
> will be allowed regardless of relevant rate limits. Because ARI restricts a 
> client's ability to spread out renewals when managing certificates in bulk 
> with respect to rate limits, the rate limits must NOT be a blocker when 
> honoring ARI.
> 
> 2) If ARI were actually enforced, some concerns would be resolved... for 
> example, we can have assurances that other ACME clients are doing the same, 
> thus improving CA availability. It would essentially be the CA scheduling 
> each individual certificate for each ACME client instance -- that's quite a 
> powerful idea, as long as availability is guaranteed (which it's not).
> 
> 3) ARI does not scale well. Some ACME clients manage 10K+ certificates, and 
> in that case the client would have to check the ARI for at least 24 
> certificates per hour to get through them in a month. Deferring to the 
> Retry-After header may result in insufficient throughput. The current 
> expectation or convention is to check every certificate every 6-12 hours, or 
> tens of thousands of checks per day. One endpoint per certificate multiple 
> times per day is quite saturating. This is a considerable burden for both 
> ACME clients and servers. I would like to explore options that do not involve 
> 2+ HTTP requests per certificate.
> 
> 4) Crafting the URL is convoluted. As Peter Cooper described it, "The core 
> issue is that the URL you need to construct is based on an OCSP structure 
> identifying the certificate, which requires taking one's existing certificate 
> and parsing out the serial number and issuer, and also taking the 
> intermediate certificate that signed it and getting its public key too. So 
> rather than just, like, using the fingerprint of the existing leaf or 
> something similarly simple that a lot of tooling can already give you, one 
> needs to really dig into both the leaf, and the intermediate, and hash 
> various pieces thereof, and then take all that to build a new ASN.1 
> structure." Why are we striving for near-parity with an OCSP request?? This 
> should be orthogonal to OCSP, right?
> 
> 5) Web browsers / HTTP clients are bound to "abuse" ARI because the GET 
> request is not authenticated. Even if the information is not strictly 
> sensitive, I can totally see some browsers or tools using ARI as a signal 
> that a certificate is being revoked, and thus can no longer be trusted, and 
> thus block a site before a server even sees that it needs to renew its cert. 
> I could be incorrect, but can't the information needed to obtain ARI be 
> scraped from CT logs? If so, I think a global ARI monitor/database is 
> inevitable, and that has interesting implications that I don't know have been 
> fully realized.
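
On the scaling point (item 3 above), a rough back-of-the-envelope calculation shows where "tens of thousands of checks per day" comes from. It assumes exactly 10,000 certificates and one ARI request per certificate per check; the function name is made up for this example.

```python
def ari_checks_per_day(num_certs, check_interval_hours):
    """Number of per-certificate ARI requests a client makes per day."""
    return num_certs * 24 / check_interval_hours


for hours in (6, 12):
    print(hours, ari_checks_per_day(10_000, hours))
# 6 40000.0
# 12 20000.0
```
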
> 
> All in all, the current ARI spec feels a little rushed. I'm hoping Let's 
> Encrypt's production deployment is meant to help gather feedback about ARI 
> before finalizing it, rather than to solidify it. Can we revisit both its 
> fundamentals and practical implications too?
> 
> I would like to explore some alternatives to the current draft. I can think 
> of two approaches that might address these concerns:
> 
> A) Instead of a totally separate flow to obtain ARI, simply utilize a 
> Retry-After header in the flow of existing ACME responses. Upon finalizing an 
> order, the ACME server can respond with a Retry-After header which acts as 
> the current-draft Retry-After header for ARI responses. The client then 
> attempts renewal at/after the Retry-After time, but with the OCSP CertID 
> added to the NewOrder object; this indicates to the ACME server that the 
> client is asking if now is a good time to renew the certificate indicated by 
> the CertID. If it's not a good time, the ACME server can reply as such, with 
> another Retry-After, and the client then waits and repeats, until the server 
> actually issues the certificate. If the client needs the certificate 
> immediately, simply omit the CertID from the NewOrder and the normal, 
> "non-ARI" flow is assumed. This is backwards-compatible and requires no 
> additional infrastructure or endpoints.
> 
> B) If we do need a separate flow for some reason, I would like to see a 
> single endpoint containing a static JSON resource that describes all the 
> active certificates that need early renewal, rather than one 
> tediously-crafted URL per certificate. Certificates can be described by their 
> NotBefore or NotAfter dates, serial numbers, or other relevant attributes. 
> For example, if just a few certs with certain serials were misissued, those 
> serials could be enumerated at this endpoint. Or if a mass revocation is 
> happening, the timeframe of NotBefore dates could be listed, and ACME clients 
> can simply check against the certs they manage with those dates, and replace 
> them. You can represent millions of certificates in, like, 85 bytes this way. 
> And it's way less work for clients and servers. And lastly, drop the "window" 
> idea -- certificates described by this endpoint should be renewed ASAP: try 
> to renew immediately, then back off and retry, for reasons described above 
> (once we know the future is uncertain and/or revocation is imminent, current 
> certs can't be trusted and/or clients must try to preserve their sites' 
> uptime).
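
To make approach B concrete, here is a sketch of what a client-side check against such a resource might look like. The JSON layout and the field names ("serials", "notBeforeRanges", "start", "end") are entirely hypothetical; nothing like this is defined in the draft.

```python
import json
from datetime import datetime, timezone

# Hypothetical document a CA might publish at a single endpoint.
NOTICE = json.loads("""
{
  "serials": ["03a1f2", "04bb90"],
  "notBeforeRanges": [
    {"start": "2023-06-01T00:00:00+00:00", "end": "2023-06-03T00:00:00+00:00"}
  ]
}
""")


def needs_early_renewal(serial, not_before, notice):
    """Return True if the certificate is listed by serial or NotBefore range.

    `not_before` must be a timezone-aware datetime.
    """
    if serial in notice["serials"]:
        return True
    for window in notice["notBeforeRanges"]:
        start = datetime.fromisoformat(window["start"])
        end = datetime.fromisoformat(window["end"])
        if start <= not_before <= end:
            return True
    return False


issued = datetime(2023, 6, 2, tzinfo=timezone.utc)
print(needs_early_renewal("ffff", issued, NOTICE))  # True
```

A client managing many certificates would fetch the document once and run this check locally against every certificate it holds.
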
> 
> And finally, I want to bring attention to the longer-term prospects for ARI: 
> it's quite possible that ARI will become irrelevant before it is widely 
> adopted by most clients. This itself may discourage adoption. As stated 
> above, ARI has two primary use cases: revocation and traffic smoothing. As we 
> push for shorter certificate lifetimes, revocation should become irrelevant. 
> And traffic smoothing will perhaps become a natural consequence as clients 
> are renewing more frequently anyway. We all know revocation and long-lived 
> certificates are broken, so I'd rather WebPKI developers focus our energy on 
> the ACTUAL goal: short-lived certificates. We should not be focusing our 
> ecosystem resources on infrastructure that acts as a band-aid for a broken 
> leg.
> 
> That said, I'm not opposed to the general idea of a renewal hint for clients 
> in the meantime as long as it's simple, makes fundamental sense, and is 
> actually effective. I think the issues described above are mostly solvable 
> and now hopefully we can get there from here.
> 
> _______________________________________________
> Acme mailing list
> [email protected]
> https://www.ietf.org/mailman/listinfo/acme

________________________
Michael Sweet
