FWIW, I agree with Matthew's comments and conclusions. In a somewhat-related situation for printing, we have an event notification interface (RFC 3996) where the printer can report back a time interval (in seconds) after which the client should re-contact the printer to get more events. This is flexible enough both to handle printer/server load and to let the client know when it should anticipate more events. For example, if the printer is printing something and the event subscription is for 'job-completed', the printer can estimate when the print job will complete - this is analogous to an ACME certificate's expiration/renewal date/time.
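
A minimal sketch, in Python, of how a client might act on such a server-supplied interval. The function name, the clamping bounds, and the weekly fallback are illustrative assumptions; none of them come from RFC 3996 or from any ACME draft.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical bounds chosen for illustration only.
MIN_INTERVAL = timedelta(minutes=1)   # never poll faster than this
MAX_INTERVAL = timedelta(days=7)      # fall back to a weekly check


def next_contact(now, server_interval_seconds=None):
    """Return the time at which the client should next contact the server.

    If the server supplied no interval, use the fixed weekly fallback.
    Otherwise honor the server's interval, clamped to sane bounds.
    """
    if server_interval_seconds is None:
        return now + MAX_INTERVAL
    interval = timedelta(seconds=server_interval_seconds)
    interval = max(MIN_INTERVAL, min(interval, MAX_INTERVAL))
    return now + interval


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Server asked to be re-contacted in one hour.
    print(next_contact(now, 3600))
    # Server gave no hint: fall back to the weekly schedule.
    print(next_contact(now))
```

The clamp keeps a misbehaving server from either starving the client (a huge interval) or turning it into a tight polling loop (a tiny one).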
Personally, the servers I maintain use Let's Encrypt and have a weekly cron job that checks whether the server's certificate needs to be renewed. If the ACME server could provide a "retry after" response then my servers (ACME clients) could do a better job of scheduling the next update and not bug the ACME server so often...

> On Jun 23, 2023, at 12:20 PM, Matthew Holt <[email protected]> wrote:
>
> Hi all,
>
> I don't normally participate in these mailing lists, and last time I did I
> feel like the lack of discussion was discouraging, as what little discussion
> did occur wasn't taken seriously and was laced with complacency. Just stating
> up front that I don't have much hope for this message to be acted upon. That
> said, multiple people have strongly encouraged _someone_ to write the mailing
> list and bring the concerns of multiple ACME client developers to your
> attention.
>
> I speak for myself, but my views have been formed from a combination of
> personal experience developing ACME clients and discussion with other ACME
> client developers. So when I say "we" I do so loosely; sometimes it might
> just be me.
>
> First, I want to say: overall we like the idea of proactive ACME clients
> being able to know whether a certificate needs to be replaced sooner than
> expected, and we're glad to see an attempt at a solution drafted for
> standardization. But some of us do not think (current draft) ARI is The Way.
>
> Now that several ACME client authors have had the opportunity to implement
> the spec, we've noticed some issues, both with fundamental flaws in the
> concept of ARI and some in implementation.
> Initially these concerns were
> raised at the Let's Encrypt forums:
>
> - https://community.letsencrypt.org/t/can-ari-conforming-clients-be-granted-exemptions-to-relevant-rate-limits/195600?u=mholt
> - https://community.letsencrypt.org/t/thoughts-from-starting-to-play-with-ari/200276?u=mholt
> - https://community.letsencrypt.org/t/ari-rate-limits/198720?u=mholt
> - https://community.letsencrypt.org/t/ari-retry-after-header/195471?u=mholt
>
> And the overwhelming response seems to be, "Meh, take it to the mailing
> list." (Except for one response by LE staff about rate limits, which was
> appreciated, at least.) So here we are.
>
> Cutting to the chase:
>
> With respect to ARI, ACME servers and clients have conflicts of interest. The
> ACME client's goal is to keep the site up (with renewed and unrevoked
> certificates); the optimal way to do this is to start renewing early and
> retry often. The ACME server's goal is to keep the service up; the optimal
> way to do this is to suppress clients that overload your capacity. Obviously,
> these two goals are in opposition with each other. Proactive clients can
> spike demand, which can cause service interruptions. But service
> interruptions make clients more paranoid to retry even more often until it
> works, and so on. ARI narrows the timeframe in which a conforming client can
> retry failed renewals, which reduces reliability more as time goes on.
> Without ARI, this window is a reasonable ~60 days. With ARI, however, the
> window is reduced to just a few minutes, hours, or days. The less time until
> expiration, the less hope there is to renew the cert in time. As the draft
> currently stands, this is in the server's interest, but not the client's.
>
> I can tell you, with the current draft, my ACME clients will use ARI as a
> signal to immediately try renewing a certificate, not for scheduling a
> renewal in the future.
>
> Here's why.
>
> The ACME client's goal is to keep the site up (with renewed and unrevoked
> certificates). If everything always worked, we'd simply renew after about 99%
> of the certificate's lifetime.
>
> But obviously, that's not reality. In the presence of failures/uncertainty,
> the optimal way to maximize uptime is to start renewing early and retry
> often. In fact, just constantly be renewing. This offers the maximum possible
> chances to successfully get a certificate.
>
> But obviously, that's not reality. CAs rightly enforce rate limits, and
> service uptime is actually Pretty Good most of the time, so we can reduce
> network traffic, load on the CA, and pressure on CT infrastructure by waiting
> until about 2/3 into a certificate's lifespan before trying to renew. (With
> Let's Encrypt certificates this gives 30 days of runway.) This is a fair
> balance and works well in practice.
>
> But unfortunately, reality's not that simple. There are two off-nominal
> events that are often mentioned as the motivation for ARI:
>
> 1) Revocation
> 2) Traffic smoothing around expected maintenance or heavy load
>
> Both of these can interfere with our happy little status-quo. Revocation
> means we need to replace the certificate sooner than expected, and
> maintenance or congestion means we may need to renew the certificate later
> than expected.
>
> Enter ARI. ARI is the CA saying, "We suggest -- but do not require -- this
> specific timeframe within which to renew your certificate."
>
> There are some problems with this:
>
> 1) It is optional. No one will implement this. OK, some clients will -- but I
> can say with authority from years of experience that optional restrictions
> are not typically favored. Very little mainstream software follow best
> practices to a tee.
>
> 2) A narrower renewal timeframe makes clients less reliable. In theory it
> should make them *more* reliable since it smooths out traffic, thus improving
> CA availability.
> But this assumes that most clients actually implement and
> follow ARI. Since it's optional, I don't see that happening. Especially since
> most ACME clients are still running as static cron jobs like it's 2015...
>
> I'm sure ARI doesn't really change in the nominal case, which is 99.9..9% of
> the time. In fact, Let's Encrypt's ARI seems to correspond with when my
> clients attempt renewals on their own anyway. (So in that sense, ARI is
> actually useless 99.9..9% of the time?)
>
> But when a renewal window does change, what does that mean? Well, something
> is wrong. Either the certificate is being revoked, or the CA anticipates
> downtime or availability issues.
>
> Uh oh. That's bad news for a good little client which is trying its best to
> keep its sites (potentially tens of thousands of them) online.
>
> If we wait until the (adjusted) window to start renewing, we run ourselves
> closer to the imminently-impending revocation or the expiration of the
> certificate, lowering our chances of a successful renewal. If this is a mass
> or CA-wide event, other clients have surely noticed too. Best to renew ASAP
> and give ourselves more chances for success. Worst-case scenario, we'll retry
> all the way into the designated window in which we expect to be able to get a
> certificate anyway. And we might have to do this for 10s of thousands of
> certificates.
>
> Because ARI is optional, it only acts as an early warning for clients that
> wish for an advantage over other clients with the same goal when resources
> are scarce. In these conditions, it's first-come-first-serve and clients
> compete to preserve uptime for all their sites. (I think clients can still do
> this respectfully with backoff and jitter.)
>
> Note that this behavior is still in compliance with the draft ARI spec, which
> says:
>
>     Conforming clients MUST attempt renewal at a time of their choosing
>     based on the suggested renewal window.
>
> It doesn't say the renewal MUST be attempted "within" the window, just "based
> on" the window. (A minor language change to the spec, by the way, will not
> change client behaviors. I think we need to take a different approach to ARI,
> read on.)
>
> Anyway, a few more practical issues/questions:
>
> 1) Many CAs enforce rate limits. If clients are to honor ARI windows, we
> would need a guarantee that the first successful cert within the ARI window
> will be allowed regardless of relevant rate limits. Because ARI restricts a
> client's ability to spread out renewals when managing certificates in bulk
> with respect to rate limits, the rate limits must NOT be a blocker when
> honoring ARI.
>
> 2) If ARI were actually enforced, some concerns would be resolved... for
> example, we can have assurances that other ACME clients are doing the same,
> thus improving CA availability. It would essentially be the CA scheduling
> each individual certificate for each ACME client instance -- that's quite a
> powerful idea, as long as availability is guaranteed (which it's not).
>
> 3) ARI does not scale well. Some ACME clients manage 10K+ certificates, and
> in that case the client would have to check the ARI for at least 24
> certificates per hour to get through them in a month. Deferring to the
> Retry-After header may result in insufficient throughput. The current
> expectation or convention is to check every certificate every 6-12 hours, or
> tens of thousands of checks per day. One endpoint per certificate multiple
> times per day is quite saturating. This is a considerable burden for both
> ACME clients and servers. I would like to explore options that do not involve
> 2+ HTTP requests per certificate.
>
> 4) Crafting the URL is convoluted.
> As Peter Cooper described it, "The core
> issue is that the URL you need to construct is based on an OCSP structure
> identifying the certificate, which requires taking one's existing certificate
> and parsing out the serial number and issuer, and also taking the
> intermediate certificate that signed it and getting its public key too. So
> rather than just, like, using the fingerprint of the existing leaf or
> something similarly simple that a lot of tooling can already give you, one
> needs to really dig into both the leaf, and the intermediate, and hash
> various pieces thereof, and then take all that to build a new ASN.1
> structure." Why are we striving for near-parity with an OCSP request?? This
> should be orthogonal to OCSP, right?
>
> 5) Web browsers / HTTP clients are bound to "abuse" ARI because the GET
> request is not authenticated. Even if the information is not strictly
> sensitive, I can totally see some browsers or tools using ARI as a signal
> that a certificate is being revoked, and thus can no longer be trusted, and
> thus block a site before a server even sees that it needs to renew its cert.
> I could be incorrect, but can't the information needed to obtain ARI can be
> scraped from CT logs? If so, I think a global ARI monitor/database is
> inevitable, and that has interesting implications that I don't know have been
> fully realized.
>
> All in all, the current ARI spec feels a little rushed. I'm hoping Let's
> Encrypt's production deployment is meant to help gather feedback about ARI
> before finalizing it, rather than to solidify it. Can we revisit both its
> fundamentals and practical implications too?
>
> I would like to explore some alternatives to the current draft. I can think
> of two approaches that might address these concerns:
>
> A) Instead of a totally separate flow to obtain ARI, simply utilize a
> Retry-After header in the flow of existing ACME responses.
> Upon finalizing an
> order, the ACME server can respond with a Retry-After header which acts as
> the current-draft Retry-After header for ARI responses. The client then
> attempts renewal at/after the Retry-After time, but with the OCSP CertID
> added to the NewOrder object; this indicates to the ACME server that the
> client is asking if now is a good time to renew the certificate indicated by
> the CertID. If it's not a good time, the ACME server can reply as such, with
> another Retry-After, and the client then waits and repeats, until the server
> actually issues the certificate. If the client needs the certificate
> immediately, simply omit the CertID from the NewOrder and the normal,
> "non-ARI" flow is assumed. This is backwards-compatible and requires no
> additional infrastructure or endpoints.
>
> B) If we do need a separate flow for some reason, I would like to see a
> single endpoint containing a static JSON resource that describes all the
> active certificates that need early renewal, rather than one
> tediously-crafted URL per certificate. Certificates can be described by their
> NotBefore or NotAfter dates, serial numbers, or other relevant attributes.
> For example, if just a few certs with certain serials were misissued, those
> serials could be enumerated at this endpoint. Or if a mass revocation is
> happening, the timeframe of NotBefore dates could be listed, and ACME clients
> can simply check against the certs they manage with those dates, and replace
> them. You can represent millions of certificates in, like, 85 bytes this way.
> And it's way less work for clients and servers. And lastly, drop the "window"
> idea -- certificates described by this endpoint should be renewed ASAP: try
> to renew immediately, then back off and retry, for reasons described above
> (once we know the future is uncertain and/or revocation is imminent, current
> certs can't be trusted and/or clients must try to preserve their sites'
> uptime).
>
> And finally, I want to bring attention to the longer-term prospects for ARI:
> it's quite possible that ARI will become irrelevant before it is widely
> adopted by most clients. This itself may discourage adoption. As stated
> above, ARI has two primary use cases: revocation and traffic smoothing. As we
> push for shorter certificate lifetimes, revocation should become irrelevant.
> And traffic smoothing will perhaps become a natural consequence as clients
> are renewing more frequently anyway. We all know revocation and long-lived
> certificates are broken, so I'd rather WebPKI developers focus our energy on
> the ACTUAL goal: short-lived certificates. We should not be focusing our
> ecosystem resources on infrastructure that acts as a band-aid for a broken
> leg.
>
> That said, I'm not opposed to the general idea of a renewal hint for clients
> in the meantime as long as it's simple, makes fundamental sense, and is
> actually effective. I think the issues described above are mostly solvable
> and now hopefully we can get there from here.
>
> _______________________________________________
> Acme mailing list
> [email protected]
> https://www.ietf.org/mailman/listinfo/acme

________________________
Michael Sweet
