Hi,

After working heavily with ACME clients for the past 5 years (including 
"ACMEv1" before RFC 8555), I've come to realize there are some unfortunate 
ambiguities and inefficiencies in RFC 8555 with regard to server behavior 
after a client attempts a challenge and fails.

I recently implemented an RFC 8555-compliant client library in Go 
(https://github.com/mholt/acmez), and am convinced that a simple revision to 
the spec can both reduce costs for CAs *and* greatly simplify client 
implementations, if only the handling of failed challenges is revised.

My realizations are spelled out in this commit: 
https://github.com/mholt/acmez/commit/80adb6d5e64a3d36a56c58c66965b131ea366b8c

In summary: to get a certificate, a client creates an Order. The client then 
has to validate all Authorizations ("authzs"). For each Authorization, the 
client needs to successfully complete one of the offered Challenges. One 
successful challenge is sufficient to validate the authz. However, one failed 
challenge is apparently sufficient to invalidate the authz, and thus the entire 
Order. To try another challenge, the client then has to deactivate the other 
Authorizations (expensive) and create a new Order (also expensive), repeating 
the whole process. Instead, the client should be able to simply try the next 
challenge. In other words, a single failed challenge should not invalidate an 
authz; an authz should be "pending" until all offered challenges have failed or 
one has succeeded.
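To make the flow concrete, here is a minimal sketch in Go of how a client would *like* to handle an authz under the behavior I'm proposing (the types and names here are illustrative only, not the acmez API):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified types for illustration only; these are
// not the acmez API.
type challenge struct {
	typ string
	ok  bool // whether our environment can solve this challenge type
}

type authz struct {
	identifier string
	challenges []challenge
}

// solveAuthz models the behavior this post argues for: keep trying
// the offered challenges until one succeeds, and only treat the
// authz as invalid after ALL of them have failed.
func solveAuthz(az authz) error {
	for _, ch := range az.challenges {
		if ch.ok {
			return nil // one success is enough to make the authz valid
		}
		// Under RFC 8555 as written, the server may mark the whole
		// authz "invalid" right here, forcing a brand-new order
		// instead of letting us fall through to the next challenge.
	}
	return errors.New("all offered challenges failed for " + az.identifier)
}

func main() {
	az := authz{
		identifier: "example.com",
		challenges: []challenge{
			{"tls-alpn-01", false}, // e.g. TLS terminated upstream of the client
			{"http-01", true},      // but port 80 is reachable
		},
	}
	fmt.Println(solveAuthz(az)) // proposed behavior: succeeds via http-01
}
```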

The commit I linked to above corrected my initial (overly optimistic) 
interpretation of RFC 8555, under which, if a challenge failed, I could 
simply try another one. The correction involves creating a whole new order 
and adds 250 lines of code, nearly doubling the complexity of handling the 
most common failure scenario. That's not to mention the added cost of the 
DB transactions the CA has to perform to invalidate an entire order.

The ACME spec allows a server to offer an array of challenges for each authz. 
In practice, there is little point offering more than one challenge if only 
one can ever be attempted.

I propose that RFC 8555 §7.5.1 be revised to say: "The server is said to 
'finalize' the authorization when it has successfully completed one of the 
challenges or failed all of them."
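Under the proposed wording, a server's status computation for an authz would look something like this (a minimal sketch; authzStatus is a hypothetical helper, not from RFC 8555 or any server implementation):

```go
package main

import "fmt"

// authzStatus is a hypothetical helper sketching the proposed rule:
// one "valid" challenge makes the authz valid; the authz becomes
// "invalid" only once ALL offered challenges are invalid; otherwise
// it stays "pending" so the client can try another challenge.
func authzStatus(challengeStatuses []string) string {
	anyValid, allInvalid := false, true
	for _, s := range challengeStatuses {
		if s == "valid" {
			anyValid = true
		}
		if s != "invalid" {
			allInvalid = false
		}
	}
	switch {
	case anyValid:
		return "valid"
	case allInvalid:
		return "invalid"
	default:
		return "pending"
	}
}

func main() {
	fmt.Println(authzStatus([]string{"invalid", "pending", "pending"})) // pending
	fmt.Println(authzStatus([]string{"invalid", "valid"}))              // valid
	fmt.Println(authzStatus([]string{"invalid", "invalid"}))            // invalid
}
```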

My commit message is quoted below, for convenience. It goes into more detail 
about the difficulties of the current spec (pardon any stream of consciousness 
as I was writing this deep in "developer mode"):

The ACME spec (RFC 8555) is somewhat ambiguous/conflicting about
finalizing authorizations. In §7.1.4 it says:

      client should attempt to fulfill one of these challenges, and a
      server should consider any one of the challenges sufficient to
      make the authorization valid.

This makes it sound like solving any one of the possible challenges for
an authz is sufficient to make an authz "valid".

But elsewhere, §7.5.1 implies that if any one of the challenges fails, the
entire authz is considered "invalid":

   The server is said to "finalize" the authorization when it has
   completed one of the validations.  This is done by assigning the
   authorization a status of "valid" or "invalid", corresponding to
   whether it considers the account authorized for the identifier.

To my dismay, it appears that if any one of the challenges listed for an
authz is marked "invalid", the entire order indeed fails. This means
that a server may offer http-01, tls-alpn-01, and dns-01 challenges for
an authz, and if a client tries tls-alpn-01 and fails, it cannot simply
try http-01.

This is very unfortunate, because falling back to another challenge type
is a very common need, especially in deployments where site owners don't
control their customers' domain names. We see many cases where port 443
has TLS termination in front of the ACME client (breaking the tls-alpn-01
challenge), but where port 80
is open and the http-01 challenge would succeed. We also see the reverse,
where port 80 is blocked but port 443 is open. There is often no way
for the client to know this ahead of time because it does not have an
outside perspective.

Because a single failed challenge invalidates *the entire authz* even
though other challenges *offered by the server as acceptable options*
are still perfectly capable of succeeding, we need to cancel the order
(which involves deactivating the remaining authorizations one-by-one)
and make a new one.

SUPER unfortunately, newOrder calls are rate-limited by Let's Encrypt,
effectively halving even a correctly-implemented, robust, and well-
behaved ACME client's management capacity. Orders are also associated
with a lot of state, and as such, are expensive database transactions
on the server-side. Further, client-side logic is forced to be much
more complex in order to correctly take advantage of all offered
challenge types. Clients that don't do this effectively ignore all but
the first, making it pointless to offer more than one challenge type in
the first place!

The previous logic was much cleaner and more elegant: an order was
created, its authorizations were iterated, and each authorization's
challenges were iterated until one succeeded. If any authorization
failed (i.e. all challenges failed), it simply returned that error and
the order was cancelled (other authorizations were deactivated). This
kept all error-handling and retry state local to the respective loops:

    authzs -> challenges

That was the previous logic. Now, we have a third loop:

    order retries -> authzs -> challenges

We need to bubble retry state up to the topmost "order" loop, which
gets manipulated in the innermost "challenge" loop. We have to carry
failure state around through the whole retry process, mapping
identifiers to challenge types in order to remember which challenges
failed for which identifiers so we don't try them again on the next
order.
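Schematically, that bookkeeping ends up looking like this (a simplified sketch; the types, maps, and retry limit here are illustrative, not the actual acmez code):

```go
package main

import "fmt"

// getCertificate sketches the three-loop structure forced by the
// current spec: failed challenge types must be remembered per
// identifier so the NEXT order doesn't repeat them. This is a
// schematic, not the real acmez implementation.
func getCertificate(identifiers []string, offered map[string][]string, solvable map[string]bool) (map[string]string, error) {
	failed := map[string]map[string]bool{} // identifier -> challenge type -> failed before
	const maxOrders = 3

	for try := 0; try < maxOrders; try++ { // outer "order retries" loop
		solved := map[string]string{}
		ok := true
		for _, id := range identifiers { // "authzs" loop
			done := false
			for _, typ := range offered[id] { // "challenges" loop
				if failed[id][typ] {
					continue // don't retry a type that failed in a previous order
				}
				if solvable[typ] {
					solved[id] = typ
					done = true
					break
				}
				if failed[id] == nil {
					failed[id] = map[string]bool{}
				}
				failed[id][typ] = true
				// One failure invalidates the authz and thus the order:
				// deactivate remaining authzs and start a new order.
				ok = false
				break
			}
			if !done {
				ok = false
				break
			}
		}
		if ok {
			return solved, nil
		}
	}
	return nil, fmt.Errorf("no order succeeded after %d tries", maxOrders)
}

func main() {
	solved, err := getCertificate(
		[]string{"example.com"},
		map[string][]string{"example.com": {"tls-alpn-01", "http-01"}},
		map[string]bool{"http-01": true}, // only http-01 works in this environment
	)
	fmt.Println(solved, err) // the SECOND order succeeds, using http-01
}
```

Note that the fallback to http-01 costs an entire extra order; under the proposed revision, the innermost loop alone would suffice.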

Additionally, our challenge selection is necessarily made more complex.
Before, we could just randomize the order of the challenges (as a good
practice, to avoid accidental dependence on just one challenge type).
Now, because retries are expensive and complex, we absolutely need to
avoid them as much as possible. So instead of a random order, we keep
a history of challenge success rates and choose the most successful
challenge type first, every time. If it fails, we try the next-most-
successful, and so on, but each retry is part of a new order and that's
expensive.
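That preference ordering can be sketched like so (the counters here are hypothetical, not acmez's actual state):

```go
package main

import (
	"fmt"
	"sort"
)

// stats holds hypothetical per-challenge-type history, for
// illustration only.
type stats struct{ successes, attempts int }

func rate(s stats) float64 {
	if s.attempts == 0 {
		return 0.5 // no history yet: neutral prior
	}
	return float64(s.successes) / float64(s.attempts)
}

// orderByHistory sorts the offered challenge types so the
// historically most successful type is attempted first, to minimize
// expensive order retries.
func orderByHistory(offered []string, history map[string]stats) []string {
	out := append([]string(nil), offered...)
	sort.SliceStable(out, func(i, j int) bool {
		return rate(history[out[i]]) > rate(history[out[j]])
	})
	return out
}

func main() {
	history := map[string]stats{
		"http-01":     {successes: 9, attempts: 10},
		"tls-alpn-01": {successes: 2, attempts: 10},
	}
	// dns-01 has no history, so it gets the neutral 0.5 prior.
	fmt.Println(orderByHistory([]string{"tls-alpn-01", "http-01", "dns-01"}, history))
	// [http-01 dns-01 tls-alpn-01]
}
```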

The ACME spec forces leaky, complex abstractions and makes writing
correct clients more difficult and error-prone than necessary. (Just
look at this commit!) I am not aware of any good reason the spec is the
way it is on this point. The explanations I've heard are "it's simpler
for servers that way" and "free CAs want to keep their costs down," but
it's NOT simpler this way (again, look at the code). Order transactions
are *expensive* -- CAs don't even want frequent polling on order status
because of all the state attached to an order -- yet the way the spec is
written requires significantly more CPU and network cycles than
necessary.

Because it only takes one successful challenge to mark the authz as
"valid", and because order transactions are expensive for the server,
and because the client-side logic is immeasurably more complex and
convoluted and tricky to get right this way, the current ACME spec is
nonsensical on this point. Maybe it intended to optimize for server
implementations (which it didn't do successfully, as explained), but
forgot that ACME *clients* would fill the world, not servers; and now
we have something that is unintentionally hostile toward clean, correct,
efficient, and low-cost implementations.

In summary:

The ACME protocol should be changed so that an authz is not marked
"invalid" until ALL offered challenges have failed, rather than just one.

_______________________________________________
Acme mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/acme
