On Fri, Dec 12, 2025 at 8:34 AM, Petr Špaček <[email protected]> wrote:

> Hello dnsop.
>
> We have encountered a DNS deployment like this:
>
> caching forwarder (forwards to)
> -> anycast IP
> -> load balancer level 1
> -> load balancer level 2
> -> recursive resolver
>
> The trouble is, each layer uses a different timeout and retry strategy, and
> this caused very interesting behaviors where some layers abandoned the
> original request and resent it, while other layers were still trying to
> resolve an answer nobody was waiting for, with a sort of snowball effect.
>
> One possibility to counter that is a new EDNS option:
> - 16-bit value as the number of milliseconds the client is willing to wait.
> - Requester SHOULD subtract the RTT to the responder if it is known.
> - Responder SHOULD use that as an upper bound for its own timeout.
> - If the timeout expires, responder SHOULD send back SERVFAIL with a suitable
> EDE code 'user specified timeout expired'.
>
> I sense such an option could prevent the situation where layer #1 timed out
> and resent the query to a (different) layer #2 instance, while the first
> instances of layers #2, #3, and #4 are still waiting and doing their own
> recursion and retries.
>
> I think it would be useful for complicated forwarding setups even if stubs
> don't see the need or take a long time to adopt it.
>
> What does this group think? Worth a draft?
>
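
For anyone skimming the proposal, here is a minimal sketch of how the
option described above might be encoded and applied. Everything named in it
is an assumption made for illustration: the option code 65001 is simply
taken from the EDNS local/experimental range (no code point has been
assigned), and the function names are hypothetical. It only covers the
16-bit value, the RTT subtraction, and the upper-bound rule; the SERVFAIL
plus EDE response is not shown.

```python
import struct
from typing import Optional

# Hypothetical option code from the EDNS local/experimental range
# (65001-65534); nothing has been assigned for this idea.
OPT_CLIENT_TIMEOUT = 65001
MAX_MS = 0xFFFF  # largest value a 16-bit field can carry


def encode_timeout_option(timeout_ms: int) -> bytes:
    """Pack OPTION-CODE, OPTION-LENGTH and a 16-bit millisecond value."""
    clamped = max(0, min(timeout_ms, MAX_MS))
    return struct.pack("!HHH", OPT_CLIENT_TIMEOUT, 2, clamped)


def decode_timeout_option(wire: bytes) -> int:
    """Return the millisecond value, rejecting anything that is not our option."""
    code, length, timeout_ms = struct.unpack("!HHH", wire)
    if code != OPT_CLIENT_TIMEOUT or length != 2:
        raise ValueError("not a client-timeout option")
    return timeout_ms


def requester_value(willing_to_wait_ms: int, rtt_ms: Optional[int] = None) -> int:
    """Value the requester sends: its own budget minus the RTT, if known."""
    if rtt_ms is None:
        return max(0, willing_to_wait_ms)
    return max(0, willing_to_wait_ms - rtt_ms)


def responder_timeout(local_ms: int, option_ms: Optional[int]) -> int:
    """Responder treats the received value as an upper bound on its own timeout."""
    return local_ms if option_ms is None else min(local_ms, option_ms)


# A forwarder willing to wait 2000 ms, with a 30 ms RTT to the next layer:
wire = encode_timeout_option(requester_value(2000, rtt_ms=30))
# The next layer normally waits 5000 ms but is capped by the option:
capped = responder_timeout(5000, decode_timeout_option(wire))
```

In a chain like the one above, each layer would presumably act as responder
and requester in turn, so the budget shrinks hop by hop and no layer keeps
working after the layer in front of it has given up.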


Personally I'd much rather see an *operational* document describing how
setups like the above are a bad idea and are likely to come back and bite you.
"Doctor, doctor, it hurts when I do this…."

There is a massive amount of tribal knowledge about how to build, run and
deploy DNS services, but we haven't really done a great job of writing that
down.

Back in February of this year Puneet and I started writing down some "DNS
Best Operational Practices"[0]. The plan is / was to just collect all of
the shared **operational** knowledge in one place (sort of like a big
knowledge base), and then go through and break it out into advice for
Authoritative Operators (Large and Small), Recursive Operators (Large,
Small, Enterprise), Common (e.g. "You should monitor stuff!"), etc.[1] These
are then ideally brought to DNSOP (or similar) and published as RFCs if
appropriate, or as something like a Guidebook / How To for introductory and
deployment advice.

Link[2]:
https://docs.google.com/document/d/1A0dJX4LNiyFDjK-ECR6hMD_l1JoKF1nprOWX5noEXrU/edit?usp=sharing


I explicitly do not want this to lead to confusion around where
"protocol-like" or consensus decisions get made, nor into fights around
things like iterations, MTU, etc - and so the initial version of the
document was just "Here is operational advice from RFCs" - this keeps it
clear that DNSOP is where standardization happens, and this is more a
collection and collation of existing advice. I think that this is a very
important principle to keep in mind - this document (and things which
spring from it) are "Here is guidance, commentary, clarification,
exposition, interpretation, annotation, and elaboration based on RFCs" -
basically something like the long awaited "Hitchhiker's Guide to the DNS -
written for people actually *using* this stuff".

W


[0]: Somewhat modeled on the NOG BCOPs model.
[1]: Yes, I am aware of the original DNS-OARC panel; that is what
kickstarted this document. I have shared it with a few people.
[2]: This is currently "Anyone with the link can view" - please poke me if
you want comment / edit access. I was about to send it out like that, but
then figured I didn't want to deal with potential spam….



> --
> Petr Špaček
> Internet Systems Consortium
>
> _______________________________________________
> DNSOP mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
>
>