Some comments from a quick read of this draft, as we're handling a
quirky issue around NS timeouts and it looked relevant.

Firstly, some resolver implementations do cache upstream NS timeouts in
various non-standard ways. The resolver I work on has at least 3 or 4
different mechanisms within the same codebase. Documenting how timeouts
should be handled seems like a good idea, so I support this draft.

> Internet Engineering Task Force                               D. Wessels
> Internet-Draft                                                W. Carroll
> Intended status: Standards Track                               M. Thomas
> Expires: 17 July 2022                                           Verisign
>                                                          13 January 2022


>               Negative Caching of DNS Resolution Failures
>            draft-dwmtwc-dnsop-caching-resolution-failures-00

[snip]

>    [RFC4697] is a Best Current Practice that documents observed
>    resolution misbehaviors.  It describes a number of situations that
>    can lead to excessive queries from recusrive resolvers. including:

There's a spelling mistake in "recusrive", and the period after
"resolvers." should be removed.

[snip]

> 3.2.  TTLs

>    Resolvers MUST cache resolution failures for at least 5 seconds.
>    Resolvers SHOULD employ an exponential backoff algorithm to increase
>    the amount of time for subsequent resolution failures.  For example,
>    the initial negative cache TTL is set to 5 seconds.  The TTL is

I am guessing the authors meant to write "timeout cache TTL" here
instead of negative cache TTL. In any case, the phrase "negative cache
TTL" has a well-understood meaning per RFC 2308, and should not be
overloaded/reused to indicate timeout cache TTL.
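
For illustration only, here is a rough Python sketch of the kind of
exponential backoff the quoted text describes, using "timeout cache TTL"
as the term. Only the 5-second starting value comes from the draft text
quoted above; the doubling factor, the upper bound, and all names are my
own assumptions, not something taken from the draft.

```python
INITIAL_TTL = 5      # seconds; the minimum from the draft text quoted above
BACKOFF_FACTOR = 2   # assumption: double the TTL on each further failure
MAX_TTL = 300        # assumption: arbitrary upper bound, in seconds


def timeout_cache_ttl(consecutive_failures: int) -> int:
    """Return the timeout cache TTL (seconds) after N consecutive failures."""
    if consecutive_failures < 1:
        raise ValueError("consecutive_failures must be >= 1")
    ttl = INITIAL_TTL * BACKOFF_FACTOR ** (consecutive_failures - 1)
    # Clamp so a long-dead server is still retried occasionally.
    return min(ttl, MAX_TTL)
```

With these assumed values, the first failure is cached for 5 seconds,
the second for 10, the third for 20, and so on up to the bound.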

[snip]

> 3.3.  Scope

>    Resolution failures MUST be cached against the specific query tuple
>    <query name, type, class, server IP address>.

Have you considered the effect of caching the timeout against just the
upstream server's IP address? I'm not saying you should, but I'm
wondering whether any of the other tuple fields are relevant enough to
warrant separate, more specific timeout cache entries.

In other words, is it necessary for there to be a distinction among
timeouts for:

(1) example.org., A, IN, 10.0.0.1

(2) example.org., TYPE65, IN, 10.0.0.1

(3) example.com., A, IN, 10.0.0.1

Traditionally, a resolver's upstream RTTs and timeouts are tracked
against the nameserver IP address. A failure to respond has been
considered as a property of the NS (implementation) or path to that NS.
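
To make the difference concrete, here is a small Python sketch that
counts how many timeout cache entries the three example queries above
would create under each keying scheme. The type and function names are
hypothetical and not taken from the draft or from any resolver.

```python
from typing import Callable, Hashable, Iterable, NamedTuple


class Query(NamedTuple):
    qname: str
    qtype: str
    qclass: str
    server_ip: str


def tuple_key(q: Query) -> Hashable:
    """Key on the full <name, type, class, server IP> tuple, per section 3.3."""
    return q


def server_key(q: Query) -> Hashable:
    """Key on the server IP address only, as RTT tracking traditionally does."""
    return q.server_ip


def count_entries(queries: Iterable[Query],
                  key_fn: Callable[[Query], Hashable]) -> int:
    """Number of distinct cache entries if every query in the list times out."""
    return len({key_fn(q) for q in queries})


EXAMPLES = [
    Query("example.org.", "A", "IN", "10.0.0.1"),       # (1)
    Query("example.org.", "TYPE65", "IN", "10.0.0.1"),  # (2)
    Query("example.com.", "A", "IN", "10.0.0.1"),       # (3)
]
```

Keyed on the full tuple, the three timeouts are learned independently
(three entries); keyed on the server IP address, the first timeout
covers all three queries (one entry).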

My colleagues are handling an issue where an authoritative nameserver
does not respond to TYPE65 queries, but responds to queries for common
query types such as address records. In this case, without mitigating
controls, the resolver is a little stumped and keeps attempting to
contact the upstream NS because it receives some responses from it. The
queries that get no response end up waiting for the maximum timeout
limit because the resolver keeps retrying that NS. On a busy resolver,
these queries consume resources.

We could consider the upstream NS as "bad" if it responds to some
queries but sends no response at all to others. But one-off or transient
timeouts can also occur due to network packet loss.

In our case, if the resolver were to block this zone's upstream NSs as
bad, it wouldn't be able to respond to any queries within that zone
(even address records). It appears to be a popular country-level zone,
and it's unlikely the upstream operators will fix it to respond to
TYPE65 queries in the short term. In such cases, a heavy-handed approach
may not be practical.
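
As a rough sketch of a less heavy-handed option (not what our resolver
does, and not something the draft specifies), timeouts could be tracked
per (server IP, query type) and acted on only after several consecutive
failures, so that a single lost packet does not count and address
queries to the same NS are unaffected. The class name, method names, and
threshold below are hypothetical.

```python
from collections import defaultdict


class TimeoutTracker:
    """Counts consecutive timeouts per (server IP, query type)."""

    def __init__(self, threshold: int = 3) -> None:
        # Assumption: 3 consecutive timeouts before giving up on a pair.
        self.threshold = threshold
        self._timeouts = defaultdict(int)

    def record_timeout(self, server_ip: str, qtype: str) -> None:
        self._timeouts[(server_ip, qtype)] += 1

    def record_response(self, server_ip: str, qtype: str) -> None:
        # Any response resets the counter for that pair only.
        self._timeouts.pop((server_ip, qtype), None)

    def should_skip(self, server_ip: str, qtype: str) -> bool:
        """True once this server has repeatedly ignored this query type."""
        return self._timeouts.get((server_ip, qtype), 0) >= self.threshold
```

With this keying, three unanswered TYPE65 queries to 10.0.0.1 would
cause further TYPE65 queries to that server to be skipped, while A
queries to the same address would still be sent.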

                Mukund

_______________________________________________
DNSOP mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/dnsop
