On 27. 07. 22 19:42, internet-dra...@ietf.org wrote:
A New Internet-Draft is available from the on-line Internet-Drafts directories.
This draft is a work item of the Domain Name System Operations WG of the IETF.

         Title           : Negative Caching of DNS Resolution Failures
         Authors         : Duane Wessels
                           William Carroll
                           Matthew Thomas
   Filename        : draft-ietf-dnsop-caching-resolution-failures-00.txt

I think this is an important clarification to the protocol and we should adopt it and work on it.

I like the document up until end of section 2.

After that I have reservations about the specific proposals put forth in the section 3.

I hope this will kick off discussion, please don't take points personally. I'm questioning the technical aspects.

3.  DNS Negative Caching Requirements

3.1.  Retries and Timeouts

   A resolver MUST NOT retry more than twice (i.e., three queries in
   total) before considering a server unresponsive.

   This document does not place any requirements on timeout values,
   which may be implementation- or configuration-dependent.  It is
   generally expected that typical timeout values range from 3 to 30
   seconds.

I'm curious about reasoning about this.

My motivation:
Random drop or temporarily saturated/malfunctioning link should not cause resolver to fail for several seconds. As an extreme case, think of validating resolver on a laptop forwarding elsewhere. Should really two packet drops cause it to servfail for several seconds?

Related to this, I have a principal objection:
IMHO we should NOT be inventing flow control from scratch ourselves. On the contrary - we should be borrowing prior art from existing flow control algorithms and adapt them if necessary.


3.2.  TTLs

   Resolvers MUST cache resolution failures for at least 5 seconds.
   Resolvers SHOULD employ an exponential backoff algorithm to increase
   the amount of time for subsequent resolution failures.  For example,
   the initial TTL for negatively caching a resolution failure is set to
   5 seconds.  The TTL is doubled after each retry that results in
   another resolution failure.  Consistent with [RFC2308], resolution
   failures MUST NOT be cached for longer than 5 minutes.

My motivation: Rapid recovery.

Why 5 seconds? Why not 1? Or why not 0.5 s? ... I would like to see reasoning behind specific numbers.

IMHO most problems is caused by unlimited retries and as soon as _a_ limit is in place the problem is alleviated, and with exponential backoff we should be able to start small. I'm not sure that a specific number should be mandated.


3.3.  Scope

   Resolution failures MUST be cached against the specific query tuple
   <query name, type, class, server IP address>.

Why this tuple was selected? Why not <class, zone, server IP> for, say, timeouts? Or why not <server IP> for timeouts?
What about transport protocol and its parameters? (TCP, UDP, DoT...) etc.

My motivation:
- Simplify cache management.
- Imagine an attacker attempting to misuse this new cache. The cache has to be bounded in size. It has to somehow manage overflow etc.

Generally I think this MUST is too prescriptive. It should allow for less specific caching if an implementation decides it is fit for a given type of failure and configuration, or depending on operational conditions.

--
Petr Špaček

_______________________________________________
DNSOP mailing list
DNSOP@ietf.org
https://www.ietf.org/mailman/listinfo/dnsop

Reply via email to