Hi Petr, thank you for the feedback!

> On Jul 28, 2022, at 5:06 AM, Petr Špaček <pspa...@isc.org> wrote:
> 
> On 27. 07. 22 19:42, internet-dra...@ietf.org wrote:
>> A New Internet-Draft is available from the on-line Internet-Drafts 
>> directories.
>> This draft is a work item of the Domain Name System Operations WG of the 
>> IETF.
>>        Title           : Negative Caching of DNS Resolution Failures
>>        Authors         : Duane Wessels
>>                          William Carroll
>>                          Matthew Thomas
>>        Filename        : draft-ietf-dnsop-caching-resolution-failures-00.txt
> 
> I think this is an important clarification to the protocol and we should 
> adopt it and work on it.
> 
> I like the document up until end of section 2.
> 
> After that I have reservations about the specific proposals put forth in the 
> section 3.
> 
> I hope this will kick off a discussion; please don't take the points 
> personally.  I'm questioning the technical aspects.
> 
>> 3.  DNS Negative Caching Requirements
>> 3.1.  Retries and Timeouts
>>  A resolver MUST NOT retry more than twice (i.e., three queries in
>>  total) before considering a server unresponsive.
>>  This document does not place any requirements on timeout values,
>>  which may be implementation- or configuration-dependent.  It is
>>  generally expected that typical timeout values range from 3 to 30
>>  seconds.
> 
> I'm curious about reasoning about this.
> 
> My motivation:
> Random drop or temporarily saturated/malfunctioning link should not cause 
> resolver to fail for several seconds.

This section can certainly be improved and we are open to specific suggestions. 
For example, I think we could say “MUST NOT retry a given query more than 
twice…”, i.e., tie the retry limit to the concept of scope in section 3.3.
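To make sure we’re picturing the same behavior, here is a rough sketch of the 
retry limit as I read section 3.1 (the function names and the 5-second timeout 
are illustrative, not from the draft):

```python
import socket

MAX_QUERIES = 3   # the initial query plus two retries, per section 3.1
TIMEOUT = 5.0     # illustrative; the draft only says 3 to 30 seconds is typical

def query_with_retries(send_query, server_ip):
    """Return a response, or None once the server is considered unresponsive."""
    for _attempt in range(MAX_QUERIES):
        try:
            return send_query(server_ip, timeout=TIMEOUT)
        except socket.timeout:
            continue  # retry, up to the limit
    return None  # three timeouts: the server is now considered unresponsive
```

The open question is what counts as “a given query” here, i.e., whether that 
counter is tracked per query tuple or per server.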

> As an extreme case, think of validating resolver on a laptop forwarding 
> elsewhere. Should really two packet drops cause it to servfail for several 
> seconds?

It was not our intention to say that three timeouts mark a forwarder as 
unusable for a long period of time.  Maybe there are different rules for 
forwarders vs. authoritative servers.  Or maybe scoping the limit to individual 
queries would be sufficient.


> 
> Related to this, I have a principal objection:
> IMHO we should NOT be inventing flow control from scratch ourselves. On the 
> contrary - we should be borrowing prior art from existing flow control 
> algorithms and adapt them if necessary.

Sure, I think we’re open to that if there is something appropriate we can 
reference.  Can you think of any relevant prior art?


> 
> 
>> 3.2.  TTLs
>>  Resolvers MUST cache resolution failures for at least 5 seconds.
>>  Resolvers SHOULD employ an exponential backoff algorithm to increase
>>  the amount of time for subsequent resolution failures.  For example,
>>  the initial TTL for negatively caching a resolution failure is set to
>>  5 seconds.  The TTL is doubled after each retry that results in
>>  another resolution failure.  Consistent with [RFC2308], resolution
>>  failures MUST NOT be cached for longer than 5 minutes.
> 
> My motivation: Rapid recovery.
> 
> Why 5 seconds? Why not 1? Or why not 0.5 s? ... I would like to see reasoning 
> behind specific numbers.

We put 5 seconds here simply because it feels like a reasonable amount of time 
that a person would be willing to wait for a retry, and as a starting point for 
a discussion (which we are now having — hooray!).  


> 
> IMHO most problems are caused by unlimited retries, and as soon as _a_ limit 
> is in place the problem is alleviated,

But the limit still needs to be bounded by some amount of time, right?

> and with exponential backoff we should be able to start small. I'm not sure 
> that a specific number should be mandated.

I agree that having exponential backoff would make a small initial TTL 
feasible.  Would you support a MUST requirement for exponential backoff?
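For concreteness, the backoff described in section 3.2 amounts to something 
like this (a sketch of the example numbers, not proposed normative text):

```python
INITIAL_TTL = 5    # seconds; the draft's proposed starting value
MAX_TTL = 300      # 5 minutes; the RFC 2308 ceiling

def failure_cache_ttl(consecutive_failures):
    """TTL for negatively caching a resolution failure, doubling per failure."""
    ttl = INITIAL_TTL * (2 ** (consecutive_failures - 1))
    return min(ttl, MAX_TTL)
```

With these numbers the TTL runs 5, 10, 20, 40, 80, 160 seconds and caps at 300 
on the seventh consecutive failure, so a smaller initial value would mostly 
just add a doubling step or two before hitting the cap.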

> 
> 
>> 3.3.  Scope
>>  Resolution failures MUST be cached against the specific query tuple
>>  <query name, type, class, server IP address>.
> 
> Why this tuple was selected? Why not <class, zone, server IP> for, say, 
> timeouts? Or why not <server IP> for timeouts?

This was copied from RFC 2308 (sections 7.1 and 7.2).


> What about transport protocol and its parameters? (TCP, UDP, DoT...) etc.

Yes, that is an aspect the draft hasn’t considered.  Would you like to see 
transport included in the tuple?

> 
> My motivation:
> - Simplify cache management.
> - Imagine an attacker attempting to misuse this new cache. The cache has to 
> be bounded in size. It has to somehow manage overflow etc.
> 
> Generally I think this MUST is too prescriptive. It should allow for less 
> specific caching if an implementation decides it is fit for a given type of 
> failure and configuration, or depending on operational conditions.


This is similar to points raised by Mukund.  How would you feel about something 
like this:

MUST at least cache against <server IP address>

SHOULD cache against <name, type, class, address>

MAY cache against <name, type, class, address, transport>
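To illustrate how that tiered scoping might look, here is a toy cache where the 
key tuple is whatever granularity the implementation chooses (the class and 
method names are mine, purely illustrative):

```python
import time

class FailureCache:
    """Toy negative cache for resolution failures, keyed by an arbitrary tuple."""

    def __init__(self):
        self._entries = {}  # key tuple -> absolute expiry time

    def record_failure(self, key, ttl):
        # key could be (server_ip,), (qname, qtype, qclass, server_ip),
        # or (qname, qtype, qclass, server_ip, transport)
        self._entries[key] = time.monotonic() + ttl

    def is_failed(self, key):
        expiry = self._entries.get(key)
        if expiry is None:
            return False
        if time.monotonic() >= expiry:
            del self._entries[key]  # entry expired; drop it
            return False
        return True
```

A bounded size and an eviction policy would still be needed, per your point 
about an attacker trying to fill the cache.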

DW
_______________________________________________
DNSOP mailing list
DNSOP@ietf.org
https://www.ietf.org/mailman/listinfo/dnsop
