On Tue, Oct 05, 2021 at 11:04:14AM -0700, Paul Vixie wrote:
>
>
> Frederico A C Neves wrote on 2021-10-05 09:01:
> > ...
> >
> > Anyway I think that even though the incident was not DNS related "We",
> > as the DNS community, could probably do better in future events.
> >
> > I would like to start a discussion or to hear implementers and operators
> > of Full-service resolvers on what would be the best software
> > architecture or best current configuration practice to handle a
> > traffic pattern when a very popular name enters a scenario where all
> > the auth-servers are timing-out or network unreachable.

Some BIND derivatives such as ours (I don't know if ISC BIND retained
such a patch) have a holddown timer feature which caches timeouts when
contacting upstream nameservers and backs off for a bit while the
server remains unreachable. There's also a downstream servfail cache in
the nameserver path of newer versions of BIND. Some resolvers also have
implementations of the serve-stale (TTL) RFC, and the Unbound-like
behavior of serving expired answers from cache first, even before
attempting resolution. We will be checking what the effect of this was
during the Facebook outage, at least for DNS answers from cache.

> was cache miss deduplication by q-tuple ever standardized? it is a nec'y
> part of kaminsky resistance and so ought to be part of whatever BCP corpus
> comes about. pending upstream query behaviour would be an expansion on cache
> miss dedup by q-tuple, such that a rising tide of timeouts would yield
> probabilistic prediction of servfail for cache misses aimed at the affected
> <zone,auth>.

Due to a transient bug in NIOS BIND that unfortunately entered
operations, we found out the hard way what the cost of not
de-duplicating cache-miss resolutions was. It's not just that the
resolver chokes, but it also floods upstream nameservers and becomes a
bad internet citizen. We fixed it and the following draft was written
(section 3.1):

https://datatracker.ietf.org/doc/html/draft-muks-dnsop-dns-thundering-herd-00

IIRC, when this draft was published, a reviewer mentioned that
de-duplicating upstream queries is also recommended in some other RFC
already, for different reasons.

		Mukund

>
> in 2003, i implemented this as a form of negative caching, where the
> negativity spectrum included timeouts, refuseds, and servfails -- not just
> nxdomain. this worked well but needed refinement and the implementation was
> not open-source.
> so, you and i with rodney joffe published "resimprove"
> containing some of these ideas, but it has taken some decades to get these
> accepted.
>
> i hope you succeed in this rekindling, and i would join any such effort.
> when it comes to authority dns responses to cache miss transactions, recent
> nonperformance is an excellent reason to predictively fail rather than
> packing good on top of bad. distributed state can be treated as a mass-like
> quantity, so that its inertia can be conserved at design time.
>
> vixie
>
> --
> Sent from Postbox <https://www.postbox-inc.com>
> _______________________________________________
> dns-operations mailing list
> [email protected]
> https://lists.dns-oarc.net/mailman/listinfo/dns-operations
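
For readers who want something concrete, the cache-miss de-duplication
by q-tuple discussed above can be sketched roughly as follows. This is
only a minimal illustration in Python, not code from BIND, NIOS, Unbound
or the draft. The names InflightDedup, fake_upstream and demo, and the
answer 192.0.2.1, are hypothetical and exist only for this example. It
assumes a single-process resolver where each client query is an asyncio
task, and it does no caching, holddown or timeout handling.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

QTuple = Tuple[str, str, str]  # (qname, qtype, qclass)


class InflightDedup:
    """Coalesce concurrent cache misses for one q-tuple into one upstream query."""

    def __init__(self, resolve_upstream: Callable[[QTuple], Awaitable[str]]) -> None:
        self._resolve_upstream = resolve_upstream
        self._inflight: Dict[QTuple, "asyncio.Task[str]"] = {}

    async def resolve(self, qname: str, qtype: str = "A", qclass: str = "IN") -> str:
        # Normalize the name so "Example.COM" and "example.com." share a key.
        key = (qname.lower().rstrip(".") + ".", qtype.upper(), qclass.upper())
        pending = self._inflight.get(key)
        if pending is None:
            # First miss for this q-tuple: start the only upstream query.
            pending = asyncio.create_task(self._resolve_upstream(key))
            self._inflight[key] = pending
            # Forget the entry once the query finishes, whether it succeeded or failed.
            pending.add_done_callback(lambda _t, k=key: self._inflight.pop(k, None))
        # shield() keeps one cancelled client from cancelling the shared query.
        return await asyncio.shield(pending)


async def demo() -> Tuple[List[str], int]:
    calls = 0

    async def fake_upstream(key: QTuple) -> str:
        # Hypothetical stand-in for a slow authoritative server.
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.05)
        return "192.0.2.1"

    dedup = InflightDedup(fake_upstream)
    # 100 clients ask for the same name while the first query is still pending.
    answers = await asyncio.gather(*(dedup.resolve("Example.COM") for _ in range(100)))
    return list(answers), calls
```

Running asyncio.run(demo()) sends 100 simultaneous client queries
through the wrapper while fake_upstream is invoked once. A failure of
the shared query is delivered to every waiting client, which is the
point where a holddown or servfail cache would naturally hook in.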
