Based on the discussion, we will change the schema default from 0 to 5 for now, with the understanding that this is a complex issue that could benefit from ensuring we follow the relevant RFCs, and perhaps from a configurable default in the future.
On 12/5/17, 9:01 AM, "Jeff Elsloo" <[email protected]> wrote:

I think this discussion has drifted far from Dylan's original intent, which is to set a reasonable default in the short term. We can argue about what the default is, but ultimately the real way to fix this is to ensure that we follow the RFCs.

If a resolver cannot switch to TCP, we can truncate the response and set the truncated header bit. This would occur, as Eric mentioned indirectly, when EDNS0 is unsupported. Additionally, when it is supported, the client could be asking for DNSSEC signatures, which further increases the response size. It does not make sense for a resolver to support EDNS0 and not be able to switch to TCP. We shouldn't have to worry about this scenario because in my opinion it's a misconfiguration on the other side that we cannot control, therefore we should not code for it because they are not following standards.

All of the commentary about what we should set the default to in order to ensure cache efficiency is highly site specific. Not everyone specs their caches for 18Gbps, and not everyone has the same cache to cache group ratios. While I appreciate that this change does impact cache efficiency, there are other aspects of Traffic Router that impact this setting such as `consistent.dns.routing`, which, by default, is set to false. When it's false, your answer size will be limited by the specified amount, but the entire list will be shuffled prior to setting the limit. This will kill any cache efficiency conversation unless the operator has set this value to true.

I don't believe there's a "one size fits all" answer here, and because of this we should really follow the RFCs. I think a reasonable default is a good short term solution until more time can be invested in ensuring that we are 100% compliant with this aspect of the RFCs.
Ideally the default would be a parameter or something that is configurable instead of being part of the schema, but that's an entirely different argument. I'm +1 on a reasonable default.

Here's a helpful post about when resolvers switch to TCP:
https://serverfault.com/questions/698251/how-does-the-dns-protocol-switch-from-udp-to-tcp

Thanks,
--
Thanks,
Jeff

On Tue, Dec 5, 2017 at 8:33 AM, Dave Neuman <[email protected]> wrote:
> Hey Dylan,
> I think since we currently default to 0 (all) and we don't want to
> re-invent the wheel right now, I think 5 sounds like a reasonable default.
>
> Thanks,
> Dave
>
> On Tue, Dec 5, 2017 at 8:21 AM, Durfey, Ryan <[email protected]>
> wrote:
>
>> Not sure if EDNS(0) extensions would make a difference here.
>>
>> The real issue for caching is balancing load across many caches while
>> restricting content to as few caches as possible to maintain cache
>> efficiency. Too few DNS answers risks load piling up on a few caches and
>> overrunning them (though this is unlikely except in the case of very high
>> throughput). Too many DNS answers (much more likely) spreads your
>> service’s content across too many caches and increases the cache churn and
>> risk of hitting cold caches and having poor service performance.
>>
>> I spoke with our DNS team about a year ago about EDNS(0) relative to
>> client sub-netting (ECS) and it was not embraced due to the fact that it
>> made their recursion jump by several orders of magnitude and broke the DNS
>> system. Not sure if they plan to use EDNS(0) for other things, but not
>> sure how that would factor into the load on the caches and need to spread
>> that load via additional IP responses, but please educate me if you know
>> something about this.
>>
>> In an ideal world TR monitors the popularity of a service based on
>> incoming request counts per second and potentially expands or contracts IP
>> response.
>> Given DNS caching that may be difficult to judge accurately, but
>> we may be able to use it to differentiate between a “1” and “4” response.
>> I thought I cut a request for that a while back, but I can’t find it so I
>> created a new one:
>> https://github.com/apache/incubator-trafficcontrol/issues/1614
>>
>> Ryan Durfey M | 303-524-5099
>> CDN Support (24x7): 866-405-2993 or [email protected]
>>
>>
>> From: "Eric Friedrich (efriedri)" <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Monday, December 4, 2017 at 6:18 PM
>> To: "[email protected]" <[email protected]>, "[email protected]" <[email protected]>
>> Subject: Re: Changing max_dns_answers default
>>
>> Does EDNS0 (which TR already supports) reduce the severity of this
>> problem? If so, could TR do an auto detection on if the sending resolver
>> supports EDNS0 when deciding how big to make the response?
>>
>> --Eric
>>
>> On Dec 4, 2017, at 5:31 PM, Jason Tucker <[email protected]> wrote:
>>
>> HTTP-routing seems to go to the opposite end of the spectrum - the default
>> is to use a dispersion of "1", which gives best cache efficiency as Ryan
>> mentions. I think the behavior in this regard should be somewhat similar
>> between HTTP and DNS routing.
>> __Jason
>>
>> On Mon, Dec 4, 2017 at 10:19 PM, Durfey, Ryan <[email protected]> wrote:
>>
>> I like the idea of code that makes it always under the threshold and I
>> think this is a good feature to add, but from a practical perspective we
>> always want the max dns response to be the minimum viable for cache
>> efficiency. Most of our services (95%+) should be set to 1, 2, 3, or 4
>> correlated to throughput of the service. Making the default set to as many
>> as possible ensures that unless you are paying close attention you will
>> have terrible cache efficiency.
>> I would advocate for 2 or 3 since this
>> would cover the majority of our services, keep cache efficiency reasonable,
>> and work for most other applications as well. I would also advocate to add
>> the threshold check in case someone goes too high or sets it to 0.
>>
>> *Ryan Durfey* M | 303-524-5099
>> CDN Support (24x7): 866-405-2993 or [email protected]
>>
>> *From: *Jason Tucker <[email protected]>
>> *Reply-To: *"[email protected]" <[email protected]>, "[email protected]" <[email protected]>
>> *Date: *Monday, December 4, 2017 at 3:10 PM
>> *To: *Phil Sorber <[email protected]>
>> *Cc: *"[email protected]" <[email protected]>
>> *Subject: *Re: Changing max_dns_answers default
>>
>> I can't comment on the development effort for that (or the compute /
>> latency overhead that it might add to TR), but I think having a default
>> variable that could be set per TC installation doesn't seem unreasonable.
>> __Jason
>>
>> On Mon, Dec 4, 2017 at 9:11 PM, Phil Sorber <[email protected]> wrote:
>>
>> What about adding code that would count the bytes dynamically and make
>> sure it keeps under the threshold? Maybe even make that the behavior for
>> the current default of 0.
>>
>> On Mon, Dec 4, 2017 at 2:06 PM Jason Tucker <[email protected]> wrote:
>>
>> Yes, this is the UDP thing. We've had customers with clients that sit
>> behind DNS infrastructure that has problems with large response packets.
>> However, the "max" is going to be installation dependent, though.
>> Variables such as edge hostname convention, and CDN DNS domain suffixes are going to
>> cause that threshold to vary from installation to installation. If you have
>> short FQDNs, you can fit many of them in a single UDP response.
>> __Jason
>>
>> On Mon, Dec 4, 2017 at 9:00 PM, Phil Sorber <[email protected]> wrote:
>>
>> You say it causes issues with "large cache groups". What is "large" in this
>> context? Maybe we should pick a default that puts us slightly below that.
>> Reading a little into your comment here, I assume the "problems" stem from
>> the number of answers that fit in a UDP packet. Maybe we should just make
>> the default below that threshold so we get as close to the max without
>> causing said problems?
>>
>> Thanks.
>>
>> On Mon, Dec 4, 2017 at 12:52 PM Volz, Dylan <[email protected]> wrote:
>>
>> Hi All,
>> The max_dns_answers has been defaulted to 0, which is an unlimited number
>> of answers, which causes issues for deployments with large cache groups. I
>> opened a PR (1611 <https://github.com/apache/incubator-trafficcontrol/pull/1611>) to change
>> the default from 0 to 5 which is hopefully a sensible value for most
>> deployments. If this doesn’t seem like a sensible default please respond
>> with alternatives.
>>
>> Thanks,
>> Dylan
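
For reference, the byte-counting idea Phil raises in the thread, together with Jason's point that name length moves the threshold, can be sketched roughly as below. This is only an illustration under stated assumptions, not Traffic Router code: the function names (`encoded_name_len`, `max_a_records`, `cap_answers`) are hypothetical, the response is assumed to contain only a header, one question, and A records owned by the queried name, and the classic 512-byte UDP limit from RFC 1035 is assumed (no EDNS0 OPT record, no authority or additional sections, no DNSSEC signatures). A real response would usually carry more than this, so the numbers are upper bounds.

```python
DNS_HEADER_BYTES = 12      # fixed DNS message header
CLASSIC_UDP_LIMIT = 512    # RFC 1035 UDP payload limit without EDNS0


def encoded_name_len(name: str) -> int:
    """Wire length of an uncompressed domain name (ASCII labels assumed)."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    # One length byte per label, the label bytes, and a terminating zero byte.
    return sum(1 + len(label) for label in labels) + 1


def max_a_records(qname: str, limit: int = CLASSIC_UDP_LIMIT,
                  compress: bool = True) -> int:
    """Upper bound on the A records that fit in one response of `limit` bytes."""
    question = encoded_name_len(qname) + 4          # QNAME + QTYPE + QCLASS
    # With name compression the owner name is a 2-byte pointer to the question.
    owner = 2 if compress else encoded_name_len(qname)
    per_record = owner + 10 + 4                     # fixed RR fields + IPv4 RDATA
    budget = limit - DNS_HEADER_BYTES - question
    return max(budget // per_record, 0)


def cap_answers(answers: list, qname: str, configured_max: int = 0,
                limit: int = CLASSIC_UDP_LIMIT) -> list:
    """Trim answers to what fits; a configured_max of 0 means 'as many as fit'."""
    fit = max_a_records(qname, limit)
    if configured_max > 0:
        fit = min(fit, configured_max)
    return answers[:fit]
```

For example, `cap_answers(ips, "tr.ds.cdn.example.com", configured_max=5)` returns the first five entries of `ips`, while leaving `configured_max` at 0 returns as many as the size estimate allows. Passing `compress=False` shows how much longer owner names shrink the count, which is the installation-dependent effect described in the thread.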
