On 21 Oct 2016, at 16:53, John Todd wrote:
As most of you know by now, today’s DynDNS outage due to DDoS attack
caused fairly widespread outages across a large number of domains.
Authoritative nameservers seem to be a particularly interesting target
for attackers as they are often smaller in scope (IP address range,
transit size of the authoritative server networks) than a full service
offering by a provider of multiple other services like HTTP. It seems
that there may be some reasonable ways to respond to outages like this
which at a minimum will result in failures that are less “bad”
than having no replies at all, and which can be implemented by DNS
recursors.
I’d like to propose an extension to PowerDNS Recursor for mitigating
(partially) events like we had today where major authoritative
nameservers were put out of commission. This might be a particularly
foolish or error-prone method - it only took me a few minutes to think
up. But I’d at least like to hear a discussion as to why this
isn’t a good idea. The comment of “But this might end up giving
out the wrong answer!” is true, but I view a wrong answer as better
than no answer. What would a domain operator USUALLY want to get?
They’d want to get the inbound connection, rather than having users
completely offline. This seems to be particularly valuable for TLD and
other low-churn zones which may come under attack for various
political reasons but which contain a significant number of NS
records.
Having done plenty of OSS work, I’m sure the next comment will be
“patches welcome.” ;-) I would be happy to pay some small amount
of dollars to someone to write this, but I have little budget, high
hopes, and no coders on staff at this level yet; otherwise I would do
just that.
PowerDNS Recursor proposed feature extensions:
servfail-ttl-override
* Integer
* Default: 180
The recursor keeps all records for this many seconds after TTL
expiration. If the authoritative-provided TTL has expired, a lookup
is performed for the query in the normal way. If that lookup fails with
a SERVFAIL, the TTL timer on this “old” record is set back to
zero and the “old” record is provided as the response. If an
authoritative server is marked as “down” due to repeated SERVFAIL
responses (see packetcache-servfail-ttl), the “old” record is
handed back immediately without a new query attempt, and the TTL timer
is set back to zero, keeping the answer perpetually valid as long as
queries keep arriving within the servfail-ttl-override interval and
lookups against the authoritative server keep ending in SERVFAIL.
(packetcache-servfail-ttl is on a rotating timer and will retry every
X seconds, so a single query is delayed during each attempt cycle
while other queries are immediately answered with the “old”
record.) An NXDOMAIN response from an authoritative server clears
“old” records from memory immediately.
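
To make the intended behavior concrete, here is a minimal Python
sketch of the lookup logic described above. It is not PowerDNS code:
StaleCache, Entry, Rcode, and the upstream callable are hypothetical
names invented for illustration, time is passed in explicitly, and the
“marked as down” fast path is left out to keep the sketch short.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Dict, Optional, Tuple


class Rcode(Enum):
    NOERROR = auto()
    NXDOMAIN = auto()
    SERVFAIL = auto()


@dataclass
class Entry:
    value: str
    ttl: int          # TTL supplied by the authoritative server, in seconds
    stored_at: float  # when the record was stored or last refreshed


# Hypothetical upstream lookup: name -> (rcode, value, ttl)
Upstream = Callable[[str], Tuple[Rcode, str, int]]


class StaleCache:
    def __init__(self, upstream: Upstream, override: int = 180) -> None:
        self.upstream = upstream
        self.override = override  # stands in for servfail-ttl-override
        self.entries: Dict[str, Entry] = {}

    def resolve(self, name: str, now: float) -> Optional[str]:
        entry = self.entries.get(name)
        if entry is not None:
            age = now - entry.stored_at
            if age < entry.ttl:
                return entry.value  # still within the original TTL
            if age >= entry.ttl + self.override:
                # Past the grace window: the "old" record is gone for good.
                del self.entries[name]
                entry = None

        rcode, value, ttl = self.upstream(name)
        if rcode is Rcode.NOERROR:
            self.entries[name] = Entry(value, ttl, now)
            return value
        if rcode is Rcode.NXDOMAIN:
            # An explicit NXDOMAIN clears any "old" record immediately.
            self.entries.pop(name, None)
            return None
        # SERVFAIL: fall back to the "old" record and reset its timer.
        if entry is not None:
            entry.stored_at = now
            return entry.value
        return None
```

A caller would construct StaleCache with its own lookup function and
call resolve(name, now) per query; a return value of None stands for
“no answer available” (NXDOMAIN or SERVFAIL with nothing cached).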
This timer method is useful in situations where authoritative
nameservers are being DDoS’ed and cannot provide responses, with the
intent that some answer is better than no answer. If a domain operator
wishes to stop traffic to their site, replying with NXDOMAIN
negates this behavior. Only an unreachable nameserver will result
in this cache being used as a last resort, and a timer caps how long
these old records are kept. Setting this value low means that
lookups for heavily trafficked websites will typically still get an
answer even if the authoritative nameservers are unreachable due to
attack or network disconnect, but less often-queried domains may
drop out of the cache, leading to query failures. Setting this value
high may lead to unexpected results for infrequently used domains
which have dynamic results.
servfail-ttl-override-domain-exceptions
* Domains, comma separated
List of domains on which we never use the servfail-ttl-override method.
servfail-ttl-override-server-exceptions
* IP addresses, comma separated
List of authoritative servers on which we never use the
servfail-ttl-override method.
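
As a rough illustration of how the two exception lists might be
checked, here is a small Python sketch. The function names are
hypothetical, and it assumes (the proposal does not say) that a listed
domain also covers its subdomains and that servers are matched by
exact IP address.

```python
import ipaddress
from typing import List


def parse_list(raw: str) -> List[str]:
    """Split a comma-separated setting value, dropping empty items."""
    return [item.strip() for item in raw.split(",") if item.strip()]


def in_domain(qname: str, domain: str) -> bool:
    """True if qname equals domain or sits below it (label boundary match)."""
    q = qname.rstrip(".").lower()
    d = domain.rstrip(".").lower()
    return q == d or q.endswith("." + d)


def override_allowed(qname: str, server_ip: str,
                     domain_exceptions: str, server_exceptions: str) -> bool:
    """Decide whether an "old" record may be served for this query."""
    if any(in_domain(qname, d) for d in parse_list(domain_exceptions)):
        return False
    server = ipaddress.ip_address(server_ip)
    # Compare parsed addresses so equivalent IPv6 spellings still match.
    return all(server != ipaddress.ip_address(s)
               for s in parse_list(server_exceptions))
```

For example, with "example.org, dyn.example.net" as the domain
exceptions, a query for host.dyn.example.net would never be answered
from an “old” record, while www.example.com still could be.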
JT
After some thought in the shower this morning, I think I need to update
my original proposal. Instead of the refreshed timer being the TTL of
the original record, the new TTL should be set to be
packetcache-servfail-ttl. This means that a refreshed record will only
stay in the cache as long as the authoritative server is unreachable.
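
The difference between the two variants can be shown with a tiny
Python sketch. The function name and the numbers are illustrative
only: a one-day original TTL and a 60-second packetcache-servfail-ttl
are assumed here purely as example values.

```python
def next_retry_time(now: float, original_ttl: int,
                    servfail_ttl: int, revised: bool = True) -> float:
    """When a refreshed "old" record expires again and upstream is retried."""
    # Revised proposal: refresh with packetcache-servfail-ttl.
    # Original proposal: refresh with the record's original TTL.
    lifetime = servfail_ttl if revised else original_ttl
    return now + lifetime


# Record refreshed at t=1000 with original TTL 86400s, servfail TTL 60s.
revised_retry = next_retry_time(1000, 86400, 60)                  # 1060
original_retry = next_retry_time(1000, 86400, 60, revised=False)  # 87400
```

Under the revised rule the recursor goes back to the authoritative
server after a minute instead of after a day, so a recovered server
is noticed quickly.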
JT
_______________________________________________
Pdns-dev mailing list
Pdns-dev@mailman.powerdns.com
https://mailman.powerdns.com/mailman/listinfo/pdns-dev