What can happen (and this is really subtle) is that if named happens to
randomly select a source port on which intermediate firewalls or filters are
silently dropping either the SOA refresh queries or the responses, then named
can 'get stuck' using and re-using that same refresh source port.

It doesn't happen if the attempt to send using that port fails outright, or
if an ICMP error comes back, is passed up to the socket, and reaches BIND as
a receive error.

The reason for this, as mentioned above, is subtle: for SOA refresh queries
(but not for regular recursion), when named needs to send a query, it first
looks to see whether one is already in progress to the same destination, and
if so, it piggy-backs on the same internal structures and socket.  (This was
done for performance reasons, to handle occasions where there is a very high
spike in zone refreshes.)

Thus, if there's a pending SOA refresh query already for another zone,
named will send the new SOA refresh query using the same source port.

I think you can see where this is going?

Where you have a slave with a large number of zones that update frequently,
and it has had the misfortune to pick a 'broken' port at random, then nearly
all the time there will be one or more pending SOA refreshes waiting to time
out, and any new ones that come along will suffer the same fate... hanging
around and causing the next new one to do the same...

The solution is to identify that this is happening, figure out which
ranges of UDP ports named should or shouldn't use, and then configure
them in named.conf.  You have these options available to you:

[ use-v4-udp-ports { port_list }; ]
[ avoid-v4-udp-ports { port_list }; ]
[ use-v6-udp-ports { port_list }; ]
[ avoid-v6-udp-ports { port_list }; ]
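
For example, suppose your packet captures show that queries sent from UDP
source ports in the range 5300-5399 never get their answers back.  A minimal
sketch of the blacklist approach (the port range here is purely illustrative;
substitute the range you actually found to be dropped):

options {
    ...
    // keep named from choosing these as query source ports
    avoid-v4-udp-ports { range 5300 5399; };
    avoid-v6-udp-ports { range 5300 5399; };
    ...
};

Or, if you know a range that is passed cleanly end-to-end, you can take the
whitelist approach instead and restrict named to it:

options {
    ...
    use-v4-udp-ports { range 10000 20000; };
    use-v6-udp-ports { range 10000 20000; };
    ...
};

Either way, keep an eye on the refresh logs afterwards to confirm the
retry/timeout pattern has gone away.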

Hope this helps?

Cathy


On 04/06/2015 19:36, Irwin Tillman wrote:
> Apologies in advance for this lengthy description.
> 
> Since I made a configuration change a few weeks ago, every 1-3 days, my
> BIND 9.9.7 server experiences several hours of retry/timeout failures while
> performing UDP-based SOA serial number queries (zone refresh).
> 
> My server acts like it doesn't hear master servers' responses to my server's
> UDP-based SOA queries.  For each UDP-based SOA query that times out after
> several retries, BIND falls back to using TCP for the query; it does "hear"
> that TCP-based answer.
> 
> This affects all the zones for which my server acts as a slave.
> 
> For example, focusing on just a single slave zone, my server logs (names and 
> addresses anonymized):
> 
> May 30 14:56:23 foo.example.com named[16217]: general: info: zone 
> 1.10.in-addr.arpa/IN/default: refresh: retry limit for master 192.168.1.2#53 
> exceeded (source 0.0.0.0#0)
> May 30 14:56:23 foo.example.com named[16217]: general: info: zone 
> 1.10.in-addr.arpa/IN/default: Transfer started.
> May 30 14:56:23 foo.example.com named[16217]: xfer-in: info: transfer of 
> '1.10.in-addr.arpa/IN/default' from 192.168.1.2#53: connected using 
> 172.16.3.4#63325
> May 30 14:56:23 foo.example.com named[16217]: general: info: zone 
> 1.10.in-addr.arpa/IN/default: transferred serial 11707064: TSIG 
> 'key-foo-bar.example.com'
> May 30 14:56:23 foo.example.com named[16217]: xfer-in: info: transfer of 
> '1.10.in-addr.arpa/IN/default' from 192.168.1.2#53: Transfer completed: 1 
> messages, 52 records, 9065 bytes, 0.146 secs (62089 bytes/sec)
> 
> --
> 
> Each time my server gets into this state:
> 
> * It affects all the slave zones my server is trying to refresh.
> It doesn't affect only some of the zones due for refresh.
> 
> Throughout the incident, the messages above are logged for all those slave
> zones, not just selected ones.
> 
> * It's not confined to the slave zones my server refreshes from particular
> masters.
> Some of the masters my server pulls from are the source of dozens of zones.
> Some masters are the source of just one or two zones I pull.
> Some of the masters are local, while some are across the Internet.
> The issue affects refresh checks to all of the masters; it's not specific to 
> just some of the masters from which we pull.
> 
> * Each time this happens, it lasts for an extended period. 
> I've seen this range from 3 hours to 13 hours.
> The problem seems to appear and then vanish "on its own."
> 
> * It generally happens every one to three days.
> 
> * While it's happening, it doesn't affect just "some" zone refresh UDP-based 
> SOA query/responses.
> It affects *all* of them.
> 
> * While it's happening, it doesn't affect zone refresh TCP-based SOA 
> query/responses.
> Those proceed fine.
> 
> * While it's happening, it doesn't affect handling of incoming NOTIFY 
> messages.
> My server continues to see incoming NOTIFY messages (UDP).
> 
> If the message is from an appropriate master and specifies a serial number
> higher than my server has, it correctly triggers my server to pull an updated 
> zone.
> If the serial number is not higher, my server does not pull an updated zone.
> If the message is not from an appropriate master, my server ignores it.
> 
> * While this is happening, and my server is in the middle of a UDP-based SOA 
> query retry/timeout
> for a zone, my server receives NOTIFY messages from various legitimate 
> masters for the zone.
> My server (reasonably) logs:
> 
> May 30 14:55:48 foo.example.com named[16217]: notify: info: client 
> 192.168.1.3#62867/key key-foo-bar.example.com: view default: received notify 
> for zone '1.10.IN-ADDR.ARPA': TSIG 'key-foo-bar.example.com'
> May 30 14:55:48 foo.example.com named[16217]: general: info: zone 
> 1.10.in-addr.arpa/IN/default: notify from 192.168.1.3#62867: refresh in 
> progress, refresh check queued
> 
> --
> 
> ./lib/dns/zone.c refresh_callback() shows that the "refresh: retry limit for 
> master .. exceeded"
> message above is logged after the server has tried to query a master for the 
> SOA (using EDNS) over UDP
> and failed (after retries).  After logging this message, the code goes on to 
> try using TCP.
> The messages logged above imply that this succeeded.
> 
> During one of these events, I performed packet capture to observe
> the refresh activity.  The packet capture shows my server sending the SOA 
> query
> to the master using UDP, and the master replying a fraction of a second later.
> The master's reply looks good to me.
> 
> After a timeout, this repeats (as if the server didn't receive the master's 
> reply, or didn't like it).
> This happens several times, then the server logs the "refresh: retry limit 
> for master ... exceeded" message above.
> Then my server tries the SOA query over TCP, gets a reply, and now appears
> satisfied.
> 
> --
> 
> While my server is in this state (for some hours at a time),
> I observe it is otherwise receiving DNS queries from clients and sending
> responses to clients normally, both over UDP and TCP.
> 
> --
> 
> It seems my server is able to send and receive using both UDP and TCP during 
> these events.
> It's clearly hearing *some* of the incoming UDP-based DNS messages.
> 
> It's only acting like it doesn't hear (or doesn't like) the incoming 
> UDP-based 
> SOA responses to its UDP-based SOA queries.
> 
> It's not apparent to me what causes the server to enter this state,
> or to return to normal some hours later.
> 
> --
> 
> Platform:
> 
> BIND 9.9.7 (also verified it happens in 9.9.6-P1)
> Solaris 10 on SPARC 
> authoritative-only server
> 119 zones (42 master, 77 slave)
> 1 view
> 
> 
> RRL is in use.
> All the masters my server pulls from are exempt from RRL:
> 
> acl masters-we-pull-from {
>    192.168.1.2; 192.168.1.3; ...
> };
> 
> acl clients-exempt-from-rate-limiting {
>    ...
>    masters-we-pull-from;
>    ...
> };
> 
> ...
> 
> options {
>    ...
>    rate-limit { clients-exempt-from-rate-limiting; };
>    ...
> };
> 
> RRL logs each time it rate-limits any netblock.
> It is not logging any rate-limiting being enforced on masters my server pulls 
> from.
> 
> 
> --
> 
> This is a DNS server that's been running for years without this issue.
> 
> The problem began three days after I made a configuration change to my server.
> As that was the only recent change, I imagine this configuration change might 
> be what triggered the
> sporadic problematic behavior. (I cannot test my guess by temporarily backing 
> out that change; 
> a customer is relying on that configuration change.)
> 
> My configuration change was:
> 
> My old default was NOT to request-ixfr, requesting IXFR only for specific
> slave zones.
> My new default is to request-ixfr, disabling request-ixfr only for one
> specific master server.
> 
> So on my server I changed:
> 
> In global "options":
> 
> Comment out:
>    request-ixfr no;
> Add:
>    max-journal-size 100M;
> 
> For one master server I pull slave zones from, add to its "server" statement:
>    request-ixfr no;
> 
> For several zones I am a slave for, add to each "zone" statement
> (to document my intent that this zone attempt IXFR regardless of
> the global setting):
>     request-ixfr yes;
> 
> I needed to make this change because some of the zones I slave for change so
> often, and are so large, that using AXFR caused my server to lag well behind
> the zones' other servers.
> 
> --
> 
> I'll mention (without knowing whether it is relevant) that:
> 
> * my server receives lots of NOTIFY messages it refuses because they come 
> from non-masters
> 
> * because some of the zones my server slaves for are changed very often,
> my server performs IXFR very often.  
> 
> 
> --
> 
> Any ideas?
> 
> Irwin Tillman
> OIT Networking & Monitoring Systems, Princeton University
> 

