> On our production name servers we have check every 30s if bind
> is alive by sending a SOA query to bind. Today I upgraded a few
> nodes from 9.18.x (x between 17 and 27) to 9.20.1 (Ubuntu 24.04
> with packages from ISC ppa).
>
> Since that, we have sporadic timeouts (3s). On the nodes with
> more qps we see it more often.
>
> Before I dig into the problem, are there any specific changes
> to 9.20 that I should look at? Maybe some default value changes
> for socket buffers, thread handling ...?

I can't answer specifically about BIND 9.20; I'm currently
dipping my toes carefully into the waters of "deploying BIND 9.20
as a recursor".

What you don't say anything about is whether you see increased
CPU load on your hosts, and whether the relationship between QPS
and CPU load has changed after upgrading to 9.20.  Also, what
general level of load do you observe on this / these host(s)?
E.g. "how close to the limit of what it can do" are you?


In our deployment, we monitor the relationship between the number
of "udp: dropped due to full socket buffers" and "udp: datagrams
received" (in our case via collectd / graphite / grafana), and
when we started doing that we found out that we needed to bump
the default UDP socket buffers quite a bit to get that event rate
to go down to acceptable levels.  Regrettably, as far as I know,
BIND does not have a knob to adjust the socket buffer size for
the UDP sockets BIND itself uses, so what I ended up doing was
bumping the default for UDP sockets for the entire host via
sysctl.  In my case that's "fine" because the host is basically
only serving this single function.
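
One way to confirm that a host-wide sysctl change actually took
effect is to ask the kernel what buffer sizes a freshly created
UDP socket gets.  The following is only a minimal sketch: the
function name is made up for illustration, and it reports the
defaults for a new socket, not what a running named process has
on its own sockets.

```python
import socket


def default_udp_buffer_sizes():
    """Return (receive, send) buffer sizes in bytes, as reported by
    the kernel for a newly created UDP socket on this host."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        rcv = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
        snd = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    return rcv, snd


rcv, snd = default_udp_buffer_sizes()
print(f"default UDP receive buffer: {rcv} bytes")
print(f"default UDP send buffer:    {snd} bytes")
```

Running this before and after the sysctl change should show
whether the new default is being picked up.  Note that the value
reported may not be exactly the number you configured, since
some kernels adjust it.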

Then again, I'm the weirdo running BIND on NetBSD, so the
defaults are probably widely different in your case.

Just an example from one of our publishing (non-recursive) BIND
servers, from "netstat -s" output:

udp:
        1669688117 datagrams received
        0 with incomplete header
        10 with bad data length field
        994 with bad checksum
        10922 dropped due to no socket
        874709 broadcast/multicast datagrams dropped due to no socket
        890955 dropped due to full socket buffers
        1667910527 delivered
        2741883224 PCB hash misses
        1632037948 datagrams output

which comes out to 0.05% as an overall average "drops due to full
socket buffers", but that doesn't mean there aren't occasional
(smallish) spikes in the rate, of course.  And this is with BIND
9.18.29.
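
For reference, the percentage above can be reproduced from the
"netstat -s" text with a few lines of Python.  This is a rough
sketch, not what our collectd setup does: the function name is
hypothetical, it assumes only the "udp:" section is passed in,
and it assumes the counter labels are worded exactly as in the
NetBSD output shown here (other systems word them differently).

```python
import re

# The udp: section quoted above, verbatim.
SAMPLE = """\
udp:
        1669688117 datagrams received
        0 with incomplete header
        10 with bad data length field
        994 with bad checksum
        10922 dropped due to no socket
        874709 broadcast/multicast datagrams dropped due to no socket
        890955 dropped due to full socket buffers
        1667910527 delivered
        2741883224 PCB hash misses
        1632037948 datagrams output
"""


def udp_drop_percent(udp_section):
    """Percentage of received datagrams dropped due to full socket
    buffers, computed from the udp: section of netstat -s."""
    counters = {}
    for line in udp_section.splitlines():
        m = re.match(r"\s*(\d+)\s+(.+?)\s*$", line)
        if m:
            # Key each counter by its exact label text.
            counters[m.group(2)] = int(m.group(1))
    received = counters["datagrams received"]
    dropped = counters["dropped due to full socket buffers"]
    return 100.0 * dropped / received


print(f"{udp_drop_percent(SAMPLE):.2f}%")
```

Since these are counters accumulated since boot, spotting spikes
requires sampling them periodically and computing the ratio of
the differences between samples, rather than of the totals.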

In other words: I think more information is needed to help you
diagnose the issue.


Regards,

- Håvard
-- 
Visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe from 
this list

ISC funds the development of this software with paid support subscriptions. 
Contact us at https://www.isc.org/contact/ for more information.