Hi, at work I manage a multi-node anycast DNS resolver cluster running BIND, with NetBSD/amd64 10.0 as the platform, and I use exabgp to announce the cluster's service address once a simple sanity check of the local DNS recursor succeeds. This mostly works quite well.
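For context, the exabgp health-check glue is conceptually along these lines (a simplified sketch, not the actual script; the service address and the test query are placeholders):

```shell
#!/bin/sh
# Simplified sketch of an exabgp "process" health check: while the local
# recursor answers a simple query, keep the anycast route announced; on
# failure, withdraw it.  192.0.2.53/32 and the query name are placeholders.
SERVICE="192.0.2.53/32"

while :; do
    # drill ships in NetBSD base (from ldns); any resolver check would do here
    if drill @127.0.0.1 nlnetlabs.nl A >/dev/null 2>&1; then
        echo "announce route ${SERVICE} next-hop self"
    else
        echo "withdraw route ${SERVICE} next-hop self"
    fi
    sleep 10
done
```

exabgp reads these announce/withdraw commands from the process's stdout, so the script just loops forever while exabgp keeps it running.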
A few months ago I decided to crank up the level of TCP-based queries by implementing RFC 9462, i.e. publishing SVCB records for _dns.resolver.arpa pointing to the DNS-over-TLS and DNS-over-HTTPS service endpoints I have configured BIND to serve. It is evident that quite a number of clients then switch over to doing either DoT or DoH instead of plain old DNS over port 53. The peak daily DNS query volume on this particular node is around 2,500 qps, while the TCP-based volume peaks at around 700-800 qps.

However, after doing this, I have noticed that BIND occasionally decides to crash and dump core. "Occasionally" is to be taken literally -- sometimes there can be 4 weeks or more between incidents. I have set up my system and chroot to collect core dumps of set-id programs (such as BIND), and have a recent core dump.

Now, ISC doesn't officially support NetBSD as an OS for BIND, but they have been quite helpful in following up my bug report, especially given the official support status of the OS. Since I do not think anyone else is observing this problem, I have been pointed in the direction of building and running BIND with a thread sanitizer, because part of what we see is that a "magic" field in a struct gets a value it's not supposed to have, and the crash actually happens inside libpthread, ref:

(gdb) where
#0  0x00007c3a2d009d40 in ?? () from /usr/lib/libpthread.so.1
#1  0x00007c3a2d00a360 in pthread_mutex_unlock () from /usr/lib/libpthread.so.1
#2  0x00007c3a31837013 in clean_finds_at_name (name=0x7c39ef757f90,
    astat=astat@entry=DNS_ADB_MOREADDRESSES, addrs=addrs@entry=2) at adb.c:950
...

and

(gdb) i reg rbx
rbx            0x6720f75a          1730213722
(gdb) i reg rip
rip            0x7c3a2d009d40      0x7c3a2d009d40
(gdb) x/i 0x7c3a2d009d40
=> 0x7c3a2d009d40:      mov    (%rbx),%r15
(gdb) x/x 0x6720f75a
0x6720f75a:     Cannot access memory at address 0x6720f75a
(gdb)

so it gets a SEGV dereferencing a bogus pointer.
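For completeness, the DDR records I publish in the locally served resolver.arpa zone look roughly like this (the target name, TTL and DoH path are placeholders here, not my real configuration):

```
; RFC 9462 "DDR" discovery records, served by the resolver itself
_dns.resolver.arpa. 7200 IN SVCB 1 resolver.example.net. alpn=dot port=853
_dns.resolver.arpa. 7200 IN SVCB 2 resolver.example.net. alpn=h2 dohpath=/dns-query{?dns}
```

Lower SvcPriority wins, so this prefers DoT over DoH; clients that understand neither stay on port 53.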
The issue has been observed both with recent-ish versions of BIND 9.18.x and now also with BIND 9.20.2, the former built directly from source, the latter from pkgsrc-wip. On this platform that means gcc 10.5.0 ("nb3") is the default compiler.

I think I'm able to read the gcc man page, which points towards -fsanitize=thread, and notes that it cannot be combined with either the "address" or "leak" sanitizers. Googling a bit brings me to https://wiki.netbsd.org/users/kamil/sanitizers/ which is now some 5 years old and does not list the thread sanitizer status for gcc. I also have the impression that there is often some rather finicky platform- and/or OS-specific work required to make these features work properly, and I am uncertain of the implementation status for "us".

So ... can anyone tell me the current implementation status of these features on NetBSD/amd64 10.0, so that I don't expend a lot of energy following a path which turns out to be a dead end? E.g. do I stand a better chance of getting the thread sanitizer working if I build BIND with clang instead of gcc?

Then there is the question of how to run a program which normally daemonizes under any of the sanitizers. I am guessing the error messages from the sanitizers end up on stderr, so BIND needs to be run in the foreground, and I should arrange to capture any output? And should I run the executable under gdb, or is that not required, recommended, or purely optional? Since the problem surfaces so seldom, it would be a benefit to get all of this right on the first go-around.

Also ... what should I expect in terms of CPU and memory consumption with any of these features enabled? Some of this appears to be answered by https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual but again, the implementation status there is as of 2018, and it is based on clang/LLVM rather than gcc.
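For concreteness, what I had in mind trying is roughly the following -- the compiler choice, flags and paths are exactly the parts I am unsure about, so treat this as a sketch rather than something I know to work on NetBSD:

```shell
# Sketch: build BIND with TSan instrumentation (autotools build, as in 9.20.x).
# -O1/-O2 plus -g keeps the reports readable; swap CC=gcc for CC=clang if that
# turns out to be the better-supported toolchain on NetBSD/amd64 10.0.
cd bind-9.20.2
./configure CC=clang \
    CFLAGS="-O1 -g -fsanitize=thread" \
    LDFLAGS="-fsanitize=thread"
make

# Run named in the foreground: -g stays in the foreground and forces all
# logging to stderr.  TSan reports also go to stderr unless log_path is set,
# in which case they land in log_path.<pid>.
TSAN_OPTIONS="log_path=/var/tmp/named.tsan halt_on_error=0" \
    ./bin/named/named -g -c /etc/namedb/named.conf 2>&1 | tee /var/tmp/named.out
```

My understanding is that gdb is not needed for TSan itself (the reports come with their own stack traces), but please correct me if running under gdb buys anything extra here.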
A 2-20x increase in CPU time I could possibly cope with (especially at the lower end of that range), but 5-10x memory will be harder with the currently deployed hardware (we can cope with upgrading, but it will take a while).

Many thanks in advance for any good up-to-date hints about this.

Best regards,

- Håvard