Hi, at work I manage a multi-node anycast DNS resolver cluster running BIND, with NetBSD/amd64 10.0 as the platform, and I use exabgp to announce the cluster's service address once a simple sanity check of the local DNS recursor succeeds. This mostly works quite well.
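For context, the exabgp health-check glue is conceptually along these lines (a simplified sketch, not the actual script; the service address and the test query are placeholders):

```shell
#!/bin/sh
# Simplified sketch of an exabgp "process" health check: while the local
# recursor answers a simple query, keep the anycast route announced; on
# failure, withdraw it.  192.0.2.53/32 and the query name are placeholders.
SERVICE="192.0.2.53/32"

while :; do
    # drill ships in NetBSD base (from ldns); any resolver check would do here
    if drill @127.0.0.1 nlnetlabs.nl A >/dev/null 2>&1; then
        echo "announce route ${SERVICE} next-hop self"
    else
        echo "withdraw route ${SERVICE} next-hop self"
    fi
    sleep 10
done
```

exabgp reads these announce/withdraw commands from the process's stdout, so the script just loops forever while exabgp keeps it running.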
A few months ago I decided to crank up the level of TCP-based queries by implementing RFC 9462, i.e. publishing SVCB records for _dns.resolver.arpa pointing to the DNS-over-TLS and DNS-over-HTTPS service endpoints I have configured BIND to serve. It is evident that quite a number of clients then switch over to doing either DoT or DoH instead of plain old DNS over port 53. The peak daily DNS query volume on this particular node is around 2,500 qps, while the TCP-based volume peaks at around 700-800 qps.

However, after doing this, I have noticed that BIND occasionally decides to crash and dump core. "Occasionally" is to be taken literally -- sometimes there can be 4 weeks or more between incidents. I have set up my system and chroot to collect core dumps of set-id programs (such as BIND), and have a recent core dump.

Now, ISC doesn't officially support NetBSD as an OS for BIND, but they have been quite helpful in following up my bug report, especially given the official support status of the OS. Since I do not think anyone else is observing this problem, I have been pointed in the direction of building and running BIND with a thread sanitizer, because part of what we see is that a "magic" field in a struct gets a value it's not supposed to have, and the crash actually happens inside libpthread, ref:

(gdb) where
#0  0x00007c3a2d009d40 in ?? () from /usr/lib/libpthread.so.1
#1  0x00007c3a2d00a360 in pthread_mutex_unlock () from /usr/lib/libpthread.so.1
#2  0x00007c3a31837013 in clean_finds_at_name (name=0x7c39ef757f90,
    astat=astat@entry=DNS_ADB_MOREADDRESSES, addrs=addrs@entry=2) at adb.c:950
...

and

(gdb) i reg rbx
rbx            0x6720f75a          1730213722
(gdb) i reg rip
rip            0x7c3a2d009d40      0x7c3a2d009d40
(gdb) x/i 0x7c3a2d009d40
=> 0x7c3a2d009d40:      mov    (%rbx),%r15
(gdb) x/x 0x6720f75a
0x6720f75a:     Cannot access memory at address 0x6720f75a
(gdb)

so it gets a SEGV dereferencing a bogus pointer.
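For completeness, the DDR records I publish in the locally served resolver.arpa zone look roughly like this (the target name, TTL and DoH path are placeholders here, not my real configuration):

```
; RFC 9462 "DDR" discovery records, served by the resolver itself
_dns.resolver.arpa. 7200 IN SVCB 1 resolver.example.net. alpn=dot port=853
_dns.resolver.arpa. 7200 IN SVCB 2 resolver.example.net. alpn=h2 dohpath=/dns-query{?dns}
```

Lower SvcPriority wins, so this prefers DoT over DoH; clients that understand neither stay on port 53.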
The issue has been observed both with recent-ish versions of BIND 9.18.x and now also with BIND 9.20.2, the former built directly from source, the latter from pkgsrc-wip. On this platform that means gcc 10.5.0 ("nb3") is the default compiler.

I think I'm able to read the gcc man page, which points towards -fsanitize=thread, and notes that it cannot be combined with either the "address" or "leak" sanitizers. Googling a bit brings me to https://wiki.netbsd.org/users/kamil/sanitizers/ which is now some 5 years old and does not list the thread sanitizer status for gcc. I also have the impression that there is often some rather finicky platform- and/or OS-specific work required to make these features work properly, and I am uncertain of the implementation status for "us".

So ... can anyone tell me the current implementation status of these features on NetBSD/amd64 10.0, so that I don't expend a lot of energy following a path which turns out to be a dead end? E.g. do I stand a better chance of getting the thread sanitizer working if I build BIND with clang instead of gcc?

Then there is the question of how to run a program which normally daemonizes under any of the sanitizers. I am guessing the error messages from the sanitizers end up on stderr, so BIND needs to be run in the foreground, and I should arrange to capture any output? And should I run the executable under gdb, or is that not required, recommended, or purely optional? Since the problem surfaces so seldom, it would be a benefit to get all of this right on the first go-around.

Also ... what should I expect in terms of CPU and memory consumption with any of these features enabled? Some of this appears to be answered by https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual but again, the implementation status there is as of 2018, and it is based on clang/LLVM rather than gcc.
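For concreteness, what I had in mind trying is roughly the following -- the compiler choice, flags and paths are exactly the parts I am unsure about, so treat this as a sketch rather than something I know to work on NetBSD:

```shell
# Sketch: build BIND with TSan instrumentation (autotools build, as in 9.20.x).
# -O1/-O2 plus -g keeps the reports readable; swap CC=gcc for CC=clang if that
# turns out to be the better-supported toolchain on NetBSD/amd64 10.0.
cd bind-9.20.2
./configure CC=clang \
    CFLAGS="-O1 -g -fsanitize=thread" \
    LDFLAGS="-fsanitize=thread"
make

# Run named in the foreground: -g stays in the foreground and forces all
# logging to stderr.  TSan reports also go to stderr unless log_path is set,
# in which case they land in log_path.<pid>.
TSAN_OPTIONS="log_path=/var/tmp/named.tsan halt_on_error=0" \
    ./bin/named/named -g -c /etc/namedb/named.conf 2>&1 | tee /var/tmp/named.out
```

My understanding is that gdb is not needed for TSan itself (the reports come with their own stack traces), but please correct me if running under gdb buys anything extra here.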
A 2-20x increase in CPU time I could possibly cope with (especially at the lower end of that range), but 5-10x memory will be harder with the currently deployed hardware (we can cope with upgrading, but it will take a while).

Many thanks in advance for any good up-to-date hints about this.

Best regards,

- Håvard