On 8/08/2019 4:46 pm, Remi Gacogne wrote:
That's actually one of the most readable configurations I have seen in a
while, don't worry ;-)
Good to know, I still have some work to do on it to make it more friendly.
That's very weird, I don't see anything unusual in your configuration,
the backtrace seems to indicate that all threads are working as
expected, and I even see some UDP queries being received and forwarded
in the strace (albeit very few, you can spot them easily by looking for
"recvmsg resumed" with grep).
That is strange. When the issue occurs it receives minimal traffic
apart from the health checking service that controls the IPs being
announced with BGP.
I just noticed there is even stranger behavior. I restarted the
dnsdist instance and sent traffic to it to reproduce the issue. While
it was working I made an 'ANY' query for google.com. Once the issue
occurred I could still send that query and get an answer (both with UDP
and TCP). Queries for things that were not in the cache I guess is what
Would you mind providing a 'lsof -n -p <pid of dnsdist>' while it's
The lsof output is available here:
Would you by any chance be able to do a strace when it's stuck,
while at the same time sending a few UDP queries to it, ideally with an
easily recognizable qname like "why-is-dnsdist-not-responding.to.this." ?
The stack trace is available here:
During the stack trace I performed 4 requests (in order):
- UDP A request for why-is-dnsdist-not-responding.to.this. (not working)
- TCP A request for why-is-dnsdist-not-responding.to.this. (working)
- UDP ANY request for google.com (working)
- UDP A request for google.com (not working)
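
The four requests above could be replayed with something like the sketch
below. This is not the tool that was used for the test, only an
illustration built on the Python standard library. The server address
192.0.2.53 is a placeholder, and the helper names (build_query, send_udp,
send_tcp) are made up for this example. A reply of None means nothing came
back within the timeout.

```python
import socket
import struct

QTYPES = {"A": 1, "ANY": 255}


def build_query(qname, qtype, query_id=0x1234):
    """Encode a minimal DNS query: 12-byte header plus one question."""
    # Flags 0x0100 sets only RD (recursion desired); QDCOUNT is 1.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = [part.encode("ascii") for part in qname.rstrip(".").split(".") if part]
    encoded = b"".join(bytes([len(part)]) + part for part in labels) + b"\x00"
    # Question ends with QTYPE and QCLASS (1 = IN).
    return header + encoded + struct.pack("!HH", QTYPES[qtype], 1)


def send_udp(server, port, packet, timeout=2.0):
    """Send one query over UDP; return the raw reply or None on timeout/error."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (server, port))
            data, _ = sock.recvfrom(4096)
            return data
        except OSError:  # includes socket.timeout
            return None


def _recv_exact(sock, count):
    """Read exactly count bytes from a stream socket, or None if it closes early."""
    chunks = b""
    while len(chunks) < count:
        chunk = sock.recv(count - len(chunks))
        if not chunk:
            return None
        chunks += chunk
    return chunks


def send_tcp(server, port, packet, timeout=2.0):
    """Send one query over TCP (2-byte length prefix); return the reply or None."""
    try:
        with socket.create_connection((server, port), timeout=timeout) as sock:
            sock.sendall(struct.pack("!H", len(packet)) + packet)
            prefix = _recv_exact(sock, 2)
            if prefix is None:
                return None
            (length,) = struct.unpack("!H", prefix)
            return _recv_exact(sock, length)
    except OSError:
        return None


if __name__ == "__main__":
    server, port = "192.0.2.53", 53  # placeholder, replace with the dnsdist address
    requests = [
        ("udp", "why-is-dnsdist-not-responding.to.this.", "A"),
        ("tcp", "why-is-dnsdist-not-responding.to.this.", "A"),
        ("udp", "google.com.", "ANY"),
        ("udp", "google.com.", "A"),
    ]
    for transport, qname, qtype in requests:
        send = send_udp if transport == "udp" else send_tcp
        reply = send(server, port, build_query(qname, qtype))
        status = "no reply" if reply is None else "%d bytes" % len(reply)
        print(transport.upper(), qtype, qname, "->", status)
```

Running it while strace is attached should make each query easy to match
against the "recvmsg resumed" lines by its qname.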
Do you collect some metrics via Prometheus? I don't see a carbon export,
you might want to send some metrics to our public metronome server
for a while, just from one box, we might spot something there.
I'll configure this shortly to the public metronome server.
Also, apart from Debian being upgraded from Stretch to Buster and
dnsdist from 1.3.x to 1.4.0-beta2, did anything else change in your setup?
To be clear, I actually installed a new copy of Debian, I didn't upgrade
the existing stretch install.
The dnsdist configuration changed slightly:
- I originally wrote a Lua function for load balancing. Now I am using
poolAvailable with rules so I can use a built-in method.
- The rules were tidied up a bit. Previously each dnsdist instance had
leftover rules that were no longer required.
- The cache sizes were adjusted.
dnsdist mailing list