Hi Remi,

On 8/08/2019 4:46 pm, Remi Gacogne wrote:
> That's actually one of the most readable configuration I have seen in a
> while, don't worry ;-)

Good to know, still got some work to do on it to make it more friendly though.

> That's very weird, I don't see anything unusual in your configuration,
> the backtrace seems to indicate that all threads are working as
> expected, and I even see some UDP queries being received and forwarded
> in the strace (albeit very few, you can spot them easily by looking for
> "recvmsg resumed" with grep).

That is strange. When the issue occurs, the instance receives minimal traffic apart from the health-checking service that controls which IPs are announced with BGP.

I just noticed even stranger behavior. I restarted the dnsdist instance and sent traffic to it to reproduce the issue. While it was working I made an 'ANY' query for google.com. Once the issue occurred I could still send that query and get an answer (both with UDP and TCP). My guess is that queries for names that were not in the cache are what stopped working.

> Would you mind providing a 'lsof -n -p <pid of dnsdist>' while it's

The lsof output is available here:


> Would you by any chance be able to do a strace when it's stuck,
> while at the same time sending a few UDP queries to it, ideally with an
> easily recognizable qname like "why-is-dnsdist-not-responding.to.this." ?

The strace output is available here:


During the strace I performed 4 requests (in order):

- UDP A request for why-is-dnsdist-not-responding.to.this. (not working)
- TCP A request for why-is-dnsdist-not-responding.to.this. (working)
- UDP ANY request for google.com (working)
- UDP A request for google.com (not working)
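
A rough sketch of how those four requests could be repeated using only the Python standard library. The server address, the helper names (build_query, send_query, run_checks) and the 2 second timeout are placeholders chosen for illustration; the original queries were not necessarily sent this way.

```python
import socket
import struct

QTYPES = {"A": 1, "ANY": 255}

def build_query(qname, qtype, query_id=0x1234):
    """Build a minimal DNS query packet: one question, recursion desired."""
    # Header: ID, flags (RD bit set), QDCOUNT=1, other counts 0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels terminated by a zero byte
    labels = [label for label in qname.rstrip(".").split(".") if label]
    name = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    # QTYPE followed by QCLASS=IN (1)
    return header + name + struct.pack("!HH", QTYPES[qtype], 1)

def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed early")
        buf += chunk
    return buf

def send_query(server, qname, qtype, use_tcp=False, port=53, timeout=2.0):
    """Return the raw response bytes, or None on timeout or network error."""
    packet = build_query(qname, qtype)
    try:
        if use_tcp:
            with socket.create_connection((server, port), timeout=timeout) as s:
                # DNS over TCP prefixes each message with a 2-byte length
                s.sendall(struct.pack("!H", len(packet)) + packet)
                (length,) = struct.unpack("!H", recv_exact(s, 2))
                return recv_exact(s, length)
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.settimeout(timeout)
            s.sendto(packet, (server, port))
            return s.recvfrom(4096)[0]
    except OSError:
        return None

def run_checks(server):
    """Send the four requests listed above, in the same order."""
    checks = [
        ("why-is-dnsdist-not-responding.to.this.", "A", False),
        ("why-is-dnsdist-not-responding.to.this.", "A", True),
        ("google.com", "ANY", False),
        ("google.com", "A", False),
    ]
    for qname, qtype, use_tcp in checks:
        reply = send_query(server, qname, qtype, use_tcp)
        proto = "TCP" if use_tcp else "UDP"
        status = "answered" if reply is not None else "no answer"
        print(f"{proto} {qtype} {qname}: {status}")
```

Calling run_checks() with the address of the dnsdist instance under test prints one line per request, which makes it easy to line the results up with the timestamps in the strace.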

> Do you collect some metrics via prometheus? I don't see a carbon export,
> you might want to send some metrics to our public metronome server [1]
> for a while, just from one box, we might some spot something there.

I'll configure the export to the public metronome server shortly.

> Also, apart from Debian being upgraded from Stretch to Buster and
> dnsdist from 1.3.x to 1.4.0-beta2, did anything else change in your setup?

To be clear, I installed a fresh copy of Debian; I didn't upgrade the existing Stretch install.

The dnsdist configuration changed slightly:

- I originally wrote a Lua function for load balancing. Now I am using poolAvailable with rules so I can use a built-in method.
- The rules were tidied up a bit; previously each dnsdist instance had leftover rules that were no longer required
- The cache sizes were adjusted

