RE: Bind9 stops responding for some clients

2019-06-06 Thread Browne, Stuart via bind-users
Congratulations on finding the cause.

Sometimes, it's the simplest of things.

Stuart


___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: Bind9 stops responding for some clients

2019-06-05 Thread Gregory Sloop
Thanks for the idea.
I did resolve this a day or two ago.

The story is:
This server was a fairly recent replacement for an older Ubuntu setup. The new 
server and the old one are/were both VMs, but on different VM platforms. 
The old VM was turned off, and was marked never to start unless manually 
started. [There were a few other things on the VM host that had yet to be 
migrated - so we didn't want it entirely off quite yet.]

The problem happened again in the last day or two - and packet captures showed 
that no packets were even arriving at the new VM.
Since there really wasn't anything that should be blocking that traffic, I 
checked the arp table on a problem client. 
The arp table showed an "incorrect" MAC address for the current BIND server. 
[The MAC in the arp table didn't match the MAC for the new VM.]
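
A minimal sketch of the ARP check described above, assuming a Linux client 
whose neighbour cache is readable in the /proc/net/arp column layout. The 
function names, the 192.0.2.x addresses and the MAC values are hypothetical 
and only illustrate the comparison; they are not taken from the actual network.

```python
def parse_arp_table(text):
    """Map IP address -> MAC from text in the /proc/net/arp column layout."""
    table = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 4:
            # Column 0 is the IP address, column 3 is the hardware address.
            table[fields[0]] = fields[3].lower()
    return table


def mac_matches(table, ip, expected_mac):
    """True if the cached MAC equals expected_mac, False if it differs,
    None if the IP is not in the table at all."""
    seen = table.get(ip)
    if seen is None:
        return None
    return seen == expected_mac.lower()


# Hypothetical sample captured on a "problem" client.
SAMPLE = """IP address       HW type     Flags       HW address            Mask     Device
192.0.2.53       0x1         0x2         02:00:00:aa:bb:01     *        eth0
192.0.2.1        0x1         0x2         02:00:00:aa:bb:fe     *        eth0
"""

if __name__ == "__main__":
    table = parse_arp_table(SAMPLE)
    # The new VM is assumed to own 02:00:00:aa:bb:02, so this reports a mismatch.
    print(mac_matches(table, "192.0.2.53", "02:00:00:aa:bb:02"))
```

On a real client the same functions could be fed the contents of 
/proc/net/arp instead of the sample string.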

While I didn't have the MAC address for the "old" deactivated server handy, it 
was the first obvious problem/solution to check.
Sure enough, after connecting to the VM hypervisor, I could see that the "old" 
BIND VM was active.

I killed it, and service returned to normal.

So, the "solution" was pretty routine.
What made it more "interesting" and perhaps odd is how seemingly randomly the 
problem would crop up. 
And it would only impact some clients, not all. There was no pattern that 
seemed to explain why some got the current/correct BIND server and others 
didn't. [The arp poisoning certainly wasn't anywhere near universal.]
And why was it so infrequent? It would go many days between issues.
I have to assume the bad VM had been up for some time, at least since the 
problems started.
There are quite a number of odd-ish other things too, but not worth detailing.

Probably it's just one of those "undefined" situations where you can't 
anticipate some predictable order to what happens when you screw it up. Rather 
than burn additional time trying to grok what was going on - it's simply best 
to say "don't do that - bad things happen, though I can't say what bad things 
will happen and in which logical order. They just will - so DON'T DO THAT!"

[And yeah, I obviously knew all about not doing that. But it happened anyway, 
in spite of specific steps to prevent it. I'm still not sure why.]

In the end, it's a somewhat complicated story with a very obvious cause - but 
it wasn't so clear at the outset.

TL;DR version:
Don't run your old and new BIND servers on the same IP address - either by 
accident or intentionally. Bad stuff will happen! 
It might be really odd, or it might be plain as day - but in either case it 
won't be good! :)
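
One way to spot this kind of conflict is to compare what several clients have 
cached for the same IP. The sketch below assumes the (client, ip, mac) 
observations have already been collected somehow; the function name and the 
sample values are hypothetical.

```python
from collections import defaultdict


def find_conflicting_ips(observations):
    """Return {ip: sorted MACs} for every IP that was seen with more than one MAC.

    observations is an iterable of (client, ip, mac) tuples.
    """
    macs_by_ip = defaultdict(set)
    for _client, ip, mac in observations:
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


if __name__ == "__main__":
    # Hypothetical observations: two clients disagree about who owns 192.0.2.53.
    seen = [
        ("client-a", "192.0.2.53", "02:00:00:aa:bb:02"),
        ("client-b", "192.0.2.53", "02:00:00:aa:bb:01"),
        ("client-a", "192.0.2.1", "02:00:00:aa:bb:fe"),
    ]
    print(find_conflicting_ips(seen))
```

An IP that shows up here with two MACs is a strong hint that two hosts are 
answering for the same address.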

Thanks all for the suggestions! 
Here's hoping I don't need to ask for BIND assistance for another 20 years! :)

-Greg





Re: Bind9 stops responding for some clients

2019-06-05 Thread Gordon Lang
I just randomly spotted this post, and thought I would toss in 2¢

How many NICs and how many IPs are on the servers?  Are the failing
clients on the same subnet as the server?

--
Gordon A. Lang



Re: Bind9 stops responding for some clients

2019-05-30 Thread Warren Kumari
On Thu, May 30, 2019 at 8:10 PM Gregory Sloop  wrote:
>
> So, this is a very odd situation and I'm kind of grasping at straws here.
> So, I've come to see if any of you have any good straws!
>
> The setup.
> ---
> Ubuntu 18.04 LTS is the distro we're running on.
> All software is packaged [from the distro] - not compiled from sources.
> Bind9 acting as a recursive resolver for a smallish network. 150 seats.
> They're also handling DHCP and Chrony/NTP requests.
> [I actually have a pair of these handling DNS/DHCP/NTP this is the master.]
>
> They are running on a Xen/XCP VM.
>
> The one I'm having problems with is the master for several internal zones - 
> the one that's working fine is the slave for those same zones. None of the 
> zones are large.
>
> Intermittently, Bind9 simply stops handling queries from *some* hosts.
> Meaning, it simply times out for responses for those hosts.
> Yet BIND *is* working fine for lots of other machines on the same networks. 
> It's working fine doing dig queries locally on the server, and handles dns 
> queries fine for lots of other machines. Yet, again, some machines simply get 
> time-outs. I can't find any pattern to which machines get timeouts and which 
> don't.

This is probably a really long shot, but is it possible that the
machines which don't work are trying to use TCP to query the server
(e.g. because of weird MTU issues, or similar)?
I recently ran into sporadic issues where BIND would simply stop
listening on TCP -- there would be nothing in the logs, but netstat
would confirm that there was suddenly nothing listening on TCP 53.
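
A rough sketch of the same check from a client's point of view: try to open a 
TCP connection to port 53 and report whether anything accepts it. The 
function name is hypothetical, and this only tests that something is 
listening, not that it answers DNS queries correctly.

```python
import socket


def tcp_port_open(host, port=53, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # 127.0.0.1 is just a placeholder; substitute the resolver's address.
    print(tcp_port_open("127.0.0.1", 53, timeout=1.0))
```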

I created a Prometheus rule to monitor for this:
  - name: DNS TCP
    rules:
      - alert: DNS Port 53 down on ron.
        expr: probe_success{instance="{{server}}",job="dns_tcp_port"} == 0 or up{job="dns_tcp_port"} == 0
        for: 5m
        labels:
          severity: page
        annotations:
          identifier: '{{ $labels.instance }}'
          summary: "DNS Port 53 down on Ron {{ $labels.instance }}"
          description: "{{ $labels.instance }} probe_success returned {{ $value }}"

and it fired twice -- and then I upgraded to BIND 9.12.4-P1 and the
problem hasn't happened since...
The obvious questions:
1: what was I running on this machine before? I think 9.12.
-- will have to check git for more detail
2: why didn't I file a bug report / take a dump / something? I kept
meaning to, but it always broke at inopportune times, so I'd just
restart and plan to do a better job next time...

W




-- 
I don't think the execution is relevant when it was obviously a bad
idea in the first place.
This is like putting rabid weasels in your pants, and later expressing
regret at having chosen those particular rabid weasels and that pair
of pants.
   ---maf


Re: Bind9 stops responding for some clients

2019-05-30 Thread Gregory Sloop
Ugh. Not wanting to packet capture. :)
[Yeah, not that hard, but it always seems to suck up so much time - it's like 
the black hole for time, I think.]
But, yeah, absent some other smoking gun, that's probably where we're headed. 

As for rate limiting - "rndc recursing" didn't show anything being rate 
limited. [No output. I assume that means there's nothing there. And ISC claims 
that rate limiting isn't turned on by default - I certainly haven't enabled 
anything related to rate limits.]

There's no firewalling/filtering between any of the affected clients and the 
DNS servers.
Without going into too much detail, the network's pretty flat. 
There are (4) /24 subnets, but they're all passing through an L3 switch for 
"routing." 
Essentially all filtering occurs at the border only, so, SPI or other stuff 
shouldn't be in the mix here.

Sigh. 
As my colleague said...
"Heh. How, um, fun?"

-Greg



-- 
Gregory Sloop, Principal: Sloop Network & Computer Consulting
Voice: 503.251.0452 x82
EMail: gr...@sloop.net
http://www.sloop.net
---

RE: Bind9 stops responding for some clients

2019-05-30 Thread Browne, Stuart via bind-users
Whilst you mentioned 150 seats and 'no firewalls', you didn't mention the 
network topology at all. In particular, is traffic passing through a 
commercial firewall/router (hardware or virtualized) to get to the DNS server? 
If there is one, it may be worth checking what packet inspection is turned on 
for DNS traffic (Cisco, Juniper and Checkpoint have been known to have buggy 
inspection routines in the past).

It might also be worthwhile to see what your open filehandles are like and 
whether there's any rate limiting configured in the distributed BIND 
configuration.
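
For the filehandle check, a minimal sketch assuming a Linux host where 
/proc/<pid>/fd and /proc/<pid>/limits are readable (reading another user's 
process, such as named, typically needs root). Both function names are 
hypothetical.

```python
import os


def count_open_fds(pid):
    """Count the entries in /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def soft_nofile_limit(limits_text):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits text,
    or None if the line is absent or the limit is 'unlimited'."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft = line.split()[3]  # fields: Max, open, files, soft, hard, units
            return int(soft) if soft.isdigit() else None
    return None


if __name__ == "__main__":
    # Inspect this script's own process as a stand-in for the named PID.
    pid = os.getpid()
    with open(f"/proc/{pid}/limits") as fh:
        limit = soft_nofile_limit(fh.read())
    print(count_open_fds(pid), "open of", limit)
```

If the open count sits close to the soft limit during an outage, descriptor 
exhaustion would be worth a closer look.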

Stuart



Re: Bind9 stops responding for some clients

2019-05-30 Thread John W. Blue
Good job on the amount of troubleshooting work done so far.

Next steps should be to run tcpdump on the interface for port 53 to see what is 
happening when an outage is in progress.  What you will be looking for 
specifically is the query packet in and the response packet out.

Use the following command:

tcpdump -n -i eth0 port domain and host 172.24.67.32

Swap out eth0 for whatever you have configured and the host IP address for a 
host that is having problems.
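
Once a capture exists, pairing queries with responses by eye gets tedious. 
The sketch below assumes tcpdump's usual one-line text output for DNS over 
IPv4 (source and destination as address.port, followed by the query ID); the 
exact format can differ between tcpdump versions, so treat the regular 
expression as an assumption. The function name and the server address 
172.24.67.1 are hypothetical.

```python
import re

# Assumed line shape: "... IP <src>.<sport> > <dst>.<dport>: <id>..."
LINE_RE = re.compile(r"IP6? (\S+)\.(\d+) > (\S+)\.(\d+): (\d+)")


def unanswered_queries(lines):
    """Return (client_ip, client_port, query_id) for queries with no response seen."""
    pending = {}
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        src, sport, dst, dport, qid = m.groups()
        if dport == "53":
            # Packet heading to the server: remember it as an open query.
            pending[(src, sport, qid)] = line
        elif sport == "53":
            # Packet coming from the server: it closes the matching query.
            pending.pop((dst, dport, qid), None)
    return list(pending)


if __name__ == "__main__":
    # Hypothetical capture: the second query never gets an answer.
    capture = [
        "14:02:11.100000 IP 172.24.67.32.40512 > 172.24.67.1.53: 4242+ A? www.example.com. (33)",
        "14:02:11.101000 IP 172.24.67.1.53 > 172.24.67.32.40512: 4242 1/0/0 A 192.0.2.10 (49)",
        "14:02:16.100000 IP 172.24.67.32.51000 > 172.24.67.1.53: 4243+ A? www.example.com. (33)",
    ]
    print(unanswered_queries(capture))
```

Queries that arrive but never get a response point at the server; queries 
that never arrive at all point at the network path.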

John



Bind9 stops responding for some clients

2019-05-30 Thread Gregory Sloop
So, this is a very odd situation and I'm kind of grasping at straws here.
So, I've come to see if any of you have any good straws!

The setup.
---
Ubuntu 18.04 LTS is the distro we're running on. 
All software is packaged [from the distro] - not compiled from sources.
Bind9 acting as a recursive resolver for a smallish network. 150 seats.
They're also handling DHCP and Chrony/NTP requests.
[I actually have a pair of these handling DNS/DHCP/NTP; this is the master.]

They are running on a Xen/XCP VM.

The one I'm having problems with is the master for several internal zones - the 
one that's working fine is the slave for those same zones. None of the zones 
are large.

Intermittently, Bind9 simply stops handling queries from *some* hosts. 
Meaning, it simply times out for responses for those hosts.
Yet BIND *is* working fine for lots of other machines on the same networks. 
It's working fine doing dig queries locally on the server, and handles dns 
queries fine for lots of other machines. Yet, again, some machines simply get 
time-outs. I can't find any pattern to which machines get timeouts and which 
don't.

I've checked - no firewalls, fail2ban or the like that might be causing this. 
No SELinux/AppArmor.
Hosts that can't do DNS queries can ping the DNS server fine. 
[So, there's at least some network pathway to the DNS machine.]

Review of the BIND logs doesn't show anything that looks like a problem to me.
[But I'm not sure what keywords I ought to be looking for, in an effort to find 
symptoms/problems.]

Finally, the two bind/dhcp/ntp servers are currently running on the same Xen 
host, so if it's somehow host related, I'd expect both to have problems, but 
they don't.

Top doesn't show any CPU distress.
Processes look fine.
Memory in use is far below what's allocated to the machine. [1G allocated, like 
<400M used.]
Restart of BIND doesn't do anything, at least in the cases I've seen - which 
aren't all that many yet.
A restart of the whole VM does appear to fix the issue immediately.
These appear to occur every 3-5 days.
Oh, and if you simply wait, it eventually starts handling queries for all hosts 
again - but it might be a couple+ hours.

Any suggestions on things I might hunt for in the logs in an attempt to figure 
out what's happening?
Other suggestions for things to look for/consider?

TIA
-Greg