On 03/16/2013 02:21 PM, Vernon Schryver wrote:
From: Phil Mayers <p.may...@imperial.ac.uk>

In the last 12 hours, we've had repeated instances of named getting
wedged. The symptoms are:

* named consuming nearly 100% CPU, all in user-time
* lots of queries apparently not processed, and based on query
logging, a sharp drop in the rate of queries that are

* a very sharp drop (almost a complete halt, in fact) in the rate of
RPZ "hits" in the logs at the exact time this happens
* no other interesting logs, as far as I can see

How can the rate of RPZ hits not drop along with a sharp drop in the
rate of queries?

Let me rephrase:

Successful queries dropped by 3-6x. RPZ matches dropped by a far, far larger fraction - essentially to zero.

In addition, I have no reason to believe the actual rate of queries dropped - instead, that named started to answer a subset, due to CPU load.


Examination of the journal suggests they deleted and re-added more or
less every record in the zone (presumably an error at their side).

Wouldn't deleting more or less every record in the response policy zone
tend to reduce the rate of RPZ hits?

Yes. But I would not expect named CPU to rise to 100% and stay there.

Does anyone else slave the Spamhaus RPZ and saw this? It seems like
there might be a bind bug here with large updates to RPZ.

Not to defend RPZ, but what is the evidence that links RPZ to the
problem problem, even if it is related to large updates of a zone
instead of, for example, NSEC chains?

I see the large XFR of the RPZ zone, then seconds later the service falls off-net. This happened on 4 separate servers.

Conversely, incremental re-signing of ic.ac.uk/imperial.ac.uk takes places regularly throughout the day (we're using "auto-dnssec maintain"). Nothing special happened there.

I have pretty much all logging going to one file or another, and I couldn't find any events for about 10 minutes prior to that which seem a likely candidate.


How many times did named get wedged? According to the theory that
the problem is related to large updates of policy zones, there
should have been at most 3 instances of wedged named processes per
computer and they should have happened during or soon after the end
of large rpz.spamhaus.org transfers.

We saw 4 large RPZ transfers:

22:44 - service unaffected
00:02 - named goes to 100% CPU
07:19 - service unaffected
11:02 - named goes to 100% CPU

It is difficult to tell, with the info I have to hand (logs, journal) but I am assuming that the pattern was either:

bulk delete
bulk add
bulk delete
bulk add

...or

bulk add
bulk delete
bulk add
bulk delete

The number of records in the XFR would tend to support that, but I cannot be certain since bind doesn't give me the info to know that.


My logs have these instances of transfers of rpz.spamhaus.org involving
at least 100 messages during March (NTP disciplined UTC timestamps):

   02-Mar-2013 21:45:42.511 07-Mar-2013 22:47:56.423 08-Mar-2013 03:19:46.419
   08-Mar-2013 03:26:50.262 08-Mar-2013 07:27:13.176 08-Mar-2013 07:33:29.203
   08-Mar-2013 10:07:05.829 08-Mar-2013 11:18:09.837 15-Mar-2013 22:52:02.969
   16-Mar-2013 00:04:14.447 16-Mar-2013 07:21:07.576 16-Mar-2013 11:06:46.515

The last 4 you have correspond to what I see.

Maybe it's not RPZ, but I don't see how it could be DNSSEC; our signed zones are slaved from a hidden master, and the XFRs were the usual trickle of resignings.

If it's not RPZ, those xfr timings are very coincidental. But of course, I'm just guessing. It could be phase of the moon for all I know at this stage.
_______________________________________________
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users

Reply via email to