Sporadic but noticeable SERVFAILs in specific nodes of an anycast resolving farm running BIND

2014-03-05 Thread Kostas Zorbadelos

Greetings to all,

we operate an anycast caching resolving farm for our customer base,
based on CentOS (6.4 or 6.5), BIND (9.9.2, 9.9.5 or the stock CentOS
package BIND 9.8.2rc1-RedHat-9.8.2-0.23.rc1.el6_5.1) and quagga (the
stock CentOS package).

The problem is that we have noticed sporadic but noticeable SERVFAILs on
3 of our 10 machines. Cacti measurements obtained via the BIND XML
statistics interface show traffic ranging from 1.5K queries/sec (least
loaded machines) to 15K queries/sec (most loaded). On 3 specific
machines in one geographic location, SERVFAILs start to appear some
time after a BIND restart, anywhere between half an hour and several
hours later. These 3 machines do not carry the highest load in the farm
(6-8K q/sec). The resolution problems are noticeable to the customers
who end up on these machines, but they do not show up as high numbers
in the BIND XML resolver statistics (the ServFail counter).

We reproduce the problem by querying for a specific domain name using
a loop of the form

while [ 1 ]; do clear; rndc flushname www.linux-tutorial.info; sleep 1;
dig www.linux-tutorial.info @localhost; sleep 2; done  | grep SERVFAIL

www.linux-tutorial.info is of course not the only domain experiencing
resolution problems. The above loop can run for hours without issues
during low-traffic hours (at night, after a clean BIND restart), but
during the day it shows quite a few SERVFAILs, and other domains are
affected as well.
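
A variant of the loop above that records every attempt with a timestamp,
rather than only the SERVFAIL lines, may make it easier to line results
up against a packet capture. This is only a sketch: the function names
dig_status and probe are made up for illustration, and it assumes that
dig prints its usual ";; ->>HEADER<<-" line containing "status: ...".

```sh
# Extract the response code (NOERROR, SERVFAIL, ...) from dig output on stdin.
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Flush one name, query it again, and log the time and the response code.
# $1 = name to test, $2 = resolver to query (both chosen by the caller).
probe() {
    name=$1 server=$2
    while :; do
        rndc flushname "$name"
        sleep 1
        status=$(dig "$name" "@$server" | dig_status)
        # NOREPLY marks attempts where dig printed no header at all (e.g. a timeout)
        printf '%s %s %s\n' "$(date '+%H:%M:%S')" "$name" "${status:-NOREPLY}"
        sleep 2
    done
}
```

It could be run as "probe www.linux-tutorial.info localhost | tee probe.log"
and the log compared with the tcpdump timestamps afterwards.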

While the problem occurs, we see with tcpdump that when a SERVFAIL is
produced, no query packet leaves the server for resolution. We have
noticed nothing in the BIND logs (we even tried raising the debugging
level and logging all relevant categories). An example capture taken
while running the above loop:

# tcpdump -nnn -i any -p dst port 53 or src port 53 | grep 'linux-tutorial'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes

14:33:03.590908 IP6 ::1.53059 > ::1.53: 15773+ A? www.linux-tutorial.info. (41)
14:33:03.591292 IP 83.235.72.238.45157 > 213.133.105.6.53: 19156% [1au] A? www.linux-tutorial.info. (52)
 Success

14:33:06.664411 IP6 ::1.45090 > ::1.53: 48526+ A? www.linux-tutorial.info. (41)
14:33:06.664719 IP6 2a02:587:50da:b::1.23404 > 2a00:1158:4::add:a3.53: 30244% [1au] A? www.linux-tutorial.info. (52)
 Success

14:33:31.434209 IP6 ::1.43397 > ::1.53: 26607+ A? www.linux-tutorial.info. (41)
 SERVFAIL

14:33:43.672405 IP6 ::1.58282 > ::1.53: 27125+ A? www.linux-tutorial.info. (41)
 SERVFAIL

14:33:49.706645 IP6 ::1.54936 > ::1.53: 40435+ A? www.linux-tutorial.info. (41)
14:33:49.706976 IP6 2a02:587:50da:b::1.48961 > 2a00:1158:4::add:a3.53: 4287% [1au] A? www.linux-tutorial.info. (52)
 Success
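
The pattern in this capture (a client query to the local resolver that
is not followed by any query towards an authoritative server) could be
picked out automatically. The following is a rough sketch, not a tested
tool: the function name no_upstream is invented here, it assumes
standard tcpdump text output of the form "time proto src > dst: ...",
and it assumes the input has already been filtered to a single name that
is flushed before each attempt, so that cache hits do not count as
failures.

```sh
# Read filtered tcpdump lines on stdin and print the timestamp of every
# client query to the loopback resolver that was not followed by an
# outgoing query to port 53 before the next client query arrived.
no_upstream() {
    awk '
        $5 ~ /^(::1|127\.0\.0\.1)\.53:$/ {      # client -> local resolver
            if (pending != "") print pending, "no upstream query"
            pending = $1
            next
        }
        $5 ~ /\.53:$/ { pending = "" }          # resolver -> authoritative server
        END { if (pending != "") print pending, "no upstream query" }
    '
}
```

For example: tcpdump -l -nnn -i any -p dst port 53 | grep --line-buffered
'linux-tutorial' | no_upstream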

The main actions we have taken on the problem machines are:

- change the BIND version (we initially used a custom-compiled 9.9.2,
  moved to 9.9.5, and finally switched to the CentOS stock package
  9.8.2rc1). We saw the problem with all versions.

- disable iptables (we use a ruleset with connection tracking on all of
  our machines, with no problems on the other machines in the
  farm). Again, no solution.

- introduce a query-source-v6 address in named.conf (we already had
  query-source). Each machine has a single physical interface and 3
  loopbacks with the anycast IPs, announced via Quagga ospfd to the rest
  of the network. No solution.

The main difference between these 3 machines and the rest is IPv6
operation. Those machines are dual stack, with a /30 (v4) and a /127 (v6)
on the physical interface. Needless to say, the next trial is to
remove the relevant IPv6 configuration.

I understand that there are many parameters to the problem; we have
been trying to debug the issue for several days now. Any suggestion,
suspicion or hint is highly welcome. I can provide all sorts of traces
from the machines (I already have pcap files from the moment of the
problem, plus pstack, rndc status, OS process limits, rndc recursing
and rndc dumpdb -all output, collected according to

https://kb.isc.org/article/AA-00341/0/What-to-do-with-a-misbehaving-BIND-server.html)

Thanks in advance,

Kostas
 
-- 
Kostas Zorbadelos   
twitter: @kzorbadelos  http://gr.linkedin.com/in/kzorba

()  www.asciiribbon.org - against HTML e-mail  proprietary attachments
/\  
___
Please visit https://lists.isc.org/mailman/listinfo/bind-users to unsubscribe 
from this list

bind-users mailing list
bind-users@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind-users


Re: Sporadic but noticeable SERVFAILs in specific nodes of an anycast resolving farm running BIND

2014-03-05 Thread Klaus Darilion

Does it only happen for IPv6 DNS requests? Maybe it is related to this:
https://open.nlnetlabs.nl/pipermail/nsd-users/2014-January/001783.html

klaus



Re: Sporadic but noticeable SERVFAILs in specific nodes of an anycast resolving farm running BIND

2014-03-05 Thread Marco Davids (SIDN)
On 05/03/14 15:15, Klaus Darilion wrote:
> Does it only happen for IPv6 DNS requests? Maybe it is related to this:
> https://open.nlnetlabs.nl/pipermail/nsd-users/2014-January/001783.html

Or, less likely, this:

http://marc.info/?l=linux-netdev&m=139352943109400&w=2

--
Marco




RE: Regarding HMAC-SHA256 and RSASHA512 key generation algorithm in dnssec-keygen

2014-03-05 Thread Gaurav Kansal
Hi Tony,

Thanks for the help.

I was wondering: if HMAC* keys are not used for zones, why are they
listed when we run dnssec-keygen -h?

Regards,

Gaurav Kansal

-----Original Message-----
From: Tony Finch [mailto:fa...@hermes.cam.ac.uk] On Behalf Of Tony Finch
Sent: Monday, March 3, 2014 3:58 AM
To: Gaurav Kansal
Cc: bind-users@lists.isc.org
Subject: Re: Regarding HMAC-SHA256 and RSASHA512 key generation algorithm in
dnssec-keygen

Gaurav Kansal <gaurav.kan...@nic.in> wrote:

> I have doubt in this only. What's the difference between Zone or Host ??

Zone keys are used for DNSSEC signing zones.

Host keys are used for TSIG transaction authentication, for securing zone
transfers or dynamic updates.

> I also want to know which algorithm is the best one on security
> aspects for generating Keys for DNSSEC.

Your security is affected more by how you store the keys than anything else.
RSASHA256 is fine.

Tony.

--
f.anthony.n.finch  <d...@dotat.at>  http://dotat.at/
Faeroes: East or southeast 5 to 7. Rough or very rough. Rain. Moderate.


Regarding zone trf from master to slave

2014-03-05 Thread Gaurav Kansal
Dear Team,

We are running slave services for our customers.

We want to have a log of which entries were changed on the master (and
thus caused the zone transfer) at the time of each zone transfer.

I want to know whether it is possible to generate some sort of log
(either by using query channels or by any other means) which we can save
for future reference.

Thanks and Regards,

Gaurav Kansal


Re: Regarding zone trf from master to slave

2014-03-05 Thread Tony Finch
Gaurav Kansal <gaurav.kan...@nic.in> wrote:

> We are running slave services for our customers.
>
> We want to have log of what entries has been changed in the master (which is
> causing this zone transfer) at the time of zone transfer.
>
> I want to know whether it is possible to have some sort of log generation
> (either by using query channels or by any other means) which we can save for
> future reference purposes.

Are the zone journal files on the slaves useful for solving your problem?

e.g. my nameserver logs

05-Mar-2014 09:36:19.992 general: info: zone cam.ac.uk/IN/auth: transferred serial 1394009951
05-Mar-2014 09:36:19.992 xfer-in: info: transfer of 'cam.ac.uk/IN/auth' from 2001:630:212:8::d:a0#53: Transfer completed: 16 messages, 5572 records, 935172 bytes, 0.118 secs (7925186 bytes/sec)
[...]
05-Mar-2014 15:54:30.008 general: info: zone cam.ac.uk/IN/auth: transferred serial 1394024357
05-Mar-2014 15:54:30.008 xfer-in: info: transfer of 'cam.ac.uk/IN/auth' from 2001:630:212:8::d:a0#53: Transfer completed: 1 messages, 266 records, 34454 bytes, 0.009 secs (3828222 bytes/sec)

If I run named-journalprint I can work out the contents of the second
IXFR based on the SOA serial numbers, starting with the line deleting
the SOA with the previously transferred serial, and ending with the last
contiguous add line after the SOA with the current serial.
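
The selection rule described above could be scripted roughly as follows.
This is a sketch under assumptions: journal_slice is a made-up helper
name, and it assumes named-journalprint prints one record per line as
"del|add name ttl class type rdata...", so that the serial is the eighth
whitespace-separated field of an SOA line.

```sh
# Print the part of named-journalprint output (read from stdin) that
# takes the zone from serial $1 to serial $2.
journal_slice() {
    awk -v from="$1" -v to="$2" '
        # start at the line deleting the SOA with the old serial
        state == 0 && $1 == "del" && $5 == "SOA" && $8 == from { state = 1 }
        # after the new SOA was added, stop at the first non-add line
        state == 2 && $1 != "add" { exit }
        state >= 1 { print }
        # the add of the SOA with the new serial opens the final run of adds
        state == 1 && $1 == "add" && $5 == "SOA" && $8 == to { state = 2 }
    '
}
```

With the serials from the log lines above, and a hypothetical journal
file name, this might be used as
"named-journalprint cam.ac.uk.jnl | journal_slice 1394009951 1394024357".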

Tony.
-- 
f.anthony.n.finch  d...@dotat.at  http://dotat.at/
Fisher, German Bight: South or southwest 3 or 4, increasing 5 or 6. Slight
becoming moderate. Fog patches in east, rain later. Moderate, occasionally
very poor in east.


Re: Regarding zone trf from master to slave

2014-03-05 Thread Graham Clinch
Hi,

> We want to have log of what entries has been changed in the master
> (which is causing this zone transfer) at the time of zone transfer.

Two options come to mind:

1) Log the output of 'dig -t ixfr=2014030501 example.org' occasionally,
updating the serial to query for changes since the last run.  If the
master doesn't provide IXFR, you could enable 'ixfr-from-differences' on
a slave and then query the slave.

2) If your slave has access to a zone journal (because the master
supports IXFRs or you have 'ixfr-from-differences' enabled), log the
output of 'named-journalprint example.org.jnl'
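
Option 1 could be automated along these lines. This is a sketch only:
the function names and the state and log file arguments are invented for
illustration, and it assumes that the first SOA record in dig's IXFR
output carries the newest serial, with the serial as the seventh
whitespace-separated field.

```sh
# Print the serial of the first SOA record in dig output read from stdin.
first_soa_serial() {
    awk '$1 !~ /^;/ && $4 == "SOA" { print $7; exit }'
}

# Fetch the changes since the serial saved in the state file, append them
# to the log, and remember the new serial for the next run.
# $1 = zone, $2 = server to query, $3 = state file, $4 = log file
log_ixfr() {
    zone=$1 server=$2 state=$3 log=$4
    last=$(cat "$state") || return 1
    out=$(dig -t "ixfr=$last" "$zone" "@$server") || return 1
    new=$(printf '%s\n' "$out" | first_soa_serial)
    [ -n "$new" ] || return 1           # no SOA seen: transfer failed or was refused
    [ "$new" = "$last" ] && return 0    # zone unchanged, nothing to log
    { date -u '+%Y-%m-%dT%H:%M:%SZ'; printf '%s\n' "$out"; } >> "$log"
    printf '%s\n' "$new" > "$state"
}
```

The state file has to be seeded once with a starting serial, for example
"echo 2014030501 > example.org.serial", after which log_ixfr can be run
from cron.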

Graham



Re: Regarding HMAC-SHA256 and RSASHA512 key generation algorithm in dnssec-keygen

2014-03-05 Thread Alan Clegg
On 3/6/14, 12:40 AM, Gaurav Kansal wrote:

> I was wondering if HMAC* keys are not used for zone then why the same is
> displayed when we use dnssec-keygen -h

Because dnssec-keygen is used to generate more than just DNSSEC zone keys.

AlanC




Re: Regarding HMAC-SHA256 and RSASHA512 key generation algorithm in dnssec-keygen

2014-03-05 Thread Carsten Strotmann
Gaurav Kansal <gaurav.kan...@nic.in> writes:

> I was wondering if HMAC* keys are not used for zone then why the same
> is displayed when we use dnssec-keygen -h.

The tool dnssec-keygen can be used to create both zone keys (with
-n ZONE) for DNSSEC zone signing, and host keys (with -n HOST) for
TSIG signing of the communication between hosts.

Keys of type zone are public/private key pairs
(https://en.wikipedia.org/wiki/Public-key_cryptography), whereas keys of
type host are symmetric keys
(https://en.wikipedia.org/wiki/Symmetric-key_algorithm).

To add to the confusion, dnssec-keygen generates two files when used
with -n HOST:

shell> dnssec-keygen -a HMAC-MD5 -b 512 -n HOST ns1.example.com
Kns1.example.com.+157+16495
shell> ls -l Kns1.example.com.+157+16495.*
-rw-------  1 cas  staff  124 Mar  6 08:48 Kns1.example.com.+157+16495.key
-rw-------  1 cas  staff  229 Mar  6 08:48 Kns1.example.com.+157+16495.private

These are symmetric TSIG keys; both files contain the same secret key
(although the filename extensions might suggest a public/private key
pair)!

To create a DNSSEC zone key, use:

shell> dnssec-keygen -a RSASHA512 -b 2048 -n ZONE example.com
Generating key pair...+++ ..+++
Kexample.com.+010+18335
shell> ls -l Kexample.com.+010+18335.*
-rw-r--r--  1 cas  staff   607 Mar  6 08:51 Kexample.com.+010+18335.key
-rw-------  1 cas  staff  1777 Mar  6 08:51 Kexample.com.+010+18335.private

This time the file with the extension .key contains the public key
(DNSKEY) resource record, and the file with the extension .private
contains the private key.

I agree that it might be nice to change dnssec-keygen to make the tool
more user-friendly. The current state of things is a result of the
historic developments through which DNSSEC came into being.

-- Carsten