Hey... I'm the OP. We're using a mix of client tools. For Windows systems (which aren't affected by this) we use nsclient++. For our Linux servers, NRPE... for UNIX (Solaris) and OS X we're using check_by_ssh. Both the NRPE and check_by_ssh clients are affected by this.

I'm willing to give the caching nameserver on the server a try, but as others have noted, I don't think it will make a difference, as it's the local test on the client that's failing to resolve. I surely cannot set up a caching nameserver on all clients...
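
One way to confirm that the delay really is on the client side is to time a lookup there while one of the nameservers is down. What follows is only a rough sketch, not anything from our setup: the function name time_lookup and the default host name are made up for illustration, and it assumes Python is available on the client.

```python
import socket
import sys
import time


def time_lookup(name, resolver=socket.getaddrinfo):
    """Resolve `name` and return (succeeded, elapsed_seconds).

    `resolver` defaults to the system resolver via socket.getaddrinfo;
    it is a parameter only so the timing logic can be exercised
    without touching real DNS.
    """
    start = time.monotonic()
    try:
        resolver(name, None)
        ok = True
    except OSError:  # socket.gaierror is a subclass of OSError
        ok = False
    return ok, time.monotonic() - start


if __name__ == "__main__":
    # "somehost" is a placeholder; pass a real internal name instead.
    host = sys.argv[1] if len(sys.argv) > 1 else "somehost"
    ok, elapsed = time_lookup(host)
    print(f"{host}: resolved={ok} elapsed={elapsed:.2f}s")
```

If the elapsed time on an affected client jumps by several seconds while its first-listed nameserver is rebooting, that would point at the client's resolver rather than at the Nagios server.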

 A. Davis
 Email:     [email protected]

 "There is no limit to what a man can accomplish
  if he doesn't care who gets the credit." - Ronald Reagan



Martin Melin wrote:
I don't know if I'm misreading the OP, but if the plugins start timing out on only the boxes whose primary DNS is being rebooted, would adding a caching DNS server to the Nagios box really make a difference?

I think the root cause of these timeouts is that the Nagios plugin timeout is happening before the connection to the primary DNS on the target machine has a chance to time out and then connect to the secondary DNS.

The correct course of action to resolve this would be to either make sure that the DNS connection on the target machines fails quicker, or that Nagios/the plugin waits longer for a result from the check. The DNS failover is working as designed here, but you're not giving it enough time to kick in.
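
To make the timing argument concrete, here is a back-of-the-envelope sketch. It is a deliberately simplified model, not a description of any particular resolver: it assumes every lookup made during a check waits one full resolver timeout on the dead first nameserver before moving to the second. The function names and the sample numbers are illustrative assumptions. On glibc systems the per-nameserver wait can be set with "options timeout:N" in resolv.conf (see resolv.conf(5), default 5 seconds); whether Solaris and OS X honour the same option should be checked against their own man pages.

```python
def failover_delay(dns_timeout_s, lookups):
    """Seconds lost waiting on a dead first nameserver.

    Simplified model: each lookup pays one full resolver timeout
    before the resolver moves on to the next nameserver.
    """
    return dns_timeout_s * lookups


def check_fits(plugin_timeout_s, check_runtime_s, dns_timeout_s, lookups):
    """True if the check should finish before the plugin timeout fires."""
    total = failover_delay(dns_timeout_s, lookups) + check_runtime_s
    return total < plugin_timeout_s


if __name__ == "__main__":
    # Illustrative numbers only: a 10 s plugin timeout, a check that
    # needs 1 s of real work, and 2 name lookups per check.
    for dns_timeout in (5, 1):
        fits = check_fits(10, 1, dns_timeout, 2)
        print(f"resolver timeout {dns_timeout}s -> fits in plugin timeout: {fits}")
```

Under these assumed numbers, a 5 second resolver timeout uses up the whole plugin timeout on failover alone, while a 1 second resolver timeout leaves plenty of headroom. Raising the plugin timeout instead moves the same inequality from the other side.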

On Tue, Jun 9, 2009 at 5:37 PM, Russell Adams <[email protected]> wrote:

    Really the best choice is to use caching DNS on the Nagios
    server. I'd recommend dnsmasq; it just does caching locally without
    needing to do big zone transfers. It has low overhead and simple
    configuration as a result.

    Enjoy.

    On Tue, Jun 09, 2009 at 11:19:20AM -0400, Andrew Davis wrote:
    > I've observed an interesting issue with Nagios. Our environment is a mix
    > of UNIX, Linux, Apple, and Windows. The core of the network is Active
    > Directory, including two AD servers that are both our primary, internal
    > DNS servers. All non-Windows systems have a resolv.conf that looks like:
    >
    >    nameserver 10.1.1.13
    >    nameserver 10.1.1.14
    >    domain int.our.domain
    >    search int.our.domain
    >
    > About half of the servers have the nameserver entries inverted
    > (i.e., .14 first, .13 second).
    >
    > The issue is that any time one of the nameservers is rebooted (at least
    > once a month if staying current on patches thanks to Black Tuesdays),
    > whichever hosts have that nameserver listed first in their resolv.conf
    > start throwing the following errors:
    >
    >    CRITICAL - Plugin timed out while executing system call.
    >
    > This occurs for multiple tests for each host. Obviously, there's a name
    > resolution correlation here. If the nameserver with .13 is rebooted, all
    > hosts (about half of them) that list this IP first in their resolv.conf
    > then time out for multiple tests. If the .14 server is rebooted, all the
    > other hosts time out. Interestingly, none of the Windows clients issue
    > errors... only UNIX, Linux, and Macs... only those with an
    > /etc/resolv.conf. The end result is a host of "false positives", but
    > more importantly it looks bad on availability reports and causes
    > phones/pagers to go ballistic with unneeded emails.
    >
    > I'm trying to find a solution and I can't find one that I like:
    >
    > Solution 1) is to cluster the DNS servers. We have lots of clusters
    > here. This isn't good, though, as you don't normally cluster DNS
    > servers... they're meant to be redundant for a reason... one fails and
    > it uses the next one.
    >
    > Solution 2) is to set up a service/host dependency. My thought would be
    > either a host dependency that says if either .13 or .14 is down, then
    > don't alert for any other host that uses them. Or a service-to-host
    > dependency... if the DNS service is down, then don't alert on any of
    > these dependent hosts. Honestly, I'm not sure if you can mix host and
    > service dependencies like this... plus... if the DNS server is actually
    > down, then the DNS service is down, so better to use a host dependency.
    > The problem is that now we're not alerting on any dependent hosts, which
    > themselves could have a legitimate issue we want to know about. Plus,
    > what happens if the DNS server actually dies and takes a few hours/days
    > to rebuild/restore? At that point, the dependent hosts aren't watched
    > for a very long time.
    >
    > Solution 3) is to set up a UNIX/Linux DNS server that slaves all zones
    > from the AD servers and have all UNIX/Linux/Apple clients query
    > this server. This would work except that A) I need two of them to keep
    > redundancy and B) I've now added an extra layer of complication to
    > accommodate an application (Nagios)... not exactly good practice.
    >
    > Solution 4) is to set the timeout value of a host querying a DNS server.
    > Perhaps adjust the client to time out on the first listed nameserver
    > after only 10 seconds, then try the next one? Since most Nagios tests
    > have a minimum timeout value of 30 seconds, if the first DNS query timed
    > out after 10 seconds, it would go to the next one with, hopefully,
    > enough time to respond. The downside is having to adjust every single
    > server.
    >
    > Has anyone else seen this? Anyone else using Windows AD servers to
    > provide DNS for *nix servers?
    >

    > ------------------------------------------------------------------------------
    > Crystal Reports - New Free Runtime and 30 Day Trial
    > Check out the new simplified licensing option that enables unlimited
    > royalty-free distribution of the report engine for externally facing
    > server and web deployment.
    > http://p.sf.net/sfu/businessobjects
    > _______________________________________________
    > Nagios-users mailing list
    > [email protected]
    > https://lists.sourceforge.net/lists/listinfo/nagios-users
    > ::: Please include Nagios version, plugin version (-v) and OS when
    > reporting any issue.
    > ::: Messages without supporting info will risk being sent to /dev/null


    ------------------------------------------------------------------
    Russell Adams                            [email protected]

    PGP Key ID:     0x1160DCB3           http://www.adamsinfoserv.com/

    Fingerprint:    1723 D8CA 4280 1EC9 557F  66E8 1154 E018 1160 DCB3

    
