Re: [Nagios-users] Nagios2 process overwhelmed by NSCA daemon?

2009-12-10 Thread Greg Pangrazio
Are you running the full nagios on the slaves?  Do the checks seem
to be working on those hosts?

Greg Pangrazio
pangr...@gmail.com





On Wed, Dec 9, 2009 at 5:06 PM, Jonathan Call jc...@verio.net wrote:
 I recently added two new slaves to a distributed Nagios system. The
 central server now passively processes 17,000+ service checks on 3000+
 servers.

 It's been over an hour and a half since I brought those new slaves
 online and I have about 150 hosts still stuck in 'Pending' and about
 1300 services in the same state. In addition to that it seems that the
 service check results from the other slaves that were working normally
 are now arbitrarily disappearing. For example, on one host three of the
 service checks have been updated relatively recently (i.e. 5-30 minutes
 ago) but three other service checks haven't been updated for almost an
 hour. The slaves all appear operational and the hosts are being checked
 on time. Is it possible I've overwhelmed Nagios' ability to process data
 from the NSCA daemon or struck some internal Nagios bottleneck? Any
 suggestions would be appreciated.

 Jonathan


 This email message is intended for the use of the person to whom it has been 
 sent, and may contain information that is confidential or legally protected. 
 If you are not the intended recipient or have received this message in error, 
 you are not authorized to copy, distribute, or otherwise use this message or 
 its attachments. Please notify the sender immediately by return e-mail and 
 permanently delete this message and any attachments. Verio, Inc. makes no 
 warranty that this email is error or virus free.  Thank you.

 --
 Return on Information:
 Google Enterprise Search pays you back
 Get the facts.
 http://p.sf.net/sfu/google-dev2dev
 ___
 Nagios-users mailing list
 Nagios-users@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/nagios-users
 ::: Please include Nagios version, plugin version (-v) and OS when reporting 
 any issue.
 ::: Messages without supporting info will risk being sent to /dev/null




Re: [Nagios-users] Nagios2 process overwhelmed by NSCA daemon?

2009-12-10 Thread Marcel
In my last job I dealt with a Nagios install a little larger than yours.

On Wed, Dec 9, 2009 at 9:06 PM, Jonathan Call jc...@verio.net wrote:

 I recently added two new slaves to a distributed Nagios system. The
 central server now passively processes 17,000+ service checks on 3000+
 servers.

 It's been over an hour and a half since I brought those new slaves
 online and I have about 150 hosts still stuck in 'Pending' and about
 1300 services in the same state. In addition to that it seems that the
 service check results from the other slaves that were working normally
 are now arbitrarily disappearing. For example, on one host three of the
 service checks have been updated relatively recently (i.e. 5-30 minutes
 ago) but three other service checks haven't been updated for almost an
 hour. The slaves all appear operational and the hosts are being checked
 on time. Is it possible I've overwhelmed Nagios' ability to process data
 from the NSCA daemon or struck some internal Nagios bottleneck? Any
 suggestions would be appreciated.


It had 4K servers and just over 24K service checks, spread across 12 or 13
distributed servers.

I ran into many kinds of problems because of Nagios's poor design for
distributed monitoring. It looks as if the distributed setup was added almost
as a quick patch to overcome some limitation.

We ended up writing some custom passive plugins. They were built to send
status updates only when a state changed, which greatly reduced the load on
the NSCA side (it was load balanced behind a virtual IP and used batch
updates, but problems would still occur). This set of plugins was a little
hard to maintain, because each server's configuration had to live on the
monitored server itself; puppet ftw. All checks were logged and later
synchronized with NDO to keep the history of last checks.
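The "send only on state change, in batches" idea can be sketched roughly as follows. This is a minimal illustration and not the plugins described above: the class and function names are hypothetical, and the only external fact assumed is that send_nsca reads tab-delimited service results (host, service description, return code, plugin output) from standard input, one per line.

```python
from typing import Dict, Iterable, Tuple

ServiceKey = Tuple[str, str]           # (host_name, service_description)
Result = Tuple[str, str, int, str]     # host, service, return code, output


class StateChangeFilter:
    """Remember the last return code per service and report only changes."""

    def __init__(self) -> None:
        self._last_state: Dict[ServiceKey, int] = {}

    def changed(self, host: str, service: str, return_code: int) -> bool:
        key = (host, service)
        previous = self._last_state.get(key)
        self._last_state[key] = return_code
        # The first result ever seen for a service counts as a change,
        # so the central server gets at least one result for it.
        return previous != return_code


def format_nsca_line(host: str, service: str, return_code: int, output: str) -> str:
    """Build one tab-delimited passive service result for send_nsca's stdin."""
    return f"{host}\t{service}\t{return_code}\t{output}\n"


def build_batch(results: Iterable[Result], state_filter: StateChangeFilter) -> str:
    """Return the text to pipe into a single send_nsca call (may be empty)."""
    return "".join(
        format_nsca_line(host, service, code, output)
        for host, service, code, output in results
        if state_filter.changed(host, service, code)
    )
```

With a filter like this, the central server sees far fewer results, so any freshness checking configured there would have to tolerate long gaps between updates.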

NDO and the database schema had to be modified too. The volume of inserts
was far too high to be handled in a timely manner: recurring database
restarts caused stale results, and we hit every sort of problem managing
those systems, even after thorough tuning of the database. After adding
logic to update only when a state change occurred, plus a separate batch
update for the last-check fields, the database load returned to normal and
scalability could be proven.

So what I'd suggest is to first apply the large-installation tweaks: tmpfs
for status.dat, objects.cache and retention.dat, batch jobs that feed
send_nsca output to the central/master Nagios instance, and so on. You can
also do some Nagios setup magic: have the distributed nodes check at a
frequency (normal_check_interval) different from what the central Nagios
expects. Say, set up the central Nagios to wait for status information every
30 minutes, but have the distributed nodes send it every 15 minutes,
something like that.

As far as I know, it's really cumbersome to build an enterprise-scale Nagios
configuration. For tiny and trivial installs it's like using Zennoss or
Zabbixx, but with a lot of extra configuration-file pain. I don't think any
competing tool (Z*bbnn*ssxx) would scale either when you need huge enterprise
installs, so Nagios is a little ahead and gives flexibility, but at a cost
that scares anyone off (and they end up buying another tool that does much
less for much more).

That's why I like Gabès Jean's Shinken approach to scalability and to easing
interoperability with puppet. That would be the übber-super-mega-ultra tool.
With nginx and asynchronous front-end, back-end, and checks, it would end up
as the most robust, easy, enterprise NMS.

So, Gèan, continue on that path to keep Shinken backward compatible with
Nagios setups, but also think ahead in the design about integrating puppet to
handle configuration convergence (maybe event handlers too?).

Cheers,
M

[Nagios-users] check_snmp with regular expression

2009-12-10 Thread shadih rahman
List,
   I am trying to use the check_snmp plugin with the following regular
expression and I am getting an error. Can someone point out what I am doing
wrong?  Thanks



/usr/lib64/nagios/plugins/check_snmp -H hostname -C community -o
.1.3.6.1.2.1.1.6.0 -r ^*.some string*$

Could Not Compile Regular Expressioncheck_snmp: Could not parse arguments




-- 
Cordially,
Shadhin Rahman

Re: [Nagios-users] check_snmp with regular expression

2009-12-10 Thread Greg Pangrazio
did you mean
^*.some string.*$

notice the period before the second *

Greg Pangrazio
pangr...@gmail.com





On Thu, Dec 10, 2009 at 9:18 AM, shadih rahman shadhi...@gmail.com wrote:
 List,
    I am trying to use check_snmp plugin with the following regular
 expression and I am getting an error, can someone point out what am I doing
 wrong.  Thanks



 /usr/lib64/nagios/plugins/check_snmp -H hostname -C community -o
 .1.3.6.1.2.1.1.6.0 -r ^*.some string*$

 Could Not Compile Regular Expressioncheck_snmp: Could not parse arguments




 --
 Cordially,
 Shadhin Rahman





Re: [Nagios-users] check_snmp with regular expression

2009-12-10 Thread Martin Melin
It looks like you're trying to match some string, no matter where it
appears in the document. In that case, anchoring to line beginning and end
is just extra work. Simply match on some string, and you're good to go.

The asterisk is a modifier to the dot, so it needs to come after that. So
the regex you pasted should probably be ^.*some string.*$, but this is
functionally equivalent to some string.
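The equivalence can be checked with a small sketch. Two assumptions apply: Python's re module stands in for the POSIX regex engine that check_snmp uses, so behaviour may differ at the edges, and the sample values are invented single-line strings. The second half uses shlex to show how a POSIX-style shell would split the unquoted pattern at the space before the plugin ever sees it (shlex does not model glob expansion of `*`, which a real shell may also perform).

```python
import re
import shlex


def matches(pattern: str, value: str) -> bool:
    """Return True if the pattern is found anywhere in the value."""
    return re.search(pattern, value) is not None


# Hypothetical single-line SNMP values, for illustration only.
samples = ["rack 4, some string, row 2", "some string", "nothing relevant"]

# The anchored form and the bare substring agree on every sample.
anchored = [matches(r"^.*some string.*$", s) for s in samples]
plain = [matches(r"some string", s) for s in samples]

# Unquoted, the space splits the pattern into two separate arguments.
unquoted = shlex.split("check_snmp -r ^.*some string.*$")
# Single quotes keep the whole pattern together as one argument.
quoted = shlex.split("check_snmp -r '^.*some string.*$'")

print(anchored, plain)
print(unquoted)
print(quoted)
```

So whichever pattern is used, it is probably worth quoting it on the command line, for example -r 'some string'.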

Regards,
Martin Melin

On Thu, Dec 10, 2009 at 4:18 PM, shadih rahman shadhi...@gmail.com wrote:

 List,
I am trying to use check_snmp plugin with the following regular
 expression and I am getting an error, can someone point out what am I doing
 wrong.  Thanks



 /usr/lib64/nagios/plugins/check_snmp -H hostname -C community -o
 .1.3.6.1.2.1.1.6.0 -r ^*.some string*$

 Could Not Compile Regular Expressioncheck_snmp: Could not parse arguments




 --
 Cordially,
 Shadhin Rahman




Re: [Nagios-users] Nagios2 process overwhelmed by NSCA daemon?

2009-12-10 Thread Jonathan Call
Yes, Full Nagios is running on the slaves. They use OCP_daemon to pass on data 
to the central server since the NSCA client can't hack the load. They seem to 
be sending data properly to the NSCA daemon. 

Part of the issue I've tracked down to the status.cgi. The central server 
appears to be underpowered when it comes to both having Nagios process data AND 
have several people pounding out host/service status queries from the web 
interface. I will be adding another CPU to see if this helps, however I'm 
dismayed that Nagios on the central server doesn't seem to be reporting any 
errors, or indicating that there is any problem processing passive results. 
Nagios just starts to lose the data at a certain point.
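Since Nagios itself reports nothing, one way to notice lost passive results is to scan the status file for services that were never checked or whose last check is too old. The sketch below is only an illustration under assumptions: it expects brace-delimited blocks of key=value lines with host_name, service_description, has_been_checked and last_check fields, and it accepts both "servicestatus" and "service" as block names because the name may differ between Nagios versions. The function names and the threshold are hypothetical.

```python
import re
from typing import Dict, Iterator, List

BLOCK_HEADER = re.compile(r"^(\w+)\s*\{$")


def parse_blocks(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per 'name { key=value ... }' block in the text."""
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = BLOCK_HEADER.match(line)
        if header:
            current = {"_type": header.group(1)}
        elif line == "}" and current is not None:
            yield current
            current = None
        elif current is not None and "=" in line:
            # Split on the first '=' only; plugin output may contain more.
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()


def stale_services(text: str, now: int, max_age_seconds: int) -> List[str]:
    """List 'host/service' entries that are pending or older than the limit."""
    stale = []
    for block in parse_blocks(text):
        if block["_type"] not in ("servicestatus", "service"):
            continue
        checked = block.get("has_been_checked", "0") == "1"
        last_check = int(block.get("last_check", "0"))
        if not checked or now - last_check > max_age_seconds:
            stale.append(f"{block.get('host_name')}/{block.get('service_description')}")
    return stale
```

Run from cron against a copy of the status file, a report like this would at least show which slaves' results stop arriving, and when.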

Jonathan 

 -Original Message-
 From: Greg Pangrazio [mailto:pangr...@gmail.com]
 Sent: Thursday, December 10, 2009 7:26 AM
 To: Jonathan Call
 Cc: nagios-user Mailinglist
 Subject: Re: [Nagios-users] Nagios2 process overwhelmed by NSCA daemon?
 
 Are you running the full nagios on the slaves?  Do the checks seem
 to be working on those hosts?
 
 Greg Pangrazio
 pangr...@gmail.com
 
 
 
 
 





[Nagios-users] Nagios as a Service Resiliency Manager

2009-12-10 Thread Christopher McAtackney
Hi all,

I have a need to control an Active / Passive pair of components and
was wondering if anyone had tackled this problem with Nagios?

The scenario is as follows;

Host A has SERVICE_1 installed and running. Host B has SERVICE_2
installed, but not running.

The desired functionality is to detect when SERVICE_1 is not running
(or that Host A is down / unreachable), and then to start SERVICE_2 on
Host B.

I believe I can do this with Nagios by defining an event handler on
SERVICE_1 which will make the appropriate call to start SERVICE_2 on
Host B.

Would it make sense to store the relationship between SERVICE_1 and
Host B / SERVICE_2 as a service macro, e.g.
$_SERVICE_PASSIVE_HOSTNAME, $_SERVICE_PASSIVE_SERVICENAME?
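The decision logic of such an event handler might look like the sketch below. It assumes the handler is passed the service state and state type (the values Nagios exposes as $SERVICESTATE$ and $SERVICESTATETYPE$) plus the passive partner's host and service name, however those end up being stored. The function names, the ssh transport and the init-script path are all hypothetical, and the sketch only builds the command rather than running it.

```python
from typing import List, Optional


def should_start_passive(state: str, state_type: str) -> bool:
    """Act only on a confirmed failure: CRITICAL in a HARD state."""
    return state == "CRITICAL" and state_type == "HARD"


def build_start_command(passive_host: str, passive_service: str) -> List[str]:
    """Hypothetical remote start: ssh to the partner and run its init script."""
    return ["ssh", passive_host, "/etc/init.d/" + passive_service, "start"]


def handle_event(
    state: str, state_type: str, passive_host: str, passive_service: str
) -> Optional[List[str]]:
    """Return the command to run, or None when nothing should happen."""
    if should_start_passive(state, state_type):
        return build_start_command(passive_host, passive_service)
    return None
```

Waiting for a HARD state avoids reacting to a single failed check, and the handler deliberately has no path that stops the partner again.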

There are too many scenarios in which SERVICE_1 might come back up
to try to automate the switching off of SERVICE_2, I believe, e.g. if
someone pulled a network cable on Host A accidentally, then plugged it
in 15 minutes later - during which time Nagios detects that it is down
and so starts up SERVICE_2. The user then plugs the network lead back
in and now we have two Active instances running - which is what we
specifically wanted to avoid. Even if Nagios detects that the primary
component is up, it's still too late because any Active / Active
overlap will cause problems for this particular application.

I can't think of any way to automate that side of things - but does
the general concept of having Nagios start up a Passive partner make
sense?

Thanks for any insight you have,

Chris



Re: [Nagios-users] Nagios as a Service Resiliency Manager

2009-12-10 Thread Marcel
Maybe this would help:
http://onlamp.com/onlamp/2006/05/25/self-healing-networks.html

On Thu, Dec 10, 2009 at 3:08 PM, Christopher McAtackney
crist...@gmail.com wrote:

 Hi all,

 I have a need to control an Active / Passive pair of components and
 was wondering if anyone had tackled this problem with Nagios?

 The scenario is as follows;

 Host A has SERVICE_1 installed and running. Host B has SERVICE_2
 installed, but not running.

 The desired functionality is to detect when SERVICE_1 is not running
 (or that Host A is down / unreachable), and then to start SERVICE_2 on
 Host B.

 I believe I can do this with Nagios by defining an event handler on
 SERVICE_1 which will make the appropriate call to start SERVICE_2 on
 Host B

 Would it make sense to store the relationship between SERVICE_1 and
 Host B / SERVICE_2 as a service macro, e.g.
 $_SERVICE_PASSIVE_HOSTNAME, $_SERVICE_PASSIVE_SERVICENAME?

 There are too many scenarios in which the SERVICE_1 might come back up
 to try automate the switching off of SERVICE_2 I believe, e.g. if
 someone pulled a network cable on Host A accidently, then plugged it
 in 15 minutes later - during which time Nagios detects that it is down
 and so starts up SERVICE_2. The user then plugs the network lead back
 in and now we have two Active instances running - which is what we
 specifically wanted to avoid. Even if Nagios detects that the primary
 component is up, it's still too late because any Active / Active
 overlap will cause problems for this particular application.

 I can't think of any way to automate that side of things - but does
 the general concept of having Nagios start up a Passive partner make
 sense?

 Thanks for any insight you have,

 Chris




[Nagios-users] obsessive acknowledgment processing

2009-12-10 Thread Cris Daniluk
Hi,

We are currently forwarding checks from multiple Nagios sites into a central
location to create a consolidated view for our operations team. Some sites
have their own operations teams as well who acknowledge issues from time to
time. I set up a contact attached to all services and created a simple
notification command that fires an external command on the central server.
This works great for checks with notifications enabled, but if notifications
are disabled for the service, it obviously does not forward the
acknowledgment.

I looked for an obvious way to work around this but did not find one. Is
there anything that works similarly to ocsp but includes acknowledgments?
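Whatever ends up detecting the acknowledgment, the part that reaches the central server would be an ACKNOWLEDGE_SVC_PROBLEM external command line. The sketch below only formats that line; the function name is hypothetical, and the sticky, notify and persistent values are passed through unchanged because their accepted values should be checked against the external command documentation for the Nagios version in use.

```python
def format_ack_command(
    timestamp: int,
    host: str,
    service: str,
    author: str,
    comment: str,
    sticky: int = 1,
    notify: int = 0,
    persistent: int = 1,
) -> str:
    """Build one ACKNOWLEDGE_SVC_PROBLEM line for the Nagios command file."""
    # Fields are ';'-separated, so a stray ';' or newline would corrupt the line.
    for name, value in (("host", host), ("service", service), ("author", author)):
        if ";" in value or "\n" in value:
            raise ValueError(f"{name} must not contain ';' or a newline")
    if "\n" in comment:
        raise ValueError("comment must not contain a newline")
    return (
        f"[{timestamp}] ACKNOWLEDGE_SVC_PROBLEM;{host};{service};"
        f"{sticky};{notify};{persistent};{author};{comment}\n"
    )
```

A line built this way could be forwarded to the central server by whatever transport already carries the notification-based version.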

Thanks,

Cris