Re: [Nagios-users] High check latency in a machine with low load

Mike Guthrie Tue, 11 Oct 2011 07:28:28 -0700

If ndoutils starts to create a heavy burden on the system you can also 
offload ndoutils/mysql to a second machine.  We wrote the below document 
for Nagios XI, but the doc has the info you'd need to make it work for 
Nagios Core as well.


http://library.nagios.com/library/products/nagiosxi/documentation/462-offloading-mysql-to-remote-server



Javier Vela Diago wrote:
> I have a lot of custom checks, written mostly in Perl and Bash, with some 
> in Python, and some of them take a lot of time.
>
> Never mind, I think I found the solution, or at least part of it. I 
> set use_large_installation_tweaks to 1. Six months ago this option 
> almost crashed my system, so I discarded it. Now, with 
> bigger problems, it was the last thing I wanted to test, but this 
> afternoon I finally tried it.
>
> When I restarted Nagios, the load grew to 6-8 and 
> the latency problems disappeared. I was sceptical about the usefulness of 
> this option, but when the load goes from 2.5 to 6, it means that 
> the machine is doing a lot of work that it wasn't doing before.
>
> Now the problem is that NDOUtils is causing some latency because of 
> MySQL, but at least I know what to optimize. Any tips would be 
> appreciated :)
>
> Thank you, and sorry for taking up your time.
>
>
> From:        Daniel Wittenberg <daniel.wittenberg.r...@statefarm.com>
> To:          Nagios Users List <nagios-users@lists.sourceforge.net>
> Date:        11/10/2011 16:02
> Subject:     Re: [Nagios-users] High check latency in a machine with low load
> ------------------------------------------------------------------------
>
>
>
> I think you have the enable_high_latency option enabled :)  j/k
>  
> Do you have any particular checks that are taking a long time?  i.e. 
> can you watch top and see checks taking a while?
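One way to act on this suggestion without staring at top is to filter a process listing by elapsed time. The following is only a sketch: the helper names `etime_to_seconds` and `slow_processes` are made up for illustration, the plugin paths in the sample are invented, and it assumes the listing comes from something like `ps -eo etime,pid,args`, whose elapsed-time column has the form `[[dd-]hh:]mm:ss`.

```python
def etime_to_seconds(etime):
    """Convert a ps elapsed time of the form [[dd-]hh:]mm:ss to seconds."""
    days, _, clock = etime.rpartition("-")
    parts = [int(p) for p in clock.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad a missing hours field
    hours, minutes, seconds = parts
    total = hours * 3600 + minutes * 60 + seconds
    return (int(days) if days else 0) * 86400 + total


def slow_processes(ps_output, min_seconds):
    """Return (seconds, pid, command) tuples, longest-running first."""
    rows = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3 or not fields[1].isdigit():
            continue  # header or malformed line
        seconds = etime_to_seconds(fields[0])
        if seconds >= min_seconds:
            rows.append((seconds, int(fields[1]), fields[2]))
    return sorted(rows, reverse=True)


# Invented sample in the layout of `ps -eo etime,pid,args`.
sample = """\
    ELAPSED   PID COMMAND
      00:02  4101 /opt/checks/check_fast.sh
      01:05  4102 /usr/bin/perl /opt/checks/check_slow.pl
      00:45  4103 /usr/bin/python /opt/checks/check_api.py
"""
for seconds, pid, command in slow_processes(sample, 30):
    print(f"{seconds:5d}s  pid {pid}  {command}")
```

Checks that keep showing up near the configured timeout are the ones worth looking at first.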
>  
> Dan
>  
>  
> From: Javier Vela Diago [mailto:jv...@s2grupo.es]
> Sent: Tuesday, October 11, 2011 6:23 AM
> To: nagios-users@lists.sourceforge.net
> Subject: [Nagios-users] High check latency in a machine with low load
>  
> Hi,
>
> I have a Nagios 3.2.3 deployment with 1000+ hosts and 3000+ services. 
> Nagios runs together with NDO and PNP (in bulk mode) on a server 
> with 4 GB of RAM and 4 CPUs.
>
> One day I noticed that the check latency shown in the performance CGI was 
> very high (300-400 seconds). That seemed very strange, so I took the tuning 
> guide from Nagios 
> (http://nagios.sourceforge.net/docs/3_0/tuning.html) and applied every 
> point I could. In particular, I set max_concurrent_checks 
> to zero (no limit):
>
> max_concurrent_checks=0
>
> The reaper settings:
>
> service_reaper_frequency=5
> max_check_result_reaper_time=15
>
> and checked that the host checks were not forced. In addition, I 
> configured a 15-second host check cache:
>
> cached_host_check_horizon=15
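To confirm which of these directives are actually set in the main configuration file, the values can be read back with a few lines of script. This is a minimal sketch and not part of the original post: the helper name `read_directives` is hypothetical, and it assumes the plain `key=value` layout with `#` comments shown in the quoted settings.

```python
# Directive names as quoted in the message above.
TUNING_KEYS = (
    "max_concurrent_checks",
    "service_reaper_frequency",
    "max_check_result_reaper_time",
    "cached_host_check_horizon",
)


def read_directives(text, keys=TUNING_KEYS):
    """Return {directive: value} for the requested keys found in text."""
    found = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep and name.strip() in keys:
            # In this sketch a later occurrence simply overwrites an earlier one.
            found[name.strip()] = value.strip()
    return found


# Sample text built from the values quoted in the post.
sample = """
# tuning values
max_concurrent_checks=0
service_reaper_frequency=5
max_check_result_reaper_time=15
cached_host_check_horizon=15
"""
settings = read_directives(sample)
for key in TUNING_KEYS:
    print(key, "=", settings.get(key, "(not set)"))
```

In practice the text would come from reading the real configuration file instead of the inline sample.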
>
> But the problem remains, and the load on the server is not very high: 
> a load of 2.5, 2 GB of free memory and an average disk utilization of 
> 7%. I disabled NDO and PNP, but it made no difference. After the first round 
> of checks the latency returns, while the load on the server doesn't grow.
>
> I have searched on Google, but all the reports I found trace the problem 
> to load on the server, and that is not the main problem here. So my 
> question is: what can I do now? Is there some variable that shows me 
> where to look? I'm a bit lost right now and I don't know how to find 
> the problem.
>
> Or maybe the only way is to configure a master-slave Nagios setup in order 
> to maximize server utilization?
>
> In addition, I have pretty big timeouts (60 seconds) because of the 
> high latency on the network. All your help is appreciated. Thank you 
> in advance.
> nagiostats:
> Nagios Stats 3.2.3
> Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
> Last Modified: 10-03-2010
> License: GPL
>
> CURRENT STATUS DATA
> ------------------------------------------------------
> Status File:                            /usr/local/argos/aplicaciones/nagios/var/status.dat
> Status File Age:                        0d 0h 0m 11s
> Status File Version:                    3.2.3
>
> Program Running Time:                   0d 20h 56m 7s
> Nagios PID:                             21834
> Used/High/Total Command Buffers:        0 / 0 / 4096
>
> Total Services:                         4032
> Services Checked:                       4032
> Services Scheduled:                     4030
> Services Actively Checked:              4032
> Services Passively Checked:             0
> Total Service State Change:             0.000 / 37.300 / 0.163 %
> Active Service Latency:                 32.876 / 442.138 / 415.816 sec
> Active Service Execution Time:          0.051 / 60.097 / 1.545 sec
> Active Service State Change:            0.000 / 37.300 / 0.163 %
> Active Services Last 1/5/15/60 min:     237 / 1530 / 4020 / 4020
> Passive Service Latency:                0.000 / 0.000 / 0.000 sec
> Passive Service State Change:           0.000 / 0.000 / 0.000 %
> Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0
> Services Ok/Warn/Unk/Crit:              3766 / 38 / 44 / 184
> Services Flapping:                      0
> Services In Downtime:                   0
>
> Total Hosts:                            931
> Hosts Checked:                          931
> Hosts Scheduled:                        931
> Hosts Actively Checked:                 931
> Host Passively Checked:                 0
> Total Host State Change:                0.000 / 12.370 / 0.077 %
> Active Host Latency:                    0.000 / 441.308 / 416.063 sec
> Active Host Execution Time:             0.062 / 10.113 / 0.395 sec
> Active Host State Change:               0.000 / 12.370 / 0.077 %
> Active Hosts Last 1/5/15/60 min:        74 / 423 / 931 / 931
> Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
> Passive Host State Change:              0.000 / 0.000 / 0.000 %
> Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
> Hosts Up/Down/Unreach:                  897 / 24 / 10
> Hosts Flapping:                         0
> Hosts In Downtime:                      1
>
> Active Host Checks Last 1/5/15 min:     109 / 535 / 1583
>   Scheduled:                           87 / 433 / 1300
>   On-demand:                           22 / 102 / 283
>   Parallel:                            87 / 438 / 1323
>   Serial:                              0 / 0 / 0
>   Cached:                              22 / 97 / 260
> Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
> Active Service Checks Last 1/5/15 min:  304 / 1605 / 4924
>   Scheduled:                           304 / 1605 / 4923
>   On-demand:                           0 / 0 / 1
>   Cached:                              0 / 0 / 0
> Passive Service Checks Last 1/5/15 min: 0 / 0 / 0
>
> External Commands Last 1/5/15 min:      0 / 0 / 0
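The latency rows in this output are the ones to watch while tuning. A minimal sketch for pulling them out of captured nagiostats text, assuming the three columns are min / max / avg (which matches the figures above); the function name `parse_triplet` is made up for illustration and the sample rows are copied from the quoted output.

```python
import re

# Matches rows such as "Active Service Latency:  32.876 / 442.138 / 415.816 sec"
ROW = re.compile(
    r"^(?P<label>[^:]+):\s+"
    r"(?P<min>[\d.]+) / (?P<max>[\d.]+) / (?P<avg>[\d.]+) sec\s*$"
)


def parse_triplet(output, label):
    """Return (min, max, avg) in seconds for the row named label, or None."""
    for raw in output.splitlines():
        line = raw.lstrip("> ").rstrip()  # tolerate email quote prefixes
        match = ROW.match(line)
        if match and match.group("label").strip() == label:
            return tuple(float(match.group(k)) for k in ("min", "max", "avg"))
    return None


sample = """\
Active Service Latency:                 32.876 / 442.138 / 415.816 sec
Active Service Execution Time:          0.051 / 60.097 / 1.545 sec
Active Host Latency:                    0.000 / 441.308 / 416.063 sec
"""
low, high, avg = parse_triplet(sample, "Active Service Latency")
print(f"service latency: avg {avg:.1f} s, max {high:.1f} s")
```

Logging the average before and after each configuration change makes it easier to tell which change actually helped.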
> nagios -s:
>
> Nagios Core 3.2.3
> Copyright (c) 2009-2010 Nagios Core Development Team and Community Contributors
> Copyright (c) 1999-2009 Ethan Galstad
> Last Modified: 10-03-2010
> License: GPL
>
> Website: http://www.nagios.org
> Warning: aggregate_status_updates directive ignored.  All status file updates are now aggregated.
> Warning: downtime_file variable ignored.  Downtime entries are now stored in the status and retention files.
> Warning: comment_file variable ignored.  Comments are now stored in the status and retention files.
> Timing information on object configuration processing is listed
> below.  You can use this information to see if precaching your
> object configuration would be useful.
>
> Object Config Source: Config files (uncached)
>
> OBJECT CONFIG PROCESSING TIMES      (* = Potential for precache savings with -u option)
> ----------------------------------
> Read:                 0.080036 sec
> Resolve:              0.010660 sec  *
> Recomb Contactgroups: 0.002666 sec  *
> Recomb Hostgroups:    0.004086 sec  *
> Dup Services:         0.034632 sec  *
> Recomb Servicegroups: 0.001277 sec  *
> Duplicate:            0.010939 sec  *
> Inherit:              0.005594 sec  *
> Recomb Contacts:      0.000001 sec  *
> Sort:                 0.000000 sec  *
> Register:             0.074413 sec
> Free:                 0.008730 sec
>                      ============
> TOTAL:                0.234920 sec  * = 0.071741 sec (30.54%) estimated savings
>
>
> RETENTION DATA TIMES
> ----------------------------------
> Read and Process:     0.495480 sec
>                      ============
> TOTAL:                0.495480 sec
>
>
> Timing information on configuration verification is listed below.
>
> CONFIG VERIFICATION TIMES          (* = Potential for speedup with -x option)
> ----------------------------------
> Object Relationships: 0.060039 sec
> Circular Paths:       0.026557 sec  *
> Misc:                 0.005999 sec
>                      ============
> TOTAL:                0.092595 sec  * = 0.026557 sec (28.7%) estimated savings
>
>
> EVENT SCHEDULING TIMES
> -------------------------------------
> Get service info:        0.014509 sec
> Get host info info:      0.002853 sec
> Get service params:      0.000078 sec
> Schedule service times:  0.039947 sec
> Schedule service events: 0.034656 sec
> Get host params:         0.000001 sec
> Schedule host times:     0.007519 sec
> Schedule host events:    0.029519 sec
>                         ============
> TOTAL:                   0.129082 sec
>
>
> Projected scheduling information for host and service checks
> is listed below.  This information assumes that you are going
> to start running Nagios with your current config files.
>
> HOST SCHEDULING INFORMATION
> ---------------------------
> Total hosts:                     931
> Total scheduled hosts:           931
> Host inter-check delay method:   SMART
> Average host check interval:     259.01 sec
> Host inter-check delay:          0.28 sec
> Max host check spread:           30 min
> First scheduled check:           Tue Oct 11 13:14:08 2011
> Last scheduled check:            Tue Oct 11 13:18:26 2011
>
>
> SERVICE SCHEDULING INFORMATION
> -------------------------------
> Total services:                     4032
> Total scheduled services:           4030
> Service inter-check delay method:   SMART
> Average service check interval:     299.55 sec
> Inter-check delay:                  0.07 sec
> Interleave factor method:           SMART
> Average services per host:          4.33
> Service interleave factor:          5
> Max service check spread:           30 min
> First scheduled check:              Tue Oct 11 13:15:07 2011
> Last scheduled check:               Tue Oct 11 13:20:07 2011
>
>
> CHECK PROCESSING INFORMATION
> ----------------------------
> Check result reaper interval:       5 sec
> Max concurrent service checks:      Unlimited
>
>
> PERFORMANCE SUGGESTIONS
> -----------------------
> I have no suggestions - things look okay.
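One way to read the two outputs together is to compare the check rate the schedule requires with the rate Nagios actually achieved. This is a back-of-the-envelope sketch that uses only numbers quoted above. The one assumption (not stated anywhere in the thread) is that, with a steady latency, each service ends up being rechecked roughly every "check interval plus latency" seconds.

```python
# Figures copied from the nagiostats and nagios -s output above.
scheduled_services = 4030
avg_check_interval = 299.55   # seconds, "Average service check interval"
avg_latency = 415.816         # seconds, "Active Service Latency" (avg column)
checks_last_15_min = 4924     # "Active Service Checks Last 1/5/15 min"

# Rate needed to check every scheduled service once per interval.
required_per_min = scheduled_services / avg_check_interval * 60

# Rate actually observed over the last 15 minutes.
observed_per_min = checks_last_15_min / 15

# Assumption: effective period per service is about interval + latency.
effective_period = avg_check_interval + avg_latency
predicted_per_min = scheduled_services / effective_period * 60

print(f"required : {required_per_min:6.1f} checks/min")
print(f"observed : {observed_per_min:6.1f} checks/min")
print(f"predicted: {predicted_per_min:6.1f} checks/min")
```

With these figures the schedule needs roughly 807 service checks per minute, while about 328 per minute were actually run, close to what the "interval plus latency" assumption predicts. That is consistent with Nagios starting checks at well under half the required pace even though the machine is mostly idle, which points at how checks are launched and reaped, not at CPU or memory.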
> -- 
> Javier Vela Diago
> S2 GRUPO
> Ramiro de Maeztu, 7 bajo. 46022 Valencia
> Tel: 963.110.300 Fax: 963.106.086
> e-mail: jvela at s2grupo dot es
> http://www.s2grupo.es
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure contains a
> definitive record of customers, application performance, security
> threats, fraudulent activity and more. Splunk takes this data and makes
> sense of it. Business sense. IT sense. Common sense.
> http://p.sf.net/sfu/splunk-d2d-oct
> _______________________________________________
> Nagios-users mailing list
> Nagios-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when 
> reporting any issue.
> ::: Messages without supporting info will risk being sent to /dev/null
> ------------------------------------------------------------------------


-- 


Mike Guthrie
Technical Team
___
Nagios Enterprises, LLC
Email:  mguth...@nagios.com
Web:    www.nagios.com

