> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:nagios-users-
> [EMAIL PROTECTED] On Behalf Of Trask
> Sent: Wednesday, May 17, 2006 1:09 PM
> To: nagios-users@lists.sourceforge.net
> Subject: [Nagios-users] How to reduce a very high latency number
>
> I am still butting up against very high latency issues with my Nagios
> setup. I feel like I must be missing something obvious because it
> doesn't seem like I have so many services that the servers cannot keep
> up.
>
> As can be seen from the data below, the server with the most service
> checks has the highest latency (usually in the neighborhood of 700
> seconds! -- this is pre-production). Is my problem really this
> simple? I have a feeling that it isn't just the number of checks, but
> I cannot figure out why my latency values are so terrible.
>
> Overview of my setup:
>
> There are 4 servers. 3 distributed servers (nag1, nag2, nag3) at 3
> distinct geographical locations send all their check information via
> NSCA to a 4th, central server (nag4). The connections between all of
> these servers are very high-bandwidth and are nowhere near saturated.
> The only unclear spot to me is the effect that our hardware
> VPN/tunnels might have, however the worst performing server (nag2) is
> on the same LAN as the central server (nag4).
>
> Nagios v2.2, latest plugins and NRPE/NSCA as of today. I am running
> embedded perl with perlcache enabled.
>
> Number of hosts/services:
>   nag1:  43/130
>   nag2: 193/1743
>   nag3:  78/780
>   nag4: (central server - active host checks, passive srvc checks)
>
> Performance Info:
>
> nag1:
> Metric                  Min        Max           Average
> Check Execution Time:   0.00 sec   20.04 sec     0.024 sec
> Check Latency:          0.00 sec   1.01 sec      0.011 sec
> Percent State Change:   0.00%      17.17%        0.01%
>
> nag2:
> Check Execution Time:   0.00 sec   929.13 sec    1.246 sec
> Check Latency:          0.00 sec   1180.67 sec   560.462 sec
> Percent State Change:   0.00%      55.59%        0.07%
>
> nag3:
> Check Execution Time:   0.00 sec   101.70 sec    0.310 sec
> Check Latency:          0.00 sec   602.57 sec    46.023 sec
> Percent State Change:   0.00%      0.00%         0.00%

My first reaction is to question why some checks are taking >15 minutes
to complete (check execution time) and why you are allowing them to go
that long. I only allow a maximum of 60 seconds for any service check
to execute -- (from nagios.cfg)

service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5

Some comparable stats from my servers --

PIII 800/512MB, 828 service checks --

Metric                  Min        Max         Average
Check Execution Time:   0.13 sec   11.59 sec   7.984 sec
Check Latency:          0.76 sec   15.54 sec   6.583 sec
Percent State Change:   0.00%      6.25%       0.03%

All active checks, load hangs out around 2.

Another box, newer hardware, running nagios + cricket --
2x Dual Core AMD Opteron Processor 275, 2GB RAM, 1260 service checks --

Metric                  Min        Max         Average
Check Execution Time:   0.04 sec   35.02 sec   6.675 sec
Check Latency:          0.01 sec   38.16 sec   6.692 sec
Percent State Change:   0.00%      9.47%       0.04%

All active checks, load hangs out between 1 and 2.

--
Marc
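
For anyone who wants to run the same comparison against their own numbers, here is a rough Python sketch. It reads the `*_timeout` directives out of nagios.cfg-style text and checks whether an observed maximum check execution time exceeds `service_check_timeout`. The function names `parse_timeouts` and `exceeds_timeout` are made up for illustration and are not part of Nagios; the sample values are the ones quoted in this thread.

```python
import re

# Timeout directives as quoted from nagios.cfg in this thread.
SAMPLE_CFG = """\
service_check_timeout=60
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
"""


def parse_timeouts(cfg_text):
    """Return {directive: seconds} for each `name_timeout=N` line."""
    timeouts = {}
    for raw_line in cfg_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        match = re.fullmatch(r"(\w+_timeout)\s*=\s*(\d+)", line)
        if match:
            timeouts[match.group(1)] = int(match.group(2))
    return timeouts


def exceeds_timeout(max_exec_seconds, timeouts, key="service_check_timeout"):
    """True if the observed max execution time is above the configured limit."""
    return max_exec_seconds > timeouts[key]


if __name__ == "__main__":
    limits = parse_timeouts(SAMPLE_CFG)
    # Max "Check Execution Time" per server, taken from the quoted stats.
    observed = {"nag1": 20.04, "nag2": 929.13, "nag3": 101.70}
    for server, max_exec in observed.items():
        flag = "over" if exceeds_timeout(max_exec, limits) else "within"
        print(f"{server}: max {max_exec:.2f} sec is {flag} the "
              f"{limits['service_check_timeout']} sec service check timeout")
```

With a 60 second limit, nag2 and nag3 would both be flagged, which is the point of the question about why checks are allowed to run that long.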