Greetings,

A colleague of mine (poctum) and I ran into something like this while using nsca and have crafted a similar solution. We observed that send_nsca was sending only one result to the central Nagios server per connection, even though testing revealed that it can handle thousands of results per connection. Sending only one at a time was causing a lot of dropped data, because roughly 5 results were being generated per second. We enabled aggregate_status_updates in the nagios.cfg file, but this yielded no improvement in the result submissions. BTW, this is Nagios-2.2 and nsca-2.6 on Solaris 10.

Our workaround is a quick-and-dirty but efficient solution. It may not be as refined as trask's, and it relies on nuances of Unix file-handling semantics to get the job done. That said, it's working perfectly for us. Since it seems to work well but may violate Ethan's design intentions, your feedback/input is requested. Deploy at your own risk.
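For reference, send_nsca reads tab-delimited results from standard input, one per line (host, service, return code, plugin output), so a single invocation can carry many results over one connection. A minimal sketch of that batching follows; the host and service names and the format_result helper are made up for illustration, and the actual send_nsca call is left commented out so the sketch runs without NSCA installed.

```shell
#!/bin/sh
# Hypothetical helper: emit one result in the tab-delimited format
# that submit_check_result already produces.
format_result() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# Collect several results into one batch (example data, not real hosts).
BATCH=$(
    format_result web01 HTTP 0 "HTTP OK"
    format_result web01 PING 2 "PING CRITICAL"
    format_result db01 DISK 1 "DISK WARNING"
)

# One invocation, one connection, three results. Uncomment on a real slave:
# printf '%s\n' "$BATCH" | /opt/nagios/bin/send_nsca 172.16.x.x -c /opt/nagios/etc/send_nsca.cfg

# Show what would be sent.
printf '%s\n' "$BATCH"
```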
Jacob Ritorto, Lead UNIX Server Operations Engineer
InnovationsTech

Here's our solution:

1) Altered the last line in /opt/nagios/libexec/eventhandlers/submit_check_result as follows. It appends each check result to a temp file instead of piping it straight to send_nsca.

    #/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" | /opt/nagios/bin/send_nsca 172.16.x.x -c /opt/nagios/etc/send_nsca.cfg
    /bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" >> /opt/nagios/var/results.waiting

2) Created a daemon process called reap (managed by smf, but it has been up for a month so far, so may be ok as an init.d script) to pull aside the aforementioned temp file (results.waiting) every five seconds and send the bits off to the central Nagios server (note that the original file is re-created immediately via step 1 above). This probably only works perfectly on Unix and Unix-like systems, because a renamed file stays intact until the last program referencing it has closed it. It's been some time, but the last I checked, DOS/WINxxxx doesn't treat files this way.

Here's the simple little reap daemon:

    # cat /opt/nagios/bin/reap
    #!/usr/bin/tcsh
    while (1)
        sleep 5
        # Skip this round if nothing has been queued. Without this check a
        # failed mv would leave the previous results.sending in place and
        # it would be sent a second time.
        if (-f /opt/nagios/var/results.waiting) then
            mv /opt/nagios/var/results.waiting /opt/nagios/var/results.sending
            cat /opt/nagios/var/results.sending | /opt/nagios/bin/send_nsca 172.16.x.x -c /opt/nagios/etc/send_nsca.cfg >/dev/null
        endif
    end

Summary: Slave Nagios servers now store up check results in the temp file for 5 seconds, then ship them off to nsca on the central Nagios machine in one swoop instead of one at a time.

*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~

From: Trask <[EMAIL PROTECTED]>
Re: How to reduce a very high latency number
2006-05-23 03:50

On 5/22/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
[EMAIL PROTECTED] wrote on 17.05.2006 20:09:16:

To me this is obviously a performance issue related to hardware. Your machines have far too little RAM. It is simply not possible to run 1800 checks on a 512MB machine in a timely manner.
I figured this out this past Saturday. The hardware is not the problem: I was seeing neither significant load nor excessive memory use. No configuration change I made seemed to have any appreciable effect on the latency times I was getting.

I ended up running "top" at 1-second intervals and just watching it for a while. I noticed that sometimes there would be a good number of nagios processes (20, 30, 40 or so), but the majority of the time there were only 2, 3 or 4. Although I do not know exactly *why* this was happening, it turns out that whenever only 2-4 processes were running, 2 of them were always the "submit_passive_check" script and "send_nsca". It appears that these submissions are done serially (i.e. not in parallel) and block subsequent checks until they finish. I would see these 2 processes running (with steadily increasing PIDs) for up to a minute, and then a short-lived (4-5 seconds) "explosion" of nagios processes (service/host checks). After this flurry of activity, there would be another 60 seconds or so of just 2-4 processes.

I resolved this problem by changing my "submit_passive_check" script. Below are some sample scripts, both old and new. The short of it is this: previously, the "submit_passive_check" script did a printf of the data in the appropriate format and piped it to the "send_nsca" command (in a shell script). I have eliminated this bottleneck by having "submit_passive_check" redirect its output to a named pipe and having another script feed "send_nsca" with that data as it arrives on the named pipe.

Latency times have dropped from 600-700 seconds to 0.2 seconds on the worst server and from 45-55 seconds to 0.06 seconds on the second-worst. That's more like it!

Below are a few scripts w/ notes as to what each one is. Thanks to everyone who offered help.

~trask
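The following is not trask's script, only a minimal sketch of the named-pipe idea under stated assumptions. The FIFO path, the output file, the submit_result function and the sample results are all hypothetical. SENDER is set to cat so the sketch runs without NSCA installed; on a real slave it would be the send_nsca command line. The demo runs a single reader pass and holds one write descriptor open so that both results arrive as one stream; a long-running setup would presumably wrap the reader in a loop.

```shell
#!/bin/sh
# Hypothetical names and paths, for illustration only.
WORKDIR=$(mktemp -d)
FIFO="$WORKDIR/nsca.fifo"
OUT="$WORKDIR/sent.log"
SENDER="cat"    # stand-in for: /opt/nagios/bin/send_nsca <host> -c <config>

mkfifo "$FIFO"

# Writer side: what the submit script would do instead of calling
# send_nsca itself. It only writes one line to the FIFO and returns.
submit_result() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" > "$FIFO"
}

# Reader side: one background process feeds the sender from the FIFO
# until every writer has closed it.
$SENDER < "$FIFO" >> "$OUT" &

# Hold a write end open so the reader does not see end-of-file
# between the individual submissions below.
exec 3> "$FIFO"

submit_result web01 HTTP 0 "HTTP OK"
submit_result web01 PING 2 "PING CRITICAL"

# Closing the last write end lets the reader finish.
exec 3>&-
wait

SENT=$(cat "$OUT")
rm -rf "$WORKDIR"
printf '%s\n' "$SENT"
```

The point of the design is that the submit step no longer waits on a network round trip: it returns as soon as its line is in the pipe, and the slow part happens in the separate reader process.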