On Wed, 2005-12-21 at 12:08 +0100, Rob Hassing wrote:
> Hello all,

Hi Rob,
> I'm trying to setup a distributed monitoring system.
> At the start all looked fine too me, but now I'm having some problems on
> not receiving all passive checks from other hosts.

Distributed monitoring is waaay cool. :)

The main thing that can become an issue is that the CGIs that come with
the web interface don't scale very well. Here we ended up storing status
in MySQL through a NEB module, and we are now testing GroundWork's
framework, which appears to fit our needs. The only piece we developed
in-house is the config file generator, which keeps all the configuration
in a database and sets up all the distributed agents properly.

> The machine is a Intel(R) Xeon(TM) CPU 2.40GHz system with 512 MB RAM.
> The process info tells me this:
> Time Frame           Checks Completed
> <= 1 minute:         51 (16.6%)
> <= 5 minutes:        221 (71.8%)
> <= 15 minutes:       255 (82.8%)
> <= 1 hour:           260 (84.4%)
> Since program start: 261 (84.7%)

Here is what we have:

<= 1 minute:          2383 (21.3%)
<= 5 minutes:         6138 (54.7%)
<= 15 minutes:        8321 (74.2%)
<= 1 hour:           10138 (90.4%)
Since program start: 10711 (95.5%)

> So it's receiving less then 85% of all checks :(
> There will be more passive checks to be send to this nagios server.
> Do we need other hardware ?
> Where do I need to look to solve this problem ?

To avoid stale services you need to set up freshness_threshold properly
for your services. Here is a hint: tuning freshness_threshold is a little
awkward, because the central server has to wait for the packet carrying
the check result to arrive, and the fewer services you set it on
explicitly (letting Nagios calculate it for the rest), the better. But it
is the only thing you need to configure to keep service results from
going stale. We decided to make stale results show up in an UNKNOWN
state, because staleness can simply be a traffic problem along the
packet's path, caused by backup/restore routines, high traffic load, or
other things like that. (There is a small config sketch further down.)

> The machines sending the passive check info are not too busy doing this,
> the checks are seperated over three different servers.

Here we have 11 distributed servers, each with around 2k services
configured, sending their check results via send_nsca (see the P.S. at
the end for a sketch of how that is wired up). They are all SPARC
servers, sending to a SuSE 9.3 box on commodity hardware: a P4-HT Linux
machine with 2 GB of RAM and some SATA disks.

> One example...
> This is /var/log/nagios/nagios.log:
> [1135162484] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;cat29-w11-backup;PING;0;PING OK - Packet loss = 0%, RTA = 0.89 ms
> [1135162491] SERVICE ALERT: cat29-w11-backup;PING;OK;HARD;3;PING OK - Packet loss = 0%, RTA = 0.89 ms
> [1135162491] SERVICE NOTIFICATION: nagios;cat29-w11-backup;PING;OK;notify-by-epager;PING OK - Packet loss = 0%, RTA = 0.89 ms
> [1135162491] SERVICE NOTIFICATION: nagios;cat29-w11-backup;PING;OK;notify-by-email;PING OK - Packet loss = 0%, RTA = 0.89 ms
> [1135162941] Warning: The results of service 'PING' on host 'cat29-w11-backup' are stale by 32 seconds (threshold=425 seconds). I'm forcing an immediate check of the service.
> [1135162951] SERVICE ALERT: cat29-w11-backup;PING;CRITICAL;SOFT;1;CRITICAL: Service results are stale!
>
> It looks like its stale again too fast ?

Well, those last two lines don't indicate two stale services. The first
one (the warning that mentions the threshold) says that the central
Nagios allowed 425 seconds for the passive result, it was still 32
seconds overdue, and so it forced an immediate active check. The last
line is that forced active check being processed by the central Nagios,
and that is what shows up as a CRITICAL alert in the web interface.
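To make the freshness side concrete, here is a minimal sketch of what a
passive-only service with freshness checking can look like on the central
server. It is only a sketch: the threshold value, the service-is-stale
command name and the script path are illustrative, not our exact config.

define service{
        use                     generic-service ; assumes a template with the usual required directives
        host_name               cat29-w11-backup
        service_description     PING
        active_checks_enabled   0       ; results arrive passively from the distributed servers
        passive_checks_enabled  1
        check_freshness         1
        freshness_threshold     425     ; seconds; leave it out and Nagios calculates one for you
        check_command           service-is-stale
        }

; the command Nagios runs when a result goes stale -- we prefer it to
; exit 3 (UNKNOWN) rather than 2 (CRITICAL)
define command{
        command_name    service-is-stale
        command_line    /usr/local/nagios/libexec/eventhandlers/stale_service.sh
        }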
The "active check" that gets processed there is stale_service.sh, or
whatever command line you place in the service's check_command. (It can
also be the real check, in which case the central Nagios will be actively
re-checking stale results itself, but that will cause some load trouble. :)

HTH && Regards,

-- 
Marcel Mitsuto Fucatu Sugano <[EMAIL PROTECTED]>
Universo Online S.A. -- http://www.uol.com.br
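P.S. Since you asked where to look, here is a rough sketch of the
submission path on each distributed server. Again just a sketch under
assumptions: the hostname, paths and the submit_check_result wrapper name
are illustrative (the Nagios documentation ships a very similar example),
and the macro names are the Nagios 1.x ones.

# in nagios.cfg on each distributed server: hand every service result
# to the ocsp command
obsess_over_services=1
ocsp_command=submit_check_result

# in the object configuration: submit_check_result is a small wrapper
# script that maps the state name to a return code (OK=0, WARNING=1,
# CRITICAL=2, UNKNOWN=3) and pipes a "host<TAB>service<TAB>code<TAB>output"
# line into:
#   send_nsca -H <central-server> -c /usr/local/nagios/etc/send_nsca.cfg
define command{
        command_name    submit_check_result
        command_line    /usr/local/nagios/libexec/eventhandlers/submit_check_result $HOSTNAME$ '$SERVICEDESC$' $SERVICESTATE$ '$OUTPUT$'
        }

On the central box the nsca daemon then writes the results into the
external command file, which is where the PROCESS_SERVICE_CHECK_RESULT
lines in your log come from.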