On 2009-05-22 10:25, Ton Voon wrote:

> Can you model this nagios plugin on check_opsview_slave_cluster? Maybe
> call it check_opsview_slave_communication?

> For extra bonus points, you could create this service automatically on
> all slaves when the reverse_ssh flag is set.

OK, time to fess up - I couldn't write a perl script if my life
depended on it. I can just about read the work of others, but the
Llama book continues to gather dust on my bookshelf.

However, I have written an extraordinarily simple bash script which
after a fair bit of testing seems to function just fine. Unfortunately
it resides outside of Opsview but if anyone wishes to take the logic
and run with it then please feel free.

Anyhow, if you're interested here's what we now do to fix this issue...

1. Edit the retrieve_opsview_info script on the slave and add this
line just under use strict;

system ("/bin/date +%s > /usr/local/nagios/tmp/slave-check-time.txt");

(See, I can add bash into perl just fine!)

2. Add the following script which cron runs every minute:

#!/bin/bash
# Restart opsview-slave if the master has not refreshed the timestamp recently.
NOW=$(/bin/date +%s)
SLAVECHECK=$(/bin/cat /usr/local/nagios/tmp/slave-check-time.txt)
DIFFERENCE=$((NOW - SLAVECHECK))
# More than 330 seconds stale: restart, mail the ops team, reset the
# timestamp, and print the values for troubleshooting.
[ "$DIFFERENCE" -gt 330 ] && /etc/init.d/opsview-slave restart \
  && /bin/echo "Slave connection down! Re-starting opsview-slave service on $(hostname)..." \
     | /bin/mail -s "opsview slave restart: $(hostname)" [email protected] \
  && /bin/date +%s > /usr/local/nagios/tmp/slave-check-time.txt \
  && echo "$NOW $SLAVECHECK $DIFFERENCE"

A frighteningly simple one-liner that I'm sure could be ripped to
shreds / re-written much better but hey - it works.
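
For anyone who does want to take the logic and run with it, here is a
minimal sketch of the same check split into functions. It is only one
possible layout, not a tested replacement: STAMP_FILE and MAX_AGE mirror the
path and the 330 seconds used above, RESTART_CMD is a hypothetical override
added so the logic can be dry-run, and the mail notification is left out to
keep it short. Unlike the one-liner, it also copes with a missing or
non-numeric timestamp file.

```shell
#!/bin/bash
# Sketch only. STAMP_FILE and MAX_AGE mirror the values in the script above;
# RESTART_CMD is a hypothetical override so the logic can be dry-run.
STAMP_FILE="${STAMP_FILE:-/usr/local/nagios/tmp/slave-check-time.txt}"
MAX_AGE="${MAX_AGE:-330}"
RESTART_CMD="${RESTART_CMD:-/etc/init.d/opsview-slave restart}"

# Print the age of the timestamp in seconds. Fails if the file is missing
# or does not contain a plain integer.
stamp_age() {
    local now last
    now=$(date +%s)
    [ -r "$STAMP_FILE" ] || return 1
    last=$(cat "$STAMP_FILE")
    case "$last" in
        ''|*[!0-9]*) return 1 ;;
    esac
    echo $(( now - last ))
}

# Restart the slave if the timestamp is stale, then reset the timestamp so
# the next cron run does not restart again straight away.
watchdog() {
    local age
    age=$(stamp_age) || { echo "cannot read $STAMP_FILE" >&2; return 3; }
    if [ "$age" -gt "$MAX_AGE" ]; then
        # RESTART_CMD is left unquoted on purpose so it splits into
        # a command and its arguments.
        $RESTART_CMD || return 2
        date +%s > "$STAMP_FILE"
        echo "restarted after ${age}s"
    else
        echo "ok, last check ${age}s ago"
    fi
}
```

To use it from cron, add a final line calling watchdog. For a dry run, set
RESTART_CMD=true and point STAMP_FILE at a scratch file first.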

So essentially...

1. The check from the master runs as normal and the timestamp on the
slave updates.

2. The script checks the time and, as less than 330 seconds have passed,
it does nothing.

3. The listening port on the master dies and the timestamp on the
slave no longer updates.

4. The script checks the time and more than 330 seconds have passed, so it...

a. re-starts the opsview-slave process thus re-establishing the tunnel.

b. informs the ops team that this has happened.

c. updates the timestamp so it doesn't attempt another re-start before
the master can update the timestamp itself (the script runs every
minute whereas the master check runs every 5 minutes).

d. echoes the variables to stdout so we can check the user's mailbox
for the values if we wish (more for troubleshooting than anything
else).
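
To see why step 4c matters, the timing can be walked through with a small
simulation. This is an illustrative sketch, not part of the script above:
simulate_restarts is a made-up helper, and it assumes the master stops
updating the timestamp at t=0, never comes back, and that cron fires exactly
on each 60-second boundary after that.

```shell
#!/bin/bash
# Print the tick times (seconds after the last master update) at which a
# restart would fire.
# Arguments: threshold in seconds, seconds to simulate, and whether the
# script resets the timestamp after restarting (yes/no).
simulate_restarts() {
    local threshold=$1 duration=$2 reset=$3
    local last=0 now
    for (( now = 60; now <= duration; now += 60 )); do
        if (( now - last > threshold )); then
            echo "$now"
            if [ "$reset" = yes ]; then
                last=$now
            fi
        fi
    done
}

simulate_restarts 330 900 yes   # prints 360 and 720
```

With the reset, a restart fires at 360 seconds and not again until 720
seconds, which leaves room for the 5-minute master check to refresh the
timestamp in between. Calling it with "no" instead prints every minute from
360 onwards, i.e. a restart on every cron run.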

And that's about it.

We've been testing it today by replicating the situation and it seems
to work just fine.
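
One way to exercise the script without touching the tunnel at all would be
to backdate the timestamp file by hand and wait for the next cron run. This
is a sketch of that idea rather than what we actually did; backdate_stamp is
a hypothetical helper name.

```shell
#!/bin/bash
# Write a timestamp that is AGE seconds in the past to FILE, so the next
# run of the cron script sees a stale check.
backdate_stamp() {
    local file=$1 age=$2
    echo $(( $(date +%s) - age )) > "$file"
}
```

For example, backdate_stamp /usr/local/nagios/tmp/slave-check-time.txt 400
should push the difference past the 330-second limit on the next run.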

I hope this is of interest / help to anyone else who may experience this issue.
_______________________________________________
Opsview-users mailing list
[email protected]
http://lists.opsview.org/listinfo/opsview-users
