I never did get any feedback on how to address a reverse-ssh slave
server dropping its tunnel.

The remote site can pass information back to the master server, but the
master server cannot initiate a connection to the slave server.

 

The slave server doesn't really have a way to know that the reverse
tunnel is down, but so far a restart of the remote server's
opsview-slave service seems to resolve it.

Since the slave server can poll the master server even when the master
can't connect to the slave, I am thinking about:

-          generating a command file the slave server would look for
and act on

-          maybe creating an event handler for the check-slave service
that would write a restart-opsview-slave command to a special file

-          having the slave server check for the command file and take
the actions defined
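The slave-side check could be as simple as a cron job. A minimal sketch,
assuming the slave has already fetched the command file from the master
over its working outbound connection (e.g. via scp); the file path,
command name, and restart command are all made-up placeholders:

```shell
#!/bin/sh
# Hypothetical slave-side poller, run from cron every few minutes.
# CMD_FILE and RESTART_CMD are assumptions -- adapt to your install.
CMD_FILE="${CMD_FILE:-/usr/local/nagios/var/slave_command}"
RESTART_CMD="${RESTART_CMD:-/etc/init.d/opsview-slave restart}"

poll_command_file() {
    [ -f "$CMD_FILE" ] || return 0
    cmd=$(head -n 1 "$CMD_FILE")
    rm -f "$CMD_FILE"      # consume the command so it fires only once
    case "$cmd" in
        restart-opsview-slave) $RESTART_CMD ;;
        *) logger -t slave-poller "ignoring unknown command: $cmd" ;;
    esac
}

poll_command_file
```

Deleting the file before acting keeps a failed restart from looping
forever; the master would have to write the command again to retry.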

 

Has anyone else out there had any issues with remote sites dropping the
reverse ssh connection?

Did anyone come up with creative ways to address this sort of condition?

I don't think I've ever created an event handler before, but based on
what I've heard that sounds like the way to go.

 

James Whittington

VC3, Inc.

 

From: James Whittington 
Sent: Tuesday, May 05, 2009 8:40 AM
To: 'Opsview Users'
Subject: Periodic issue with reverse ssh dropping on Opsview

 

I am trying to chase down a couple of periodic issues with running the
reverse-ssh master/slave setup of Opsview.

I'm having some time drift issues running Ubuntu under VMware ESX, but
running ntp with a good set of time servers has taken care of the issue
for the most part.
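For reference, the ntp side of that fix is small. A sketch of an
/etc/ntp.conf fragment -- the pool hostnames are just examples, and
"tinker panic 0" (VMware's usual recommendation for guests) keeps ntpd
from exiting when the hypervisor hands the guest a large time jump:

```
# /etc/ntp.conf fragment (sketch -- server names are examples)
tinker panic 0               # don't bail out on large offsets in a VM
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst
```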

 

The main issue I still haven't figured out is that some sites
periodically drop the reverse tunnel, and autossh doesn't seem to be
able to reestablish it.

I've had multiple times now where I will get a slave down notification
from the master server yet checks from the remote server are still able
to pass through.

Connections from the master to slave however get a connection refused,
so it's like the reverse tunnel can't be reestablished.

On the slave server you will see autossh trying repeatedly to keep ssh
alive, but the connection exits with a 255 status.
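Exit status 255 is ssh's generic fatal-error code, and "connection
refused" on the master side suggests the remote forwarded port never
gets rebound even though autossh keeps relaunching ssh. Two ssh options
can help ssh notice and report that situation -- a sketch, not
necessarily the stock opsview-slave invocation; the forwarded port,
user, and hostname below are placeholders:

```
# Sketch only: port, user, and host are placeholders.
# ServerAliveInterval/CountMax make ssh itself notice a dead peer, and
# ExitOnForwardFailure makes ssh exit (so autossh retries) instead of
# running on without the remote forward.
autossh -M 0 -N \
    -o ServerAliveInterval=30 \
    -o ServerAliveCountMax=3 \
    -o ExitOnForwardFailure=yes \
    -R 5669:localhost:5669 nagios@master.example.com
```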

 

Here is a log extract of the autossh error condition; at the end is an
opsview-slave restart, which fixes the problem.

 

May  4 18:05:47 nms-site-A-s01 autossh[5935]: starting ssh (count 256)

May  4 18:05:47 nms-site-A-s01 autossh[5935]: ssh child pid is 24945

May  4 18:13:34 nms-site-A-s01 autossh[5935]: ssh exited with error
status 255; restarting ssh

May  4 18:13:34 nms-site-A-s01 autossh[5935]: starting ssh (count 257)

May  4 18:13:34 nms-site-A-s01 autossh[5935]: ssh child pid is 26979

May  4 18:15:14 nms-site-A-s01 autossh[5935]: ssh exited with error
status 255; restarting ssh

May  4 18:15:14 nms-site-A-s01 autossh[5935]: starting ssh (count 258)

May  4 18:15:14 nms-site-A-s01 autossh[5935]: ssh child pid is 27597

May  4 21:50:35 nms-site-A-s01 autossh[5935]: received signal to exit
(15)

May  4 21:50:40 nms-site-A-s01 autossh[26862]: port set to 0, monitoring
disabled

May  4 21:50:40 nms-site-A-s01 autossh[26863]: starting ssh (count 1)

May  4 21:50:40 nms-site-A-s01 autossh[26863]: ssh child pid is 26864

 

I would like to fix or work around this problem.

Issues are:

-          when the condition occurs, the master knows of the problem
but can't send a restart command to the slave

-          the slave might not know anything is wrong, so it takes no
action to fix it

-          if the slave could detect the error condition, maybe an event
handler could restart the opsview-slave service?
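Since the master does detect the condition, its event handler for the
slave-alive check could write the command file the slave polls for. A
sketch assuming a Nagios-style event handler that receives
$SERVICESTATE$ and $SERVICESTATETYPE$ as its first two arguments; the
command directory and per-site filename are made-up:

```shell
#!/bin/sh
# Hypothetical master-side event handler for the slave-alive service.
# Expects: $1 = service state, $2 = state type (SOFT/HARD).
# CMD_DIR is an assumed path the slave fetches its command file from.
CMD_DIR="${CMD_DIR:-/usr/local/nagios/var/slave_commands}"

handle_state() {
    state="$1"; statetype="$2"
    # Only act on a confirmed (HARD) CRITICAL so soft/flapping states
    # don't trigger spurious restarts.
    if [ "$state" = "CRITICAL" ] && [ "$statetype" = "HARD" ]; then
        mkdir -p "$CMD_DIR"
        echo restart-opsview-slave > "$CMD_DIR/site-A"
    fi
}

handle_state "$1" "$2"
```

Gating on the HARD state means the restart command only appears after
the check has failed its full retry count, which also keeps this from
firing during brief blips.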

 

Anyway, any suggestions would be appreciated; I would like the
monitoring system to be somewhat self-healing unless true downtime is
occurring.

This is a case where 24/7 engineers would get paged in the middle of the
night for something that is not true downtime of a site.

 

James Whittington

VC3, Inc.

 

 

_______________________________________________
Opsview-users mailing list
[email protected]
http://lists.opsview.org/listinfo/opsview-users
