I never did get any feedback on how to address a reverse ssh slave server dropping its reverse tunnel.
The remote site is able to pass info back to the master server, but the master server cannot initiate a connection with the slave server. The slave server doesn't really have a way to know that the reverse tunnel is down, but so far a restart of the opsview-slave service on the remote server seems to resolve it. Since the slave server can poll the master server even when the master can't connect to the slave, I am thinking about:

- generating a command file on the master that the slave server would look for
- maybe creating an event handler for the check-slave service that would write a restart-opsview-slave command to that special file
- having the slave server check for the command file and take the actions it defines

Has anyone else out there had issues with remote sites dropping the reverse ssh connection? Did anyone come up with creative ways to address this sort of condition? I don't think I've ever created an event handler before, but based on what I've heard, that sounds like the way to go. A rough sketch of both pieces is below.
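To make it concrete, here is a minimal sketch of the two halves, assuming a plain Nagios-style event handler on the master and a cron job on the slave. The command directory /usr/local/nagios/var/slave_commands, the master address, the script names, and the init script path are all illustrative assumptions rather than Opsview's own tooling, so adjust for your install.

First, an event handler attached to the master's slave-status check. Nagios runs it with the service state macros, and it only drops a command file once the failure is confirmed (HARD CRITICAL), not on soft retries:

#!/bin/sh
# handle-slave-tunnel -- master-side event handler (sketch).
# Called as: handle-slave-tunnel $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $HOSTNAME$
STATE="$1"; STATETYPE="$2"; SLAVE="$4"
CMDDIR=/usr/local/nagios/var/slave_commands    # assumed location on the master

# Only act on a confirmed failure, not on soft retries.
if [ "$STATE" = "CRITICAL" ] && [ "$STATETYPE" = "HARD" ]; then
    mkdir -p "$CMDDIR"
    echo "restart_opsview_slave" > "$CMDDIR/$SLAVE.cmd"
fi
exit 0

Second, the slave-side poller, run from cron every few minutes. It rides the slave-to-master ssh access that still works when the reverse tunnel is down, and it assumes the slave's hostname matches the host name the master monitors it under:

#!/bin/sh
# check_master_commands -- slave-side poller (sketch), run e.g. from cron:
# */5 * * * * root /usr/local/bin/check_master_commands
MASTER=nagios@master.example.com
CMDFILE="/usr/local/nagios/var/slave_commands/$(hostname).cmd"

# -n keeps ssh away from cron's stdin; a missing file just yields an empty string.
CMD=$(ssh -n -o ConnectTimeout=10 "$MASTER" "cat '$CMDFILE' 2>/dev/null")

case "$CMD" in
restart_opsview_slave)
    # Clear the command first so a slow restart can't trigger a loop.
    ssh -n "$MASTER" "rm -f '$CMDFILE'"
    /etc/init.d/opsview-slave restart
    ;;
esac

Keeping the protocol to a single fixed command string means the slave never eval's whatever happens to be in the file, which seems safer than a general-purpose remote command channel.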
James Whittington
VC3, Inc.

From: James Whittington
Sent: Tuesday, May 05, 2009 8:40 AM
To: 'Opsview Users'
Subject: Periodic issue with reverse ssh dropping on Opsview

I am trying to chase down a couple of periodic issues with running the reverse ssh master/slave setup of Opsview. I'm having some time drift issues running Ubuntu under VMware ESX, but running ntp with a good set of time servers has taken care of that for the most part.

The main issue I still haven't figured out is that some sites periodically drop the reverse tunnel, and autossh doesn't seem to be able to reestablish it. I've had multiple times now where I get a slave-down notification from the master server, yet checks from the remote server are still able to pass through. Connections from the master to the slave, however, get a connection refused, so it's as if the reverse tunnel can't be reestablished. On the slave server you will see autossh trying multiple times to keep ssh alive, but the connection exits with a 255 status.

Here is a log extract of the autossh error condition; at the end is an opsview-slave restart, which fixes the problem.

May 4 18:05:47 nms-site-A-s01 autossh[5935]: starting ssh (count 256)
May 4 18:05:47 nms-site-A-s01 autossh[5935]: ssh child pid is 24945
May 4 18:13:34 nms-site-A-s01 autossh[5935]: ssh exited with error status 255; restarting ssh
May 4 18:13:34 nms-site-A-s01 autossh[5935]: starting ssh (count 257)
May 4 18:13:34 nms-site-A-s01 autossh[5935]: ssh child pid is 26979
May 4 18:15:14 nms-site-A-s01 autossh[5935]: ssh exited with error status 255; restarting ssh
May 4 18:15:14 nms-site-A-s01 autossh[5935]: starting ssh (count 258)
May 4 18:15:14 nms-site-A-s01 autossh[5935]: ssh child pid is 27597
May 4 21:50:35 nms-site-A-s01 autossh[5935]: received signal to exit (15)
May 4 21:50:40 nms-site-A-s01 autossh[26862]: port set to 0, monitoring disabled
May 4 21:50:40 nms-site-A-s01 autossh[26863]: starting ssh (count 1)
May 4 21:50:40 nms-site-A-s01 autossh[26863]: ssh child pid is 26864

I would like to fix or work around this problem. The issues are:

- when the condition occurs, the master knows of the problem but can't send a restart command to the slave
- the slave might not know anything is wrong, so it takes no action to fix it
- if the slave could detect the error condition itself, maybe an event handler could restart the opsview-slave service? (see the self-check sketch below)

Anyway, any suggestions would be appreciated. I would like the monitoring system to be somewhat self-healing unless true downtime is occurring. This is a case where 24/7 engineers would get paged in the middle of the night for something that is not true downtime of a site.

James Whittington
VC3, Inc.
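Coming back to the detection question from the original message: one way for the slave to notice the dead forward on its own is to ask the master, over the slave's still-working ssh access, whether the reverse-forwarded port on the master's loopback still accepts connections. A minimal cron-driven sketch; the port number (22400), master address, and init script path are assumptions for illustration, and it needs netcat installed on the master:

#!/bin/sh
# Slave-side tunnel self-check (sketch), run from cron. If the master can no
# longer connect to the local end of the -R forward, restart opsview-slave so
# autossh rebuilds the tunnel from scratch.
MASTER=nagios@master.example.com
TUNNEL_PORT=22400    # assumed master-side port of the reverse forward

# Note: if slave-to-master ssh is down too, this also restarts opsview-slave,
# which is harmless since the tunnel is unusable in that case anyway.
if ! ssh -n -o ConnectTimeout=10 "$MASTER" "nc -z -w 5 localhost $TUNNEL_PORT"; then
    logger -t tunnel-check "reverse tunnel port $TUNNEL_PORT is dead; restarting opsview-slave"
    /etc/init.d/opsview-slave restart
fi

Pairing this with the command-file poller above gives the slave two independent ways to recover without anyone getting paged.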
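One hedged guess at the underlying failure: if the old forward port is still held open on the master when ssh reconnects, the new -R request can fail while the ssh session itself comes up, leaving a live but tunnel-less connection. OpenSSH's ExitOnForwardFailure option makes ssh exit whenever the forward can't be established, so autossh keeps retrying until the port frees up, and the ServerAlive options tear down half-dead sessions faster. These are stock OpenSSH client options; the Host alias is an assumption. Something like this in the tunnel user's ~/.ssh/config on the slave may be worth a try:

Host opsview-master
    ExitOnForwardFailure yes
    ServerAliveInterval 30
    ServerAliveCountMax 3

Whether this matches the 255 exits in the log above is only a guess, but it is cheap to test alongside the restart workaround.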
