ahh, makes sense now. I typically don't run colmux for long periods, so that must be why I haven't seen that behavior before.
I'm now wondering what the negatives of setting this as the default behavior might be, as it seems like it'd be a good thing. If it does make more sense to not always set it, I could always add something like --keepalive.
-mark

On Sun, Mar 3, 2013 at 12:25 PM, Vishal Gupta <[email protected]> wrote:
> Colmux is issuing a collectl command over SSH. After collectl is invoked on
> the server/machine, there is no more communication over the SSH session. So
> effectively these ssh sessions are idle, as there is no data/message/command
> interchange between colmux and the server over the SSH channel. All the
> communication happens over the collectl port between colmux and the servers.
> So if your server is configured to disconnect idle SSH sessions after a
> certain pre-defined idle duration, the server disconnects colmux's ssh
> session to it. That results in colmux removing those servers from the output.
> Please note the disconnection was not due to collectl dying, or to the server
> or colmux client disappearing altogether because of a network glitch or a
> reboot/crash. This disconnection is purely because of the idle ssh session. We
> can avoid this ssh connection timeout by changing either ClientAliveInterval
> on the ssh daemon on the server or ServerAliveInterval on the ssh
> client. Of course one may not want to change the ssh daemon setting on all
> the corresponding servers we are trying to connect to. It would even be
> impractical to change this setting on all the servers.
>
> On the SSH client side (colmux side) this setting can also be changed in
> any of the following locations:
>
> /etc/ssh/ssh_config (please note it's ssh_config, not sshd_config)
> ~/.ssh/config
> Command line parameter
>
> Again, we may not want to change this setting for all the ssh connections
> originating from the client on which colmux is running. So it might be better
> to pass this as a command line parameter and make it configurable in some
> configuration file or via a colmux switch.
>
> Regards,
> Vishal Gupta
> http://blog.vishalgupta.com
>
>
> From: Mark Seger <[email protected]>
> Date: Sunday, 3 March 2013 16:45
> To: Vishal Gupta <[email protected]>
> Cc: Collectl Interest <[email protected]>
> Subject: Re: [Collectl-interest] colmux duplicating nodes
>
> interesting. I wasn't aware of this switch. But from the description
> it sounds like this would take care of the situation where a remote
> collectl goes away for over 5 minutes and I wasn't aware that can even
> happen. Are you saying it can and does? Does this mean collectl
> could go away for 4 minutes, time out and disconnect and this wouldn't
> help that case? OR is the network timeout value 5 minutes? Just
> trying to understand the exact mechanics of what is happening
> -mark
>
> On Sun, Mar 3, 2013 at 9:44 AM, Vishal Gupta <[email protected]> wrote:
>
> Mark,
>
> Servers disappearing from colmux output on an Exadata cluster can be solved
> by adding "-o ServerAliveInterval=300" to the colmux ssh command. This will
> ensure that a message is sent from the client (colmux) to the server
> (machines being monitored) every 300 sec over the secure encrypted channel
> (hence not spoofable) so that the ssh connections don't time out.
>
> I have tested the above by adding it to the ssh command variable. You may
> want to include that in the colmux source code itself.
>
> my $Ssh='/usr/bin/ssh -o StrictHostKeyChecking=no -o ServerAliveInterval=300 ';
> $Ssh.=" -q" unless $debug;
>
>
> Vishal
>
> From: Vishal Gupta <[email protected]>
> Date: Monday, 25 February 2013 11:49
> To: Vishal Gupta <[email protected]>, Mark Seger <[email protected]>
> Cc: Collectl Interest <[email protected]>
> Subject: Re: [Collectl-interest] colmux duplicating nodes
>
> Mark,
>
> I think my servers disappearing might be due to an SSH timeout.
>
> From: Vishal Gupta <[email protected]>
> Date: Wednesday, 24 October 2012 21:42
> To: Mark Seger <[email protected]>
> Cc: Collectl Interest <[email protected]>
> Subject: Re: [Collectl-interest] colmux duplicating nodes
>
> Mark,
>
> I don't think my servers disappearing from colmux is due to a network
> glitch. On an Exadata, all the servers are connected via an internal Cisco IP
> switch. There are also 3 dedicated infiniband switches. I have tried over
> both the Cisco IP switch and the infiniband switch with --age=5 as well as 10.
> But my servers still disappear from the output after a few hours. Is there
> anything I can do to debug this? What level of debug do you recommend for
> debugging this?
>
> Regards,
> Vishal Gupta
> Blog | LinkedIn | Twitter
>
> -----Original Message-----
> > From: Mark Seger <[email protected]>
> > Subject: Re: [Collectl-interest] colmux duplicating nodes
> > Date: 20 October 2012 12:19:08 BST
> > To: Vishal Gupta <[email protected]>
> > Cc: [email protected]
> >
> >
> > On Fri, Oct 19, 2012 at 4:16 PM, Vishal Gupta <[email protected]> wrote:
> > > I am using colmux on an Oracle Exadata Machine full rack with linux hosts
> > > (OEL 5.7). If colmux is left running for a few hours it starts showing
> > > duplicate lines for servers in the output.
> >
> > are you using the latest version [3.2.0]? I do remember seeing that in an
> > earlier version and I thought I fixed it. I'm really hoping it's not still
> > there because it can be pretty painful to track down or even reproduce. The
> > way colmux works is it asynchronously receives/stores data from each remote
> > host and at the same time fires a timer every monitoring interval. Colmux
> > then displays the latest value it's seen for each entry. Sounds simple
> > enough but it turned out the incoming data was occasionally overwriting the
> > data from the previous samples. My solution was to double-buffer the data,
> > reading from one dataset while writing to a new one. I'm just hoping I
> > don't need to dig back into it.
> >
> > > Also I noticed that some of the hosts are automatically and completely
> > > removed from the output. Is there some kind of timeout configured in
> > > colmux or collectl which might remove the server entries from the output
> > > over time?
> >
> > unfortunately the way colmux works is if it doesn't hear from a remote
> > server in x-seconds (which you can set via --age) it drops it from the list
> > and doesn't try to reconnect. as for the age, you don't want to make it too
> > long or else a server could disconnect and you'd never know it and keep
> > displaying stale data. I suppose on a glitchy network you could end up
> > having to wait a little longer. Maybe you could try upping it to 5 or 10
> > and see if that helps OR if the remote machine really did drop the link.
> >
> > you're not the first to ask about reconnecting when a host drops...
> >
> > -mark
>
>
> Regards,
> Vishal Gupta
> Blog | LinkedIn | Twitter

_______________________________________________
Collectl-interest mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/collectl-interest
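
The workaround discussed in this thread amounts to adding -o ServerAliveInterval=N to the ssh command that colmux builds, ideally only when asked for. Below is a minimal shell sketch of that idea, assuming an optional keepalive value is passed in. The function name build_ssh_cmd and its two arguments are hypothetical and used only for illustration; this is not colmux's actual code, and --keepalive was only a suggestion in the thread, not an existing switch.

```shell
#!/bin/sh
# Hypothetical helper, NOT part of colmux: print the ssh command line to use.
#   $1 = keepalive interval in seconds (empty or 0 means "no keepalive")
#   $2 = non-empty to enable debug (mirrors the Perl '-q unless $debug' logic)
build_ssh_cmd() {
    keepalive="${1:-}"
    debug="${2:-}"
    cmd="/usr/bin/ssh -o StrictHostKeyChecking=no"

    # ServerAliveInterval makes the ssh client send a message through the
    # encrypted channel after this many seconds without data from the server,
    # so an otherwise idle session is not dropped by an idle timeout.
    if [ -n "$keepalive" ] && [ "$keepalive" -gt 0 ]; then
        cmd="$cmd -o ServerAliveInterval=$keepalive"
    fi

    # Quiet mode unless debugging.
    if [ -z "$debug" ]; then
        cmd="$cmd -q"
    fi

    printf '%s\n' "$cmd"
}
```

Calling build_ssh_cmd 300 prints the same command line as the patched $Ssh variable quoted earlier, with -q appended. Making the interval an argument keeps the default behavior unchanged, which sidesteps the open question of whether a keepalive should always be set.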
