Steven,

Matterhorn doesn't do anything with SSH, so unless the machines are under a really high load, SSH should still respond. Can you check whether they are? How long does it take for a machine to become inaccessible, and is it fairly repeatable? Can you stay SSH'ed into the machine and just watch top to see if the load average jumps up?
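
If keeping a terminal open on top is not practical, a small logger could record the same numbers. What follows is only a sketch and not anything that ships with Matterhorn: the function names, the log path and the 10-second interval are all made up for illustration. It relies on Python's os.getloadavg(), which is available on Unix-like systems.

```python
import os
import time
from datetime import datetime


def format_load_line(now, load):
    """Return one log line: timestamp plus the 1/5/15-minute load averages."""
    one, five, fifteen = load
    stamp = now.isoformat(timespec="seconds")
    return f"{stamp} load1={one:.2f} load5={five:.2f} load15={fifteen:.2f}"


def log_load(path, interval_s=10.0, samples=None):
    """Append a load-average line to `path` every `interval_s` seconds.

    samples=None keeps going until interrupted; an integer stops after
    that many lines have been written.
    """
    count = 0
    while samples is None or count < samples:
        line = format_load_line(datetime.now(), os.getloadavg())
        with open(path, "a") as f:
            f.write(line + "\n")
            f.flush()
            # Force the line to disk so the last samples survive a hang.
            os.fsync(f.fileno())
        count += 1
        if samples is None or count < samples:
            time.sleep(interval_s)
```

Starting log_load("/tmp/loadavg.log") (a hypothetical path) on a capture agent before it drops off should leave the last few load averages on disk, where they can be read after the restart.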
Unless MH is causing a high load, I think this is unrelated to MH. (Ubuntu 10.10?)

Chris

On Fri, 30 Sep 2011 03:40:03 +0000
Steven M Lichti <[email protected]> wrote:

> Chris,
>
> I'm having a problem sort of like this. My capture agents are
> dropping off the air, and while they are marked as offline, they are
> still inaccessible. I can ping them, but not ssh to them. I'm at a
> complete loss as to why these machines stop responding. I've taken to
> restarting them a couple of times per morning to make sure they're
> alright, and that has seemed to help a bit.
>
> I've also checked the system log files, but haven't found anything
> useful…
>
> --Steven.
>
> --
> Steven Lichti
> Academic Technologies
> Northwestern University
> [email protected]
> (847) 467-7805
>
> From: Rubén Pérez <[email protected]>
> Reply-To: Matterhorn Users <[email protected]>
> Date: Fri, 30 Sep 2011 01:38:53 +0200
> To: Matterhorn Users <[email protected]>
> Subject: Re: [Matterhorn-users] Heartburn
>
> Hi Chris,
>
> We do have the same problem around here and it has been driving us
> crazy in our new pilot preliminary test. Can you elaborate on what
> the "heartbeat" is? I understand it is some kind of "keep-alive" to
> let the system know the machine is operative. What is the method you
> used to disable it?
>
> Thanks for your answers.
>
> Best regards
> Rubenciño
>
> 2011/9/29 Christopher Brooks <[email protected]>
>
> Hi,
>
> Our machines constantly get marked as offline. Seems like under load
> the heartbeat isn't getting through (for whatever reason). We're
> disabling the heartbeat on our local system to make up for this.
>
> Anyone else having these issues on a distributed deployment?
>
> Looking for people who might also be running into this, to help test
> potential patches for 1.2.1.
>
> Chris
>
> --
> Christopher Brooks, BSc, MSc
> ARIES Laboratory, University of Saskatchewan
>
> Web: http://www.cs.usask.ca/~cab938
> Phone: 1.306.966.1442
> Mail: Advanced Research in Intelligent Educational Systems Laboratory
> Department of Computer Science
> University of Saskatchewan
> 176 Thorvaldson Building
> 110 Science Place
> Saskatoon, SK
> S7N 5C9
> _______________________________________________
> Matterhorn-users mailing list
> [email protected]
> http://lists.opencastproject.org/mailman/listinfo/matterhorn-users
