Steven,

Matterhorn doesn't do anything with SSH, so unless the machines are
under a really high load, SSH should respond.  Can you check whether
the machines are under high load?  How long does it take for a machine
to become inaccessible, and is it fairly repeatable?  Can you stay SSH'ed
into the machine and just watch top to see if the load average jumps up?
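
If the SSH session itself drops before you can see anything in top, it
may help to record the load average to a file instead. The following is
only a rough sketch in Python, not anything that ships with Matterhorn:
the function names, the log file name, the 10 second interval and the
threshold of 4.0 are all assumptions to adjust for your machines.

```python
import os
import time
from datetime import datetime


def format_sample(now, load, threshold):
    """Build one log line; mark it HIGH when the 1-minute load exceeds threshold."""
    one, five, fifteen = load
    flag = " HIGH" if one > threshold else ""
    return f"{now:%Y-%m-%d %H:%M:%S} load {one:.2f} {five:.2f} {fifteen:.2f}{flag}"


def log_load(path, interval=10.0, threshold=4.0, samples=None):
    """Append a load-average sample to `path` every `interval` seconds.

    With samples=None the loop runs until it is interrupted.
    All default values here are illustrative assumptions.
    """
    count = 0
    while samples is None or count < samples:
        # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
        line = format_sample(datetime.now(), os.getloadavg(), threshold)
        with open(path, "a") as log:
            log.write(line + "\n")
            log.flush()
            # Push the line to disk so it survives a hang or a forced restart.
            os.fsync(log.fileno())
        count += 1
        if samples is None or count < samples:
            time.sleep(interval)
```

Calling log_load("loadavg.log") from a session that outlives your SSH
login (screen or nohup, for example) leaves a record you can read after
the next restart, to see whether the load climbed before the machine
stopped responding.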

Unless MH is causing a high load, I think this is unrelated to MH.

(Ubuntu 10.10?)

Chris

On Fri, 30 Sep 2011 03:40:03 +0000
Steven M Lichti <[email protected]> wrote:

> Chris,
> 
> I'm having a problem sort of like this. My capture agents are
> dropping off the air, and while they are marked as offline, they are
> still inaccessible. I can ping them, but not ssh to them. I'm at a
> complete loss as to why these machines stop responding. I've taken to
> restarting them a couple of times per morning to make sure they're
> alright, and that has seemed to help a bit.
> 
> I've also checked the system log files, but haven't found anything
> useful…
> 
> --Steven.
> 
> --
> Steven Lichti
> Academic Technologies
> Northwestern University
> [email protected]
> (847) 467-7805
> 
> 
> 
> From: Rubén Pérez <[email protected]>
> Reply-To: Matterhorn Users <[email protected]>
> Date: Fri, 30 Sep 2011 01:38:53 +0200
> To: Matterhorn Users <[email protected]>
> Subject: Re: [Matterhorn-users] Heartburn
> 
> Hi Chris,
> 
> We do have the same problem around here and it has been driving us
> crazy in our new pilot preliminary test. Can you elaborate on what
> the "heartbeat" is? I understand it is some kind of "keep-alive" to
> let the system know the machine is operative. What is the method you
> used to disable it?
> 
> Thanks for your answers.
> 
> Best regards
> Rubenciño
> 
> 2011/9/29 Christopher Brooks <[email protected]>
> 
> Hi,
> 
> Our machines constantly get marked as offline.  Seems like under load
> the heartbeat isn't getting through (for whatever reason).  We're
> disabling the heartbeat on our local system to make up for this.
> 
> Anyone else having these issues on a distributed deployment?
> 
> Looking for people who might also be running into this, to help test
> potential patches for 1.2.1.
> 
> Chris
> 
> --
> Christopher Brooks, BSc, MSc
> ARIES Laboratory, University of Saskatchewan
> 
> Web: http://www.cs.usask.ca/~cab938
> Phone: 1.306.966.1442
> Mail: Advanced Research in Intelligent Educational Systems Laboratory
>     Department of Computer Science
>     University of Saskatchewan
>     176 Thorvaldson Building
>     110 Science Place
>     Saskatoon, SK
>     S7N 5C9
> _______________________________________________
> Matterhorn-users mailing list
> [email protected]
> http://lists.opencastproject.org/mailman/listinfo/matterhorn-users
> 



-- 
Christopher Brooks, BSc, MSc
ARIES Laboratory, University of Saskatchewan

Web: http://www.cs.usask.ca/~cab938
Phone: 1.306.966.1442
Mail: Advanced Research in Intelligent Educational Systems Laboratory
     Department of Computer Science
     University of Saskatchewan
     176 Thorvaldson Building
     110 Science Place
     Saskatoon, SK
     S7N 5C9
