(first off, I'm moving this over to [email protected]
for development triage)

My understanding is that the heartbeat exists mostly so the UI can see
when services have gone offline, since if we fail to dispatch a job
we'll just dispatch it elsewhere.  So, assuming that's correct:

1. If we disable the heartbeat by not spawning a JobProducerHeartbeat
thread, will this stop MH from functioning correctly?

Also, the install docs put the service registry on every server (e.g.
http://opencast.jira.com/wiki/display/MH/Install+Across+Multiple+Servers+%28Trunk%29
).  This means that when one host decides to unregister another, the
change hits the same underlying DB.  So worker1 could mark worker2 as
offline just because worker1 can't reach worker2.  That seems
unfortunate, since worker1 might be the one that is actually "offline"
(while still having access to the shared DB).

2. Are the docs incorrect?  Should we be installing a service registry
stub somewhere instead? Does a stub exist?

And, just to suggest a new model for the service registry and
online/offline state: if each component registers itself every 5
mins when it is not already registered, we should get a
heartbeat-like effect.  This will be failure-prone (e.g. a host might
think it's available when for some reason it isn't), but if failure
just means a job can't be dispatched and is put back in the queue, it
isn't a big concern.  In this model, hosts are marked as offline when
they fail to accept a new job, and they can re-register themselves at
any time when they think they are ready to accept new jobs.
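
A minimal sketch of this model, in Python for brevity. Every name here
(Registry, dispatch, try_send, reregister_loop) is hypothetical and only
illustrates the idea; none of it is Matterhorn's actual API.

```python
import time

REREGISTER_INTERVAL = 5 * 60  # seconds; the "every 5 mins" from the proposal


class Registry:
    """Hypothetical stand-in for the shared service registry table."""

    def __init__(self):
        self.online = {}  # host -> bool
        self.queue = []   # jobs waiting to be dispatched

    def register(self, host):
        # Idempotent: registering an already-online host changes nothing.
        self.online[host] = True

    def dispatch(self, job, try_send):
        """Offer the job to each online host; mark hosts that refuse as offline.

        try_send(host, job) is an assumed callable that returns True when the
        host accepted the job. Returns the accepting host, or None if the job
        was put back in the queue.
        """
        for host in [h for h, up in self.online.items() if up]:
            if try_send(host, job):
                return host
            self.online[host] = False  # failed to accept a job -> offline
        self.queue.append(job)
        return None


def reregister_loop(registry, host, is_ready, sleep=time.sleep, rounds=None):
    """Runs on each host: re-register whenever it is ready but not registered."""
    n = 0
    while rounds is None or n < rounds:
        if is_ready() and not registry.online.get(host, False):
            registry.register(host)
        sleep(REREGISTER_INTERVAL)
        n += 1
```

Note that no host ever marks a peer offline based on its own connectivity
check; the only offline transition is a failed dispatch.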

The only issue that comes up then is a host going offline while it's
processing a job.  It will still show up as online (e.g. because it
can't write to the DB), and since it still appears to be working, no
one will try to push it a new job (so it will appear "stuck" instead of
going offline).  But judicious use of job timeouts might be able to
compensate for this.
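
One way the timeout idea could look, again as a hedged Python sketch
with made-up names (requeue_timed_out, the shape of the running-jobs
map, and the timeout value are all assumptions, not existing code):

```python
def requeue_timed_out(running, queue, now, timeout):
    """Put jobs that have run longer than `timeout` back in the queue.

    running: dict mapping job id -> (host, start time in seconds)
    queue:   list of job ids waiting to be dispatched
    Returns the set of hosts whose jobs timed out, i.e. the hosts that
    look "stuck" and could be marked offline.
    """
    stuck_hosts = set()
    for job_id, (host, started_at) in list(running.items()):
        if now - started_at > timeout:
            del running[job_id]      # the job is no longer considered running
            queue.append(job_id)     # so it can be dispatched elsewhere
            stuck_hosts.add(host)
    return stuck_hosts
```

The timeout would need to be longer than the slowest legitimate job,
otherwise healthy hosts get flagged as stuck.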

3. Is this a reasonable resolution?

Chris

On Fri, 30 Sep 2011 14:10:28 +0200
Tobias Wunden <[email protected]> wrote:

> Hi Chris,
> 
> we were able to reproduce the issue locally and are happy working
> with you to resolve it.
> 
> Thanks,
> Tobias
> 
> On 29.09.2011, at 22:47, Christopher Brooks <[email protected]>
> wrote:
> 
> > Hi,
> > 
> > Our machines constantly get marked as offline.  Seems like under
> > load the heartbeat isn't getting through (for whatever reason).
> > We're disabling the heartbeat on our local system to make up for
> > this.
> > 
> > Anyone else having these issues on a distributed deployment?
> > 
> > Looking for people who might also be running into this, to help test
> > potential patches for 1.2.1.
> > 
> > Chris
> > 
> > -- 
> > Christopher Brooks, BSc, MSc
> > ARIES Laboratory, University of Saskatchewan
> > 
> > Web: http://www.cs.usask.ca/~cab938
> > Phone: 1.306.966.1442
> > Mail: Advanced Research in Intelligent Educational Systems
> > Laboratory Department of Computer Science
> >     University of Saskatchewan
> >     176 Thorvaldson Building
> >     110 Science Place
> >     Saskatoon, SK
> >     S7N 5C9
> > _______________________________________________
> > Matterhorn-users mailing list
> > [email protected]
> > http://lists.opencastproject.org/mailman/listinfo/matterhorn-users



-- 
Christopher Brooks, BSc, MSc
ARIES Laboratory, University of Saskatchewan

Web: http://www.cs.usask.ca/~cab938
Phone: 1.306.966.1442
Mail: Advanced Research in Intelligent Educational Systems Laboratory
     Department of Computer Science
     University of Saskatchewan
     176 Thorvaldson Building
     110 Science Place
     Saskatoon, SK
     S7N 5C9