(first off, I'm moving this over to [email protected] for development triage)
My understanding is that the heartbeat is mostly there so the UI can see when services have gone offline, since if we fail to dispatch a job we'll just dispatch it elsewhere. So, is this correct:

1. If we disable the heartbeat by not spawning a JobProducerHearbeat thread, will this stop MH from functioning correctly?

Also, the install docs put the service registry on every server (e.g. http://opencast.jira.com/wiki/display/MH/Install+Across+Multiple+Servers+%28Trunk%29 ). This means that when one host decides to unregister another, the change lands in the same underlying DB that every host shares. So worker1 could mark worker2 as offline just because worker1 can't reach worker2. That seems unfortunate, since worker1 might be the one that is actually "offline" (while still having access to the shared DB).

2. Are the docs incorrect? Should we be installing a service registry stub somewhere instead? Does a stub exist?

And, just to suggest a new model for the service registry and online/offline status: if each component registers itself every 5 mins whenever it is not already registered, we should get a heartbeat-like effect. This will be failure prone (e.g. a host might think it's available when for some reason it isn't), but if failure just means a job can't be dispatched and is put back in the queue, it isn't a big concern. In this model, hosts are marked as offline when they fail to accept a new job, and they can reregister themselves at any time when they think they are ready to accept new jobs.

The only issue that comes up then is a host going offline while it's processing a job. It will still show up as online (e.g. since it can't write to the DB), and since it appears to be working, no one will try to push it a new job (so it will appear "stuck" instead of going offline). But judicious use of job timeouts might be able to compensate for this.

3. Is this a reasonable resolution?
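To make the proposed model a bit more concrete, here is a rough sketch in Python. It is not Matterhorn code: `Registry`, `Job`, `self_register`, the `accept` callback, and the one-hour `JOB_TIMEOUT` are all made up for illustration, and the shared DB is replaced by an in-memory object. The only rules it encodes are the ones described above: a host goes offline when it fails to accept a job, it can put itself back online at any time, and a job timeout catches a host that died mid-job.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set

REREGISTER_INTERVAL = 300.0  # seconds; the "every 5 mins" from the proposal
JOB_TIMEOUT = 3600.0         # seconds; arbitrary value chosen for illustration


@dataclass
class Job:
    job_id: str
    host: Optional[str] = None
    started_at: Optional[float] = None


class Registry:
    """In-memory stand-in for the shared service registry DB (hypothetical)."""

    def __init__(self) -> None:
        self.online: Set[str] = set()
        self.queue: List[Job] = []
        self.running: Dict[str, Job] = {}

    def register(self, host: str) -> None:
        # Idempotent, so a host can call this on every timer tick.
        self.online.add(host)

    def dispatch(self, job: Job, accept: Callable[[str, Job], bool],
                 now: float) -> Optional[str]:
        """Offer the job to each online host; accept() is True if the host took it."""
        for host in sorted(self.online):
            if accept(host, job):
                job.host, job.started_at = host, now
                self.running[job.job_id] = job
                return host
            # A failed dispatch is the only thing that marks a host offline.
            self.online.discard(host)
        self.queue.append(job)  # nobody took it: back in the queue
        return None

    def expire_stuck(self, now: float) -> List[Job]:
        """Requeue jobs that ran past JOB_TIMEOUT and mark their hosts offline."""
        expired = [j for j in self.running.values()
                   if now - j.started_at > JOB_TIMEOUT]
        for job in expired:
            del self.running[job.job_id]
            self.online.discard(job.host)
            job.host, job.started_at = None, None
            self.queue.append(job)
        return expired


def self_register(registry: Registry, host: str, ready: bool) -> None:
    """What each host would run every REREGISTER_INTERVAL seconds."""
    if ready:
        registry.register(host)
```

With two registered workers where only worker2 is reachable, a dispatch would knock worker1 offline and hand the job to worker2; worker1 comes back the next time its own timer fires and it considers itself ready. Note that no host ever marks another host offline based on its own connectivity, which sidesteps the worker1/worker2 problem described above.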
Chris

On Fri, 30 Sep 2011 14:10:28 +0200 Tobias Wunden <[email protected]> wrote:

> Hi Chris,
>
> we were able to reproduce the issue locally and are happy working
> with you to resolve it.
>
> Thanks,
> Tobias
>
> On 29.09.2011, at 22:47, Christopher Brooks <[email protected]>
> wrote:
>
> > Hi,
> >
> > Our machines constantly get marked as offline. Seems like under
> > load the heartbeat isn't getting through (for whatever reason).
> > We're disabling the heartbeat on our local system to make up for
> > this.
> >
> > Anyone else having these issues on a distributed deployment?
> >
> > Looking for people who might also be running into this, to help test
> > potential patches for 1.2.1.
> >
> > Chris
> >
> > --
> > Christopher Brooks, BSc, MSc
> > ARIES Laboratory, University of Saskatchewan
> >
> > Web: http://www.cs.usask.ca/~cab938
> > Phone: 1.306.966.1442
> > Mail: Advanced Research in Intelligent Educational Systems
> > Laboratory Department of Computer Science
> > University of Saskatchewan
> > 176 Thorvaldson Building
> > 110 Science Place
> > Saskatoon, SK
> > S7N 5C9
> > _______________________________________________
> > Matterhorn-users mailing list
> > [email protected]
> > http://lists.opencastproject.org/mailman/listinfo/matterhorn-users
