> engine.pause_workers, engine.resume_workers and engine.stop_workers ?

Yes, that would be great!
Combined with:

> So how about a "document" shared by all workers where they list:
>
> - hostname, pid
> - uptime
> - msgs processed during last week/day/hour/minute
> - timestamp

You could add a 'status' field to the document, so you would know whether each worker was paused, stopped or running.

Would you want to stop all workers, or each worker individually? (For our use case, stopping all of them is sufficient.)

It would also be nice to know how long a workitem was waiting around for a worker to get to it. That might be as meaningful as the number of processes.

Thanks,
Eric Smith

On Sep 6, 7:13 pm, John Mettraux <[email protected]> wrote:
> Hello Eric,
>
> On Tue, Sep 06, 2011 at 02:49:56PM -0700, eric smith wrote:
>
> > I ran into an interesting situation with a workflow. Our project
> > manager built a workflow and was trying to do something that was
> > 'legal' in ruote, but it ended up creating an endless rewind condition.
> > The net result was that the rewind ran for about 6 hours, creating 1.5
> > million audit entries.
>
> Ouch.
>
> > Obviously this was not his intent; besides telling him not to do that
> > again, it brought me back to my old instrumentation questions. When we
> > see ruote break, it is usually one of the following things.
>
> Let me state how I think each case should be handled. I understand the
> quick solutions I mention are not applicable in all the cases.
>
> > 1.) Somebody built a bad workflow.
>
> It should fail as soon as possible, with an error logged.
>
> > 2.) A participant died in an unexpected way.
>
> An error should be logged.
>
> > 3.) A participant tried to do something that took a long time.
>
> Timeouts could help.
>
> > 4.) Someone, or something, killed a worker while it was working.
>
> If it results in a workflow error, then the workflow can be replayed at
> the error, or the faulty branch can get re-applied.
>
> > 5.) We don't have enough workers running.
>
> The "engine" is visibly slow.
> > We currently use newrelic to let us peek into what the workers are
> > doing, but that does not give us enough info.
>
> > It is pretty easy for us to build a watchdog to govern the number of
> > history items that are created and shut them down if someone goes
> > crazy, but I was wondering if you had a better way.
>
> > Also, I was looking back at an old thread on fault tolerance and was
> > wondering if you have given any thought to this:
>
> > http://groups.google.com/group/openwferu-users/browse_thread/thread/c...
>
> > Specifically, letting workers 'talk' to the engine.
>
> Since the engine is the sum of its workers, let's translate that to "the
> workers somehow write in the storage some info about their existence and
> their activity".
>
> I went back to this previous conversation. Here is what I extracted from
> one of your posts:
>
> | I think this type of problem will continue to cause issues around
> | fault tolerance and instrumentation. You should be able to ask the
> | engine how many workers are running, how many are consuming. You
> | should be able to pause or stop the workers.
>
> About fault tolerance, I can only recommend manual or "on_error" error
> handling, ie letting your administrators peek at the error list
> frequently enough.
>
> Quick general reminder (and teaser for ruote 2.2.1):
>
> Every ruote service that responds to the #on_msg(msg) method will see
> that method get called for each message the worker it lives with
> successfully processes.
>
> ---8<---
> class ErrorNotifier
>   def initialize(context, opts={})
>     @new_relic = NewRelic.new(...)
>   end
>   def on_msg(msg)
>     return unless msg['action'] == 'error_intercepted'
>     @new_relic.emit(msg)
>   end
> end
> --->8---
>
> | You should be able to ask the engine how many workers are running,
> | how many are consuming.
> So how about a "document" shared by all workers where they list:
>
> - hostname, pid
> - uptime
> - msgs processed during last week/day/hour/minute
> - timestamp
>
> (what am I missing ?)
>
> With an Engine#status method to query that document ?
>
> | You should be able to pause or stop the workers.
>
> engine.pause_workers, engine.resume_workers and engine.stop_workers ?
>
> Do you need to pause one specific worker or a specific set of workers ?
>
> Thanks for the reminder. The reporting feature is easy to add, but I had
> forgotten it. I was (and am still) stuck on the "remotely
> pause/resume/stop workers" idea.
>
> --
> John Mettraux - http://lambda.io/processi

--
You received this message because you are subscribed to the "ruote users" group.
To post: send email to [email protected]
To unsubscribe: send email to [email protected]
More options: http://groups.google.com/group/openwferu-users?hl=en
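The per-worker entry in the shared status document proposed above could look like the following sketch, with the 'status' field Eric suggested added alongside the hostname/pid/uptime/count/timestamp list. The class name, the `pause!`/`resume!`/`stop!` methods and the hash keys are illustrative assumptions, not ruote's actual schema:

```ruby
require 'socket'
require 'time'

# one worker's entry in the shared status "document": the fields from
# the list above, plus a 'status' field (running / paused / stopped).
# Names are illustrative, not ruote's actual schema.
class WorkerStatus
  def initialize
    @started_at = Time.now
    @processed  = 0
    @status     = 'running'
  end

  # called once per successfully processed msg
  def msg_processed
    @processed += 1
  end

  def pause!  ; @status = 'paused'  ; end
  def resume! ; @status = 'running' ; end
  def stop!   ; @status = 'stopped' ; end

  # the hash this worker would periodically put into shared storage
  def to_h
    now = Time.now
    {
      'hostname'  => Socket.gethostname,
      'pid'       => Process.pid,
      'uptime'    => now - @started_at,
      'processed' => @processed,
      'status'    => @status,
      'put_at'    => now.utc.iso8601
    }
  end
end

status = WorkerStatus.new
3.times { status.msg_processed }
status.pause!
puts status.to_h.inspect
```

An Engine#status query would then just read all such entries back from the storage; a worker whose 'put_at' timestamp has gone stale is a candidate for case 4 above (a killed worker).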
