> engine.pause_workers, engine.resume_workers and engine.stop_workers ?

Yes, that would be great!
Combined with:

> So how about a "document" shared by all workers where they list:
>
> - hostname, pid
> - uptime
> - msgs processed during last week/day/hour/minute
> - timestamp

You could add a 'status' field to the document, so you would know whether each worker was paused, stopped or running.

Would you want to stop all workers, or each worker individually? (For our use case, stopping all of them is sufficient.)

It would also be nice to know how long a workitem was waiting around for a worker to get to it. That might be as meaningful as the number of processes.

Thanks,
Eric Smith

On Sep 6, 7:13 pm, John Mettraux <[email protected]> wrote:
> Hello Eric,
>
> On Tue, Sep 06, 2011 at 02:49:56PM -0700, eric smith wrote:
>
> > I ran into an interesting situation with a workflow. Our project
> > manager built a workflow and was trying to do something that was
> > 'legal' in ruote, but it ended up creating an endless rewind condition.
> > The net result was that the rewind ran for about 6 hours, creating 1.5
> > million audit entries.
>
> Ouch.
>
> > Obviously this was not his intent; besides telling him not to do that
> > again, it brought me back to my old instrumentation questions. When we
> > see ruote break, it is usually one of the following things.
>
> Let me state how I think each case should be handled. I understand the
> quick solutions I mention are not applicable in all the cases.
>
> > 1.) Somebody built a bad workflow.
>
> It should fail as soon as possible, with an error logged.
>
> > 2.) A participant died in an unexpected way.
>
> An error should be logged.
>
> > 3.) A participant tried to do something that took a long time.
>
> Timeouts could help.
>
> > 4.) Someone, or something, killed a worker while it was working.
>
> If it results in a workflow error, then the workflow can be replayed at
> the error, or the faulty branch can get re-applied.
>
> > 5.) We don't have enough workers running.
>
> The "engine" is visibly slow.
> > We currently use newrelic to let us peek into what the workers are
> > doing, but that does not give us enough info.
>
> > It is pretty easy for us to build a watchdog to govern the number of
> > history items that are created and shut them down if someone goes
> > crazy, but I was wondering if you had a better way.
>
> > Also, I was looking back at an old thread on fault tolerance and was
> > wondering if you have given any thought to this:
>
> > http://groups.google.com/group/openwferu-users/browse_thread/thread/c...
>
> > Specifically, letting workers 'talk' to the engine.
>
> Since the engine is the sum of its workers, let's translate that to "the
> workers somehow write in the storage some info about their existence and
> their activity".
>
> I went back to this previous conversation. Here is what I extracted from
> one of your posts:
>
> | I think this type of problem will continue to cause issues around
> | fault tolerance and instrumentation. You should be able to ask the
> | engine how many workers are running, how many are consuming. You
> | should be able to pause or stop the workers.
>
> About fault tolerance, I can only recommend manual or "on_error" error
> handling, ie letting your administrators peek at the error list
> frequently enough.
>
> Quick general reminder (and teaser for ruote 2.2.1):
>
> Every ruote service that responds to the #on_msg(msg) method will see
> that method get called for each message the worker it lives with
> successfully processes.
>
> ---8<---
> class ErrorNotifier
>   def initialize(context, opts={})
>     @new_relic = NewRelic.new(...)
>   end
>   def on_msg(msg)
>     return unless msg['action'] == 'error_intercepted'
>     @new_relic.emit(msg)
>   end
> end
> --->8---
>
> | You should be able to ask the engine how many workers are running,
> | how many are consuming.
> So how about a "document" shared by all workers where they list:
>
> - hostname, pid
> - uptime
> - msgs processed during last week/day/hour/minute
> - timestamp
>
> (what am I missing ?)
>
> With an Engine#status method to query that document ?
>
> | You should be able to pause or stop the workers.
>
> engine.pause_workers, engine.resume_workers and engine.stop_workers ?
>
> Do you need to pause one specific worker or a specific set of workers ?
>
> Thanks for the reminder. The reporting feature is easy to add, but I had
> forgotten it. I was (and am still) stuck on the "remotely
> pause/resume/stop workers" idea.
>
> --
> John Mettraux - http://lambda.io/processi

--
You received this message because you are subscribed to the "ruote users" group.
To post: send email to [email protected]
To unsubscribe: send email to [email protected]
More options: http://groups.google.com/group/openwferu-users?hl=en
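The per-worker entry in the shared status document proposed above could look like the following sketch, with the 'status' field Eric suggested added alongside the hostname/pid/uptime/count/timestamp list. The class name, the `pause!`/`resume!`/`stop!` methods and the hash keys are illustrative assumptions, not ruote's actual schema:

```ruby
require 'socket'
require 'time'

# one worker's entry in the shared status "document": the fields from
# the list above, plus a 'status' field (running / paused / stopped).
# Names are illustrative, not ruote's actual schema.
class WorkerStatus
  def initialize
    @started_at = Time.now
    @processed  = 0
    @status     = 'running'
  end

  # called once per successfully processed msg
  def msg_processed
    @processed += 1
  end

  def pause!  ; @status = 'paused'  ; end
  def resume! ; @status = 'running' ; end
  def stop!   ; @status = 'stopped' ; end

  # the hash this worker would periodically put into shared storage
  def to_h
    now = Time.now
    {
      'hostname'  => Socket.gethostname,
      'pid'       => Process.pid,
      'uptime'    => now - @started_at,
      'processed' => @processed,
      'status'    => @status,
      'put_at'    => now.utc.iso8601
    }
  end
end

status = WorkerStatus.new
3.times { status.msg_processed }
status.pause!
puts status.to_h.inspect
```

An Engine#status query would then just read all such entries back from the storage; a worker whose 'put_at' timestamp has gone stale is a candidate for case 4 above (a killed worker).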
