Re: [ruote:3255] Rouge workflow

John Mettraux Tue, 06 Sep 2011 17:13:53 -0700

Hello Eric,

On Tue, Sep 06, 2011 at 02:49:56PM -0700, eric smith wrote:
>
> I ran into an interesting situation with a workflow. Our project
> manager built a workflow and was trying to do something that was
> 'legal' in ruote but it ended up creating an endless rewind condition.
> The net result was the rewind ran for about 6 hours creating 1.5
> million audit entries.


Ouch.

> Obviously this was not his intent, besides telling him not to do that
> again it brought me back to my old instrumentation questions. When we
> see ruote break it is usually one of the following things.

Let me still say how I think each case should be handled. I understand that the
quick solutions I mention are not applicable in all cases.

> 1.) Somebody built a bad workflow.

It should fail as soon as possible with an error logged.

> 2.) A participant died in an unexpected way.

An error should be logged.

> 3.) A participant tried to do something that took a long time.

Timeouts could help.

> 4.) Someone, or something killed a worker while it was working.

If it results in a workflow error then the workflow can be replayed at the 
error or the faulty branch can get re-applied.

> 5.) We don't have enough workers running.

The "engine" is visibly slow.

> We currently use newrelic to let us peek into what the workers are
> doing but that does not give us enough info.
>
> It is pretty easy for us to build a watchdog to govern the number of
> history items that are created and shut them down if someone goes
> crazy, but I was wondering if you had a better way.
>
> Also I was looking back at an old thread on fault tolerance and was
> wondering if you have given any thought to this:
>
> http://groups.google.com/group/openwferu-users/browse_thread/thread/c51b94fb8bb685da/3750af5580163949?lnk=gst&q=best+practice#3750af5580163949
>
> Specifically, letting workers 'talk' to the engine.

Since the engine is the sum of workers, let's translate that to "the workers 
somehow write in the storage some info about their existence and their 
activity".

I went back to this previous conversation. Here is what I extracted from one of 
your posts:

| I think this type of problem will continue to cause issues around
| fault tolerance and instrumentation. You should be able to ask the
| engine how many workers are running, how many are consuming. You
| should be able to pause or stop the workers.

About fault tolerance, I can only recommend manual or "on_error" error
handling, i.e. letting your administrators peek at the error list frequently
enough.

Quick general reminder (and teaser for ruote 2.2.1):

Every ruote service that responds to the #on_msg(msg) method will see that 
method get called for each message the worker it lives with successfully
processes.

---8<---
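# A ruote service; the worker calls #on_msg for each message it processes.
class ErrorNotifier
  def initialize(context, opts={})
    @new_relic = NewRelic.new(...)
  end
  # Forwards intercepted errors only, ignores every other message.
  def on_msg(msg)
    return unless msg['action'] == 'error_intercepted'
    @new_relic.emit(msg)
  end
end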
--->8---
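
The same #on_msg hook could also serve as the watchdog mentioned earlier in the thread. What follows is only a rough sketch, not ruote code: the class name, the 'max_msgs' and 'window' options and the assumption that a message carries its workflow id under 'wfid' or 'fei' => 'wfid' are all hypothetical. It counts messages per workflow instance over a sliding window and remembers the instances that exceed the limit.

```ruby
# Hypothetical watchdog service: counts messages per workflow instance
# and flags instances that exceed a limit within a sliding time window.
class RunawayWatchdog
  attr_reader :flagged

  # 'max_msgs' and 'window' (seconds) are illustrative options, not ruote settings.
  def initialize(context = nil, opts = {})
    @max_msgs = opts.fetch('max_msgs', 1000)
    @window   = opts.fetch('window', 60)
    @seen     = Hash.new { |h, k| h[k] = [] } # wfid => [timestamps]
    @flagged  = []
  end

  # The worker would call this with the message only; 'now' is injectable
  # so that the sketch can be exercised without waiting.
  def on_msg(msg, now = Time.now.to_f)
    wfid = msg['wfid'] || (msg['fei'] || {})['wfid'] # assumed message layout
    return unless wfid

    stamps = @seen[wfid]
    stamps << now
    # Drop timestamps that fell out of the window.
    stamps.shift while stamps.first < now - @window

    @flagged << wfid if stamps.size > @max_msgs && !@flagged.include?(wfid)
  end
end
```

What happens once an instance is flagged (alerting an administrator, cancelling the instance) is deliberately left out, since it depends on the deployment.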


| You should be able to ask the engine how many workers are running,
| how many are consuming.

So how about a "document" shared by all workers where they list:

- hostname, pid
- uptime
- msgs processed during last week/day/hour/minute
- timestamp

(what am I missing ?)

With an Engine#status method to query that document ?
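
To make the idea more concrete, here is a minimal sketch of what one worker's entry in such a document might look like. Nothing here is existing ruote API: WorkerInfo, merge_status and the key names are assumptions for illustration, and a plain Hash stands in for the storage.

```ruby
require 'socket'
require 'time'

# Hypothetical per-worker bookkeeping; none of these names are ruote API.
class WorkerInfo
  WINDOWS = { 'minute' => 60, 'hour' => 3600, 'day' => 86_400, 'week' => 604_800 }.freeze

  def initialize(now = Time.now)
    @started_at = now
    @processed  = [] # timestamps of processed msgs (a real version would bucket these)
  end

  # To be called once per processed message.
  def record(now = Time.now)
    @processed << now
  end

  # Builds the entry this worker would write into the shared document.
  def to_h(now = Time.now)
    counts = WINDOWS.each_with_object({}) do |(name, secs), h|
      h[name] = @processed.count { |t| t > now - secs }
    end
    {
      'hostname'  => Socket.gethostname,
      'pid'       => Process.pid,
      'uptime'    => (now - @started_at).round, # seconds
      'processed' => counts,
      'timestamp' => now.utc.iso8601
    }
  end
end

# The shared "document": one entry per worker, keyed by "hostname/pid".
def merge_status(document, info, now = Time.now)
  entry = info.to_h(now)
  document.merge("#{entry['hostname']}/#{entry['pid']}" => entry)
end
```

A status query would then only have to read the document back. An entry whose timestamp is old would hint at a worker that died or got stuck.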


| You should be able to pause or stop the workers.

engine.pause_workers, engine.resume_workers and engine.stop_workers ?

Do you need to pause one specific worker or a specific set of workers ?
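
One possible shape for this, sketched under the assumption that workers poll a shared control document between messages: WorkerControl, the 'worker_state' key and the step method are all hypothetical names, and a Hash again stands in for the storage. An entry for a specific worker overrides the global 'all' entry, which would cover both the "all workers" and the "one specific worker" cases.

```ruby
# Hypothetical shared control document; a plain Hash stands in for the storage.
class WorkerControl
  STATES = %w[running paused stopped].freeze

  def initialize(storage = {})
    @storage = storage
    @storage['worker_state'] ||= { 'all' => 'running' }
  end

  # target is 'all' or a worker id such as "hostname/pid".
  def set(state, target = 'all')
    raise ArgumentError, "unknown state: #{state}" unless STATES.include?(state)

    @storage['worker_state'][target] = state
  end

  # A worker-specific entry overrides the global one.
  def state_for(worker_id)
    doc = @storage['worker_state']
    doc.fetch(worker_id, doc['all'])
  end
end

# One pass of a worker loop that consults the control document first.
# Returns :stop, :idle, or the message taken from the queue.
def step(control, worker_id, queue)
  case control.state_for(worker_id)
  when 'stopped' then :stop
  when 'paused'  then :idle
  else queue.empty? ? :idle : queue.shift
  end
end
```

The cost of this approach is one extra storage read per loop pass, and a pause only takes effect once the worker finishes the message it is currently processing.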


Thanks for the reminder, the reporting feature is easy to add, but I had 
forgotten it. I was (am still) stuck on the "remotely pause/resume/stop 
workers" idea.

--
John Mettraux - http://lambda.io/processi

-- 
you received this message because you are subscribed to the "ruote users" group.
to post : send email to [email protected]
to unsubscribe : send email to [email protected]
more options : http://groups.google.com/group/openwferu-users?hl=en
