Hello Eric,

On Tue, Sep 06, 2011 at 02:49:56PM -0700, eric smith wrote:
>
> I ran into an interesting situation with a workflow. Our project
> manager built a workflow and was trying to do something that was
> 'legal' in ruote but it ended up creating an endless rewind condition.
> The net result was the rewind ran for about 6 hours creating 1.5
> million audit entries.

Ouch.

> Obviously this was not his intent, besides telling him not to do that
> again it brought me back to my old instrumentation questions. When we
> see ruote break it is usually one of the following things.

Let me state how I think each case should be handled. I understand the quick solutions I mention are not applicable in all cases.

> 1.) Somebody built a bad workflow.

It should fail as soon as possible with an error logged.

> 2.) A participant died in an unexpected way.

An error should be logged.

> 3.) A participant tried to do something that took a long time.

Timeouts could help.

> 4.) Someone, or something killed a worker while it was working.

If it results in a workflow error, then the workflow can be replayed at the error or the faulty branch can get re-applied.

> 5.) We don't have enough workers running.

The "engine" is visibly slow.

> We currently use newrelic to let us peek into what the workers are
> doing but that does not give us enough info.
>
> It is pretty easy for us to build a watchdog to govern the number of
> history items that are created and shut them down if someone goes
> crazy, but I was wondering if you had a better way.
>
> Also I was looking back at an old thread on fault tolerance and was
> wondering if you have given any thought to this:
>
> http://groups.google.com/group/openwferu-users/browse_thread/thread/c51b94fb8bb685da/3750af5580163949?lnk=gst&q=best+practice#3750af5580163949
>
> Specifically, letting workers 'talk' to the engine.

Since the engine is the sum of workers, let's translate that to "the workers somehow write in the storage some info about their existence and their activity".

I went back to this previous conversation. Here is what I extracted from one of your posts:

| I think this type of problem will continue to cause issues around
| fault tolerance and instrumentation. You should be able to ask the
| engine how many workers are running, how many are consuming. You
| should be able to pause or stop the workers.

About fault tolerance, I can only recommend manual or "on_error" error handling, i.e. letting your administrators peek at the error list frequently enough.

Quick general reminder (and teaser for ruote 2.2.1):

Every ruote service that responds to the #on_msg(msg) method will see that method get called for each message the worker it lives with successfully processes.

---8<---
class ErrorNotifier

  def initialize(context, opts={})
    @new_relic = NewRelic.new(...)
  end

  # called by the worker for each msg it successfully processed
  def on_msg(msg)
    # only intercepted errors are forwarded
    return unless msg['action'] == 'error_intercepted'
    @new_relic.emit(msg)
  end
end
--->8---

| You should be able to ask the engine how many workers are running,
| how many are consuming.

So how about a "document" shared by all workers where they list:

- hostname, pid
- uptime
- msgs processed during last week/day/hour/minute
- timestamp

(what am I missing ?)

With an Engine#status method to query that document ?

| You should be able to pause or stop the workers.

engine.pause_workers, engine.resume_workers and engine.stop_workers ?

Do you need to pause one specific worker or a specific set of workers ?

Thanks for the reminder, the reporting feature is easy to add, but I had forgotten it. I was (am still) stuck on the "remotely pause/resume/stop workers" idea.

--
John Mettraux - http://lambda.io/processi

--
you received this message because you are subscribed to the "ruote users" group.
more options : http://groups.google.com/group/openwferu-users?hl=en
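
As a rough illustration of the shared status "document" idea above, here is a minimal sketch in plain Ruby. It is not ruote code: the WorkerStatus class, its #record and #to_h methods and the key names are all hypothetical, and how the resulting hash would be written to the storage (or read back by something like Engine#status) is left open. It only shows how one worker could gather the listed fields.

```ruby
require 'socket'

# Hypothetical sketch, not part of ruote. Each worker would keep one
# instance and periodically write #to_h into the shared document.
class WorkerStatus

  # window name => length in seconds
  WINDOWS = {
    'minute' => 60, 'hour' => 3600, 'day' => 86_400, 'week' => 604_800
  }

  def initialize(now=Time.now)
    @started_at = now
    @processed = [] # timestamps of processed msgs, oldest first
  end

  # To be called once per processed msg (for example from #on_msg).
  def record(now=Time.now)
    @processed << now
    # forget entries older than the largest window
    limit = now - WINDOWS['week']
    @processed.shift while @processed.first && @processed.first < limit
  end

  # The entry this worker would contribute to the shared document.
  def to_h(now=Time.now)
    counts = WINDOWS.each_with_object({}) do |(name, secs), h|
      h[name] = @processed.count { |t| t > now - secs }
    end
    {
      'hostname' => Socket.gethostname,
      'pid' => Process.pid,
      'uptime' => (now - @started_at).to_i, # seconds
      'processed' => counts,
      'timestamp' => now.utc.to_s
    }
  end
end
```

Keeping one timestamp per msg is the simplest thing that works for a sketch; a busy worker would probably prefer per-minute counters to keep memory bounded.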
