Hello all,

I ran into an interesting situation with a workflow. Our project
manager built a workflow that did something that was 'legal' in ruote,
but it ended up creating an endless rewind condition. The net result
was that the rewind ran for about 6 hours and created 1.5 million
audit entries.

Obviously this was not his intent. Besides telling him not to do that
again, it brought me back to my old instrumentation questions. When we
see ruote break, it is usually one of the following things:

1.) Somebody built a bad workflow.
2.) A participant died in an unexpected way.
3.) A participant tried to do something that took a long time.
4.) Someone or something killed a worker while it was working.
5.) We don't have enough workers running.

We currently use New Relic to let us peek into what the workers are
doing, but that does not give us enough information.

It is pretty easy for us to build a watchdog that monitors the number
of history items being created and shuts a workflow down if someone
goes crazy, but I was wondering if you had a better way.
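
To make the watchdog idea concrete, here is a minimal sketch in plain
Ruby. It does not use any ruote API: the class name HistoryWatchdog, the
record method, and the on_runaway callback are all hypothetical, and
wiring record into wherever history entries get written is left open.
It counts entries per workflow instance inside a sliding time window
and fires the callback once when the limit is exceeded. For scale, the
runaway described above averaged roughly 70 entries per second
(1.5 million over 6 hours).

```ruby
# Hypothetical watchdog, independent of ruote: counts history entries per
# workflow instance id (wfid) inside a sliding time window.
class HistoryWatchdog
  # max_entries    - entries allowed per wfid within the window
  # window_seconds - length of the sliding window
  # on_runaway     - callable invoked once per offending wfid, with
  #                  (wfid, entry_count); e.g. cancel the process there
  def initialize(max_entries:, window_seconds:, on_runaway:)
    @max_entries = max_entries
    @window_seconds = window_seconds
    @on_runaway = on_runaway
    @timestamps = Hash.new { |hash, key| hash[key] = [] }
    @flagged = {}
  end

  # Call once for every history entry written for wfid.
  # Returns true only on the call that trips the limit.
  def record(wfid, now = Time.now)
    times = @timestamps[wfid]
    times << now

    # Forget entries that have fallen out of the window.
    cutoff = now - @window_seconds
    times.shift until times.empty? || times.first >= cutoff

    return false if times.size <= @max_entries || @flagged[wfid]

    @flagged[wfid] = true
    @on_runaway.call(wfid, times.size)
    true
  end
end
```

With something like max_entries: 1000 and window_seconds: 60, a
well-behaved workflow would never trip it, while a loop like the one
above would be caught within the first minute. The thresholds are
guesses and would need tuning against real traffic.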

Also I was looking back at an old thread on fault tolerance and was
wondering if you have given any thought to this:

http://groups.google.com/group/openwferu-users/browse_thread/thread/c51b94fb8bb685da/3750af5580163949?lnk=gst&q=best+practice#3750af5580163949

Specifically, letting workers 'talk' to the engine.

Thanks
Eric Smith

