On Sat, Nov 12, 2011 at 03:32:31PM -0800, Nathan wrote:
>
> We had an experience this weekend where we attempted to roll out our
> ruote-powered application to a new segment of users. However, we had
> to roll back our efforts pretty quickly because our work item
> processing started taking up to 10 minutes, particularly when creating
> new workflows (as opposed to advancing living workflows, even though
> that slowed to a crawl as well). Our application serves work items in
> real time via a UI to users based on their actions and the workflow
> definitions, so we are hoping for response times of a few seconds at
> most.
>
> On Monday we are going to start picking things apart, trying to figure
> out what is wrong with our setup. We threw together the MongoDB
> storage late last year and have been using it since, but we haven't
> really load tested it, updated it for the latest version of Ruote, or
> tried to make it work with multiple workers. We have noticed however
> that it has a pretty high CPU utilization which has been growing over
> time and now rests at over 50%.

Hello Nathan,

here are a few questions/ideas.

Ruote is polling for msgs and schedules. Could this be the cause? An engine 
with a FsStorage polls at least twice per second, and it typically uses about 
0.3% of the CPU (Mac OS X).

You should try to determine why it's so busy. Try to put together a single-file 
script involving the engine, the MongoDB storage and two or three workflow 
runs, and measure.
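Something like the following plumbing could frame such a script. The ruote-specific parts (engine/storage setup, launches) are left as comments since they depend on your setup; a stand-in workload is used here so the timing shell runs on its own:

```ruby
require 'benchmark'

# Measure a named step and report how long it took; in the real
# script the blocks would contain the ruote calls (launching a
# process, waiting for it to advance or terminate).
def timed(label)
  elapsed = Benchmark.realtime { yield }
  puts format('%-12s %.3fs', label, elapsed)
  elapsed
end

# Stand-in workloads; replace with e.g.
#   timed('launch')  { engine.launch(definition) }
#   timed('advance') { engine.wait_for(wfid) }
t1 = timed('launch')  { 10_000.times.map { |i| i * i }.sum }
t2 = timed('advance') { sleep 0.01 }

puts 'done' if t1 >= 0 && t2 >= 0
```

Run it once against MongoDB and once against another storage and the numbers should point at the culprit.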

Maybe we could work together on the MongoDB storage (it could give me an 
occasion to learn about MongoDB).

> Anyway, the first thing I want to try when troubleshooting this is
> swapping out the MongoDB storage for another storage, preferably Redis
> based on speed. If that works well, then I know the culprit is our
> storage adapter. Otherwise, I'll have to dig deeper. My main question
> is this:
>
> * is there a reasonable way to migrate ruote from one storage to
> another? I'd like to do our test on a copy of the production database.

All the storages have a #copy_to method whose default implementation is

  https://github.com/jmettraux/ruote/blob/master/lib/ruote/storage/base.rb#L242-267

Here is an example usage:

  https://github.com/jmettraux/ruote/blob/master/test/functional/ft_42_storage_copy.rb
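To make the pattern concrete, here is a toy, hash-backed storage that mimics the shape of that default #copy_to (walk each document type, re-put every document into the target). With the real classes, the migration would be along the lines of mongo_storage.copy_to(redis_storage):

```ruby
# A hash-backed stand-in storage, just to illustrate the #copy_to
# pattern; the real classes would be the MongoDB and Redis storages.
class ToyStorage
  TYPES = %w[ configurations expressions msgs schedules workitems ]

  attr_reader :docs

  def initialize
    @docs = Hash.new { |h, k| h[k] = [] }
  end

  def get_many(type)
    @docs[type]
  end

  def put(doc)
    @docs[doc['type']] << doc
  end

  # same shape as the default implementation: for each document
  # type, fetch every doc from self and put it into the target
  def copy_to(target)
    TYPES.each do |type|
      get_many(type).each { |doc| target.put(doc) }
    end
    target
  end
end

source = ToyStorage.new
source.put('type' => 'expressions', '_id' => '0_0!wfid')
source.put('type' => 'schedules', '_id' => 'at-123')

target = source.copy_to(ToyStorage.new)
puts target.docs['expressions'].size  # => 1
```

Doing the copy on a copy of the production database, as you plan, is the right move.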

> My next question is pretty broad, so I apologize for that, but
>
> * Are there any known performance bottlenecks or hot spots we should
> be looking at? We will profile of course, but if there are some
> obvious places to put the tip of the chisel that would be great to
> know.

Well, a worker processes only one workflow operation at a time, so with a 
single worker only one workflow is advancing at any given time, with the 
exception of the participant work (see 
http://groups.google.com/group/openwferu-users/browse_thread/thread/e72368cf72954cd9)

> Also,
>
> * I am guessing we have a number of workflows in the database that are
> "dead" or "orphaned" - workflows and processes that, due to exceptions
> or un-clean resets were never completed or cancelled. Could this
> affect performance in a significant way? Should we routinely attempt
> to clean out orphans?

If you use Engine#processes a lot, then yes: the default implementation lists 
all expressions and errors to build ProcessStatus instances, so it could be 
losing time listing dead stuff.

> Currently our ruote database (in MongoDB) is 1.4GB with about 3K
> schedules and 190K expressions.

3K schedules, that could be the problem. The current default implementation of 
#get_schedules is

  https://github.com/jmettraux/ruote/blob/master/lib/ruote/storage/base.rb#L177-195

As you can see, a potential optimization is commented out. This dumb 
implementation goes "let's fetch all the schedules in memory and then filter 
the ones that have to get triggered now".

The solution would be, in ruote-mongodb, to override #get_schedules and do the 
filtering in MongoDB itself so that those 3K schedules don't get copied twice 
per second.
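A sketch of the difference, with in-memory schedule documents standing in for the collection and the 'at' field assumed to be a sortable UTC ISO8601 string (the field name and the MongoDB-side query are the parts to verify against ruote-mongodb; the driver call is left as a comment):

```ruby
require 'time'

# In-memory stand-in: each schedule knows when it should fire ('at').
schedules = [
  { '_id' => 's0', 'at' => (Time.now - 60).utc.iso8601 },   # overdue
  { '_id' => 's1', 'at' => (Time.now + 3600).utc.iso8601 }  # fires later
]

# The dumb default: fetch everything, then filter in Ruby.
# UTC ISO8601 strings compare correctly as plain strings.
now = Time.now.utc.iso8601
due = schedules.select { |s| s['at'] <= now }

# The optimization is to push that predicate into the store, e.g.
# with the mongo driver (hypothetical, not run here):
#   collection.find('at' => { '$lte' => now }).to_a
puts due.map { |s| s['_id'] }.inspect  # => ["s0"]
```

With 3K schedules, the difference between shipping all of them over the wire twice a second and shipping only the due ones should be very visible.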

ruote-couch (sorry, not the fastest), for example, has a dedicated 
#get_schedules implementation.

I just checked ruote-redis and it doesn't have any such optimization (I have 
to do something about that).

> Our workflows are pretty big - so each
> expression is fairly large in size. Maybe much of this is cruft, not
> sure - but I'm curious how our setup compares to others? Is this
> large, average, very small? Do any of you have experience with DB
> sizes this big or bigger?

Which JSON library are you using? I'm happy with yajl-ruby; it's faster than 
all the others.
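For a rough comparison, a micro-benchmark with the stdlib json as a baseline (the document shape and counts are made up, just meant to resemble a big expression); swapping in yajl-ruby is left as a comment:

```ruby
require 'json'
require 'benchmark'

# A large-ish document, standing in for a big ruote expression.
doc = { 'fields' => (1..500).map { |i| ["k#{i}", 'x' * 20] }.to_h }

# Time 1000 encode/decode round-trips with the stdlib.
elapsed = Benchmark.realtime do
  1_000.times { JSON.parse(JSON.generate(doc)) }
end
puts format('stdlib json: %.3fs', elapsed)

# With yajl-ruby installed, the same round-trip would be
# (not run here):
#   Yajl::Parser.parse(Yajl::Encoder.encode(doc))
```

Since every document read and write goes through the JSON layer, a faster library pays off across the board.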

I personally only have experience with much smaller deployments. I'd rate your 
shop as big. I wonder how David's Meego team compares.

> How long should it take to launch a workflow
> of substantial size?

The launch itself is only about copying the whole tree and the initial workitem 
fields, it shouldn't be that time-consuming.

So I'd investigate #get_schedules. An easy first step is to measure how long 
the call to get_schedules in lib/ruote/worker.rb takes, and then move to 
something like "get_schedules = get all the schedules where at < now"
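A sketch of that measuring step, using a stand-in class so it runs on its own; the same Module#prepend trick could wrap the real method on the storage class without editing ruote itself:

```ruby
require 'benchmark'

# Stand-in for the object carrying get_schedules; in ruote the
# method lives on the storage used by the worker.
class Worker
  def get_schedules
    sleep 0.005  # pretend to hit the storage
    []
  end
end

# Wrap the method so every call logs its duration to stderr.
module TimedGetSchedules
  def get_schedules
    result = nil
    elapsed = Benchmark.realtime { result = super }
    warn format('get_schedules took %.4fs', elapsed)
    result
  end
end

Worker.prepend(TimedGetSchedules)

Worker.new.get_schedules  # logs the elapsed time, returns the schedules
```

If those logged times dominate each worker beat, the #get_schedules override is the place to spend Monday.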


I hope this helps,

--
John Mettraux - http://lambda.io/processi

-- 
you received this message because you are subscribed to the "ruote users" group.
to post : send email to [email protected]
to unsubscribe : send email to [email protected]
more options : http://groups.google.com/group/openwferu-users?hl=en
