John, thank you for all the pointers. Today we will set up a test
environment, apply some realistic loads, take measurements, and take a
closer look at all the points you mentioned. One question on the
schedules - if a worker's behavior is to pull all schedules and fire
the triggered ones, how does this work in a multi-worker environment?
Is that what "reserve" is used for in the storage? (We haven't
implemented reserve in MongoDB, but probably should.)

Once we get this sorted out I would love to work with you on the
MongoDB driver; that is a great suggestion.

Thanks again,

Nathan 

-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of John Mettraux
Sent: Sunday, November 13, 2011 5:15 PM
To: [email protected]
Subject: Re: [ruote:3282] Change storage implementations in production
and other questions :)


On Sat, Nov 12, 2011 at 03:32:31PM -0800, Nathan wrote:
>
> We had an experience this weekend where we attempted to roll out our 
> ruote-powered application to a new segment of users. However, we had 
> to roll back our efforts pretty quickly because our work item 
> processing started taking up to 10 minutes, particularly when creating
> new workflows (as opposed to advancing living workflows, even though
> that slowed to a crawl as well). Our application serves work items in
> real time via a UI to users based on their actions and the workflow
> definitions, so we are hoping for response times of a few seconds at 
> most.
>
> On Monday we are going to start picking things apart, trying to figure
> out what is wrong with our setup. We threw together the MongoDB
> storage late last year and have been using it since, but we haven't 
> really load tested it, updated it for the latest version of Ruote, or 
> tried to make it work with multiple workers. We have noticed however 
> that it has a pretty high CPU utilization which has been growing over 
> time and now rests at over 50%.

Hello Nathan,

here are a few questions/ideas.

Ruote is polling for msgs and schedules. Could this be the cause? An
engine with a FsStorage polls at least twice per second and typically
uses 0.3% of the CPU (Mac OS X).

You should try to determine why it's so busy. Try to put together a
single-file script involving the engine, the MongoDB storage and two or
three workflow runs, and measure.
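
Something along these lines maybe (the MongoDB storage class name and
its constructor arguments are assumptions, adapt them to your driver;
Ruote::NoOpParticipant replies immediately, so only ruote + storage get
measured):

  require 'ruote'
  # require your MongoDB storage implementation here

  # hypothetical class name and argument
  storage = RuoteMongoDB::Storage.new('mongodb://localhost/ruote_test')

  engine = Ruote::Engine.new(Ruote::Worker.new(storage))

  # every participant replies immediately
  engine.register_participant(/.+/, Ruote::NoOpParticipant)

  pdef = Ruote.define do
    alpha
    bravo
    charly
  end

  start = Time.now

  3.times do
    wfid = engine.launch(pdef)
    engine.wait_for(wfid)  # blocks until the process ends (or errors)
  end

  puts "3 runs: #{Time.now - start} seconds"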

Maybe we could work together on the MongoDB storage (it could give me an
occasion to learn about MongoDB).

> Anyway, the first thing I want to try when troubleshooting this is 
> swapping out the MongoDB storage for another storage, preferably Redis
> based on speed. If that works well, then I know the culprit is our
> storage adapter. Otherwise, I'll have to dig deeper. My main question 
> is this:
>
> * is there a reasonable way to migrate ruote from one storage to 
> another? I'd like to do our test on a copy of the production database.

All the storages have a #copy_to method whose default implementation is

 
https://github.com/jmettraux/ruote/blob/master/lib/ruote/storage/base.rb#L242-267

Here is an example usage:

 
https://github.com/jmettraux/ruote/blob/master/test/functional/ft_42_storage_copy.rb
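
A minimal migration sketch, assuming a hypothetical
RuoteMongoDB::Storage class on your side and ruote-redis on the other
(check the real class names and constructor arguments in the drivers):

  require 'redis'
  require 'ruote'
  require 'ruote-redis'

  # source: the MongoDB storage (hypothetical class / args)
  source = RuoteMongoDB::Storage.new('mongodb://localhost/ruote_copy')

  # target: a Redis storage
  target = Ruote::Redis::Storage.new(::Redis.new(:db => 14))

  # copies every document (expressions, schedules, ...) to the target
  source.copy_to(target)

Running this against a copy of the production database, as you plan,
keeps the experiment safe.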

> My next question is pretty broad, so I apologize for that, but
>
> * Are there any known performance bottlenecks or hot spots we should 
> be looking at? We will profile of course, but if there are some 
> obvious places to put the tip of the chisel that would be great to 
> know.

Well, a worker processes only one workflow operation at a time, so with
a single worker only one workflow is advancing at any given moment, with
the exception of the participant work (see
http://groups.google.com/group/openwferu-users/browse_thread/thread/e72368cf72954cd9)
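
If you move to multiple workers, the extra workers simply share the
storage. A rough sketch (storage class name assumed, one worker per
process):

  require 'ruote'
  # require your storage implementation here

  # process 1: engine + worker
  storage = RuoteMongoDB::Storage.new('mongodb://localhost/ruote')
  engine = Ruote::Engine.new(Ruote::Worker.new(storage))

  # process 2..n: a worker only, pointed at the same storage
  #
  #   worker = Ruote::Worker.new(RuoteMongoDB::Storage.new(...))
  #   worker.run  # blocks, competing with the others for msgs/schedules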

> Also,
>
> * I am guessing we have a number of workflows in the database that are
> "dead" or "orphaned" - workflows and processes that, due to exceptions
> or un-clean resets were never completed or cancelled. Could this 
> affect performance in a significant way? Should we routinely attempt 
> to clean out orphans?

If you use Engine#processes a lot then yes, the default implementation
lists all expressions and errors to build ProcessStatus instances. It
could be losing time on listing dead stuff.
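
If you do decide to sweep them, a hedged sketch (the "orphan" criterion
here is invented, review it before pointing it at production; note that
the #processes call itself is the expensive part just mentioned):

  # cancel process instances that are stuck on errors
  engine.processes.each do |ps|
    engine.cancel_process(ps.wfid) if ps.errors.any?
  end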

> Currently our ruote database (in MongoDB) is 1.4GB with about 3K 
> schedules and 190K expressions.

3K schedules: that could be the problem. The current default
implementation of #get_schedules is

 
https://github.com/jmettraux/ruote/blob/master/lib/ruote/storage/base.rb#L177-195

As you can see, a potential optimization is commented out. This dumb
implementation goes "let's fetch all the schedules in memory and then
filter the ones that have to get triggered now".

The solution would be, in ruote-mongodb, to override #get_schedules and
do the filtering in MongoDB itself so that those 3K schedules don't get
copied twice per second.
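
For what it's worth, a sketch of such an override (the @db handle and
the collection name are assumptions about your driver's internals; the
'at' string comparison mirrors what the base filtering does):

  # in the MongoDB storage class
  def get_schedules(delta, now)

    nau = Ruote.time_to_utc_s(now)

    # let MongoDB do the filtering, instead of fetching all the
    # schedules and filtering them in Ruby twice per second
    @db['schedules'].find('at' => { '$lte' => nau }).to_a
  end

This works as long as the schedule documents keep their trigger time in
a sortable 'at' string, as the default ruote schedule documents do.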

ruote-couch (sorry, not the fastest), for example, has a dedicated
#get_schedules implementation.

I just checked ruote-redis and it doesn't have any such optimization (I
have to do something about that).

> Our workflows are pretty big - so each expression is fairly large in 
> size. Maybe much of this is cruft, not sure - but I'm curious how our 
> setup compares to others? Is this large, average, very small? Do any 
> of you have experience with DB sizes this big or bigger?

Which JSON library are you using? I'm happy with yajl-ruby; it's faster
than all the others.
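
Ruote goes through rufus-json, so switching should be a one-liner
(backend symbol as per rufus-json):

  require 'yajl'
  require 'rufus-json'

  Rufus::Json.backend = :yajl  # ruote now encodes/decodes via yajl-ruby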

I personally only have experience with much smaller deployments. I'd
rate your shop as big. I wonder how David's Meego team compares.

> How long should it take to launch a workflow of substantial size?

The launch itself is only about copying the whole tree and the initial
workitem fields; it shouldn't be that time-consuming.

So, I'd investigate #get_schedules. An easy first step is to measure how
long the call to get_schedules in lib/ruote/worker.rb takes, and then
try something like "get_schedules = get all the schedules where at <
now".
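
Measuring could be as simple as (storage setup elided):

  require 'benchmark'

  # time one #get_schedules pass
  took = Benchmark.realtime { storage.get_schedules(0.5, Time.now) }

  puts "get_schedules took #{took} seconds"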


I hope this helps,

--
John Mettraux - http://lambda.io/processi

-- 
you received this message because you are subscribed to the "ruote users" group.
to post : send email to [email protected]
to unsubscribe : send email to [email protected]
more options : http://groups.google.com/group/openwferu-users?hl=en
