On Sat, Nov 12, 2011 at 03:32:31PM -0800, Nathan wrote:
>
> We had an experience this weekend where we attempted to roll out our
> ruote-powered application to a new segment of users. However, we had
> to roll back our efforts pretty quickly because our work item
> processing started taking up to 10 minutes, particularly when creating
> new workflows (as opposed to advancing living workflows, even though
> that slowed to a crawl as well). Our application serves work items in
> real time via a UI to users based on their actions and the workflow
> definitions, so we are hoping for response times of a few seconds at
> most.
>
> On Monday we are going to start picking things apart, trying to figure
> out what is wrong with our setup. We threw together the MongoDB
> storage late last year and have been using it since, but we haven't
> really load tested it, updated it for the latest version of Ruote, or
> tried to make it work with multiple workers. We have noticed however
> that it has a pretty high CPU utilization which has been growing over
> time and now rests at over 50%.

Hello Nathan,

here are a few questions/ideas.

Ruote is polling for msgs and schedules. Could this be the cause? An
engine with an FsStorage polls at least twice per second and typically
uses 0.3% of the CPU (Mac OS X). You should try to determine why yours
is so busy.

Try to put together a single-file script involving the engine, the
MongoDB storage and two or three workflow runs, and measure.

Maybe we could work together on the MongoDB storage (it could give me
an occasion to learn about MongoDB).
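To make that single-file script concrete, here is a minimal sketch of
what I have in mind. It uses the FsStorage that ships with ruote as a
baseline; swap in your MongoDB storage (I don't know its class name, so
it only appears as a comment) to compare the numbers. The participant
is a no-op block so that only ruote plus the storage gets measured.

  require 'rubygems'
  require 'ruote'
  require 'ruote/storage/fs_storage'

  # baseline: the file-system storage that ships with ruote;
  # replace with an instance of your MongoDB storage class to compare
  # (class name is yours, whatever your adapter is called)

  engine = Ruote::Engine.new(
    Ruote::Worker.new(
      Ruote::FsStorage.new('ruote_work')))

  # a participant that does nothing, so we measure ruote + storage only

  engine.register_participant :alpha do |workitem|
    # no-op
  end

  pdef = Ruote.process_definition :name => 'bench' do
    sequence do
      participant 'alpha'
      participant 'alpha'
    end
  end

  3.times do |i|
    t = Time.now
    wfid = engine.launch(pdef)
    engine.wait_for(wfid)
    puts "run #{i}: #{Time.now - t}s"
  end

  engine.shutdown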
> Anyway, the first thing I want to try when troubleshooting this is
> swapping out the MongoDB storage for another storage, preferably Redis
> based on speed. If that works well, then I know the culprit is our
> storage adapter. Otherwise, I'll have to dig deeper. My main question
> is this:
>
> * is there a reasonable way to migrate ruote from one storage to
> another? I'd like to do our test on a copy of the production database.

All the storages have a #copy_to method, whose default implementation is

  https://github.com/jmettraux/ruote/blob/master/lib/ruote/storage/base.rb#L242-267

Here is an example usage:

  https://github.com/jmettraux/ruote/blob/master/test/functional/ft_42_storage_copy.rb

> My next question is pretty broad, so I apologize for that, but
>
> * Are there any known performance bottlenecks or hot spots we should
> be looking at? We will profile of course, but if there are some
> obvious places to put the tip of the chisel that would be great to
> know.

Well, a worker only processes one workflow operation at a time, so with
a single worker only one workflow advances at any given time, with the
exception of the participant work (see
http://groups.google.com/group/openwferu-users/browse_thread/thread/e72368cf72954cd9).

> Also,
>
> * I am guessing we have a number of workflows in the database that are
> "dead" or "orphaned" - workflows and processes that, due to exceptions
> or un-clean resets were never completed or cancelled. Could this
> affect performance in a significant way? Should we routinely attempt
> to clean out orphans?

If you use Engine#processes a lot, then yes: the default implementation
lists all expressions and errors to build ProcessStatus instances, so it
could be losing time listing dead stuff.

> Currently our ruote database (in MongoDB) is 1.4GB with about 3K
> schedules and 190K expressions.

3K schedules: that could be the problem. The current default
implementation of #get_schedules is

  https://github.com/jmettraux/ruote/blob/master/lib/ruote/storage/base.rb#L177-195

As you can see, a potential optimization is commented out. This dumb
implementation goes "let's fetch all the schedules into memory and then
filter the ones that have to get triggered now".

The solution would be, in ruote-mongodb, to override #get_schedules and
do the filtering in MongoDB itself, so that those 3K schedules don't get
copied twice per second. ruote-couch (sorry, not the fastest), for
example, has a dedicated #get_schedules implementation. I just checked
ruote-redis and it doesn't have any such optimization (I have to do
something about that).
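As a rough sketch (not actual ruote-mongodb code, since I don't know
your adapter's internals), such an override could look like this.
"schedules_collection" stands for however your adapter reaches the
MongoDB collection holding the 'schedules' documents, and I'm assuming
the 'at' field is stored as ruote's sortable UTC string
("YYYY-MM-DD hh:mm:ss.uuuuuu UTC") so that a plain string comparison
works; adapt both to what your storage really does.

  # inside your MongoDB storage class

  def get_schedules(delta, now)

    # have MongoDB return only the schedules that are due, instead of
    # shipping all of them to Ruby twice per second

    due = now.utc.strftime('%Y-%m-%d %H:%M:%S') + ' UTC'

    schedules_collection.find('at' => { '$lte' => due }).to_a
  end

An index on 'at' would keep that query cheap, given that it runs a
couple of times per second.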
> Our workflows are pretty big - so each expression is fairly large in
> size. Maybe much of this is cruft, not sure - but I'm curious how our
> setup compares to others? Is this large, average, very small? Do any
> of you have experience with DB sizes this big or bigger?

Which JSON library are you using? I'm happy with yajl-ruby, it's faster
than all the others.

I personally only have experience with much smaller deployments. I'd
rate your shop as big. I wonder how David's Meego team compares.

> How long should it take to launch a workflow of substantial size?

The launch itself is only about copying the whole tree and the initial
workitem fields, so it shouldn't be that time-consuming.

So, I'd investigate #get_schedules. An easy first step is to measure how
long the call to get_schedules in lib/ruote/worker.rb is taking, and
then move to something like "get_schedules = get all the schedules
where at < now".
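For that measuring step, a low-touch option is to wrap the storage
instance before handing it to the worker. This is only a sketch; it
assumes your storage follows StorageBase's two-argument #get_schedules
signature.

  require 'rubygems'
  require 'ruote'
  require 'ruote/storage/fs_storage'

  storage = Ruote::FsStorage.new('ruote_work')
    # or an instance of your MongoDB storage

  # log how long each poll takes and how many schedule documents come back

  class << storage

    alias_method :get_schedules_unmeasured, :get_schedules

    def get_schedules(delta, now)

      t = Time.now
      schedules = get_schedules_unmeasured(delta, now)

      puts "get_schedules: #{schedules.size} schedule(s) in #{Time.now - t}s"

      schedules
    end
  end

  engine = Ruote::Engine.new(Ruote::Worker.new(storage))

If most of the worker's time turns out to be spent there, the
#get_schedules override sketched above is where I'd start.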
I hope this helps,

--
John Mettraux - http://lambda.io/processi