2009/7/3 Göran Krampe <[email protected]>:
> Hi folks!
>
> We are writing an app using CouchDB where we tried to do some map/reduce to
> calculate "period sums" for about 1000 different "accounts". This is fiscal
> data btw, the system is meant to store detailed fiscal data for about 50000
> companies, for starters. :)
>
> The map function is trivial, it just emits a bunch of "accountNo, amount"
> pairs with "month" as key.
>
> The reduce/rereduce takes these and builds a dictionary (JSON object) with
> "month-accountNo" as key (like "2009/10-2335") and the sum as the value. This
> works fine, yes, it builds up a bit but there is a maximum of account
> numbers and months so it doesn't grow out of control, so that is NOT the
> issue.

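If I'm reading that right, your view is roughly the sketch below. The
document fields (doc.month, doc.accountNo, doc.amount) are guesses on my
part, but the shape is the point:

  // map - one row per posting, "month" as the key
  function (doc) {
    emit(doc.month, [doc.accountNo, doc.amount]);
  }

  // reduce/rereduce - builds a growing {"month-accountNo": sum} object
  function (keys, values, rereduce) {
    var sums = {}, i, k;
    if (rereduce) {
      // values are partial sums objects from lower levels of the tree
      for (i = 0; i < values.length; i++) {
        for (k in values[i]) {
          sums[k] = (sums[k] || 0) + values[i][k];
        }
      }
    } else {
      // keys[i] is [emittedKey, docId]; build keys like "2009/10-2335"
      for (i = 0; i < values.length; i++) {
        k = keys[i][0] + "-" + values[i][0];
        sums[k] = (sums[k] || 0) + values[i][1];
      }
    }
    return sums;
  }
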
There is *no reason ever* to build up a dictionary with more than a small
handful of items in it. E.g. it's OK if your dictionary has this fixed set
of keys: count, total, stddev, avg. It's not OK to do what you are doing.

This is what group_level is for. Rewrite your map/reduce to be correct
(there's a rough sketch of what I mean below my sig) and then we can start
talking about performance.

I don't mean to be harsh, but suggesting you have a performance problem here
is like me complaining that my Ferrari makes a bad boat.

Cheers,
Chris

> Ok, here comes the punchline. When we dump the first 1000 docs using bulk,
> which typically will amount to say 5000 emits - and we "touch" the view to
> trigger it - it will be rather fast and behaves like this:
>
> - a single Erlang process runs and emits all values, then it does a bunch of
> reduces on those values and finally it switches into rereduce mode and does
> those, and then you can see the dictionary "growing" a bit but never too
> much. It is pretty fast, a second or two all in all.
>
> Fine. Then we dump the *next* 1000 docs into Couch and trigger the view
> again. This time it behaves like this (believe it or not):
>
> - two Erlang processes get into play. It seems the same process as above
> continues with emits (IIRC) but a second one starts doing reduce/rereduce
> *while the first one is emitting*. Ouch. And to make it worse - the second
> one seems to gradually "take over" until we only see 2-3 emits followed by
> tons of rereduces (all the way up, I guess, for each emit).
>
> Sooo... evidently Couch decides to do stuff in parallel and starts doing
> reduce/rereduce while emitting here. AFAIK this is not the behavior
> described. The net effect is that the view update that took 1-2 seconds
> suddenly takes 400 seconds or goes to a total crawl and never seems to end.
>
> By looking at the log it obviously processes ONE doc at a time - giving us
> 2-5 emits typically - and then tries to reduce that all the way up to the
> root before processing the next doc. So the rereduces for the internal
> nodes will typically be run 1000x more than needed in this case.
>
> Phew. :) Ok, so we are basically hosed with this behavior in this
> situation. I can only presume this has gone unnoticed because:
>
> a) Updates most of us do are small. But we dump thousands of new docs using
> bulk (a full new fiscal year of data for a given company) so we definitely
> notice it.
>
> b) Most reduce/rereduce functions are very, very fast. So it goes
> unnoticed. Our functions are NOT that fast - but if they were only run as
> they should be (well, presuming they *should* only be run after all the
> emits for all doc changes in a given view update) it would indeed be fast
> anyway. We can see that since the first 1000 docs work fine.
>
> ...and thanks to the people on #couchdb for discussing this with me earlier
> today and looking at the Erlang code to try to figure it out. I think Adam
> Kocoloski and Robert Newson had some idea about it.
>
> regards, Göran
>
> PS. I am on vacation now for 4 weeks, so I will not be answering much
> email. I wanted to get this posted though since it is in some sense a
> rather ... serious performance bottleneck.

--
Chris Anderson
http://jchrisa.net
http://couch.io
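
The sketch I mentioned: put everything you want to group by into a compound
key, keep the reduce value a plain number, and do the per-account/per-month
grouping at query time with group_level. The [accountNo, year, month] key
layout and the field and view names below are my guesses - adapt them to
your documents:

  // map - compound key [accountNo, year, month], value is just the amount
  function (doc) {
    emit([doc.accountNo, doc.year, doc.month], doc.amount);
  }

  // reduce - a number in, a number out; rereduce needs no special case
  // (or use the built-in _sum reduce if your CouchDB has it)
  function (keys, values, rereduce) {
    return sum(values);
  }

Querying e.g. /db/_design/fiscal/_view/period_sums?group_level=3 then gives
one row per account and month, group_level=2 gives yearly sums per account,
and group_level=1 gives a total per account - the grouping lives in the
query, not in the reduce value.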
