Perhaps we need a section of the wiki devoted to these kinds of best practices.
Would someone with a strong understanding of this issue care to elucidate?

On Thu, Apr 9, 2009 at 2:45 PM, Damien Katz <[email protected]> wrote:
>
> On Apr 9, 2009, at 11:17 AM, Paul Davis wrote:
>
>> Kenneth,
>>
>> I'm pretty sure your issue is in the reduce steps for the daily and
>> monthly views. The general rule of thumb is that you shouldn't be
>> returning data that grows faster than log(#keys processed), whereas I
>> believe your data is growing linearly with input.
>>
>> This particular limitation is a result of the implementation of
>> incremental reductions. Basically, each key/pointer pair stores the
>> re-reduced value for all [re-]reduce values in its children nodes. So
>> as your reduction moves up the tree the data starts exploding, which
>> kills btree performance, not to mention the extra file I/O.
>>
>> The basic moral of the story is that if you want reduce views like
>> this per user you should emit a [user_id, date] pair as the key and
>> then call your reduce views with group=true.
>
> +1 Paul.
>
> New users hit this problem a lot, and since it manifests as a performance
> problem, users spend more time than necessary trying to figure out what's
> wrong. I wonder if there is something we can do to make it more obvious when
> reduce is used incorrectly? Perhaps a limit (say 1k) on the size of the
> reduce value, and when it's exceeded a "reduce value too large" error is
> generated. In the process of investigating the error they'll be more likely
> to find the documentation that explains what they're doing wrong.
>
> Moving this discussion to d...@. Anyone else have any thoughts or ideas?
>
> -Damien
>
>>
>> HTH,
>> Paul Davis
>>
>> On Thu, Apr 9, 2009 at 10:25 AM, Kenneth Kalmer
>> <[email protected]> wrote:
>>>
>>> Hi everyone
>>>
>>> After months of lurking and reading up on couch I finally got the time to
>>> start using it for an internal mail log analyzer. I parse the logs from
>>> our Courier-IMAP installation and convert the different lines into
>>> documents, and this has proven to work quite well.
>>>
>>> My first task is to extract some metrics from these docs regarding how
>>> often people "pop" their mail, and the returned sizes of each "pop".
>>> Documents in question look like this:
>>>
>>> {
>>>   "_id": "0000f68e73f3521f3ee8b3b51e0101d7",
>>>   "_rev": "1-3732031452",
>>>   "user": "[email protected]",
>>>   "host": "pop-5",
>>>   "time": "2009/03/13 05:47:08 +0000",
>>>   "action": "LOGOUT",
>>>   "service": "pop3d",
>>>   "ip": "[10.0.0.1]",
>>>   "top": "0",
>>>   "retr": "0"
>>> }
>>>
>>> I've got one design document with 4 views in it. All of them have reduce
>>> steps as well. I've placed all the code in a Gist to keep the mail clean:
>>> http://gist.github.com/92476
>>>
>>> Basically I get the following from the different views:
>>>
>>> * days - Days and number of activities, used as a key lookup for...
>>> * daily - Total aggregate usage for each user on the day
>>> * months & monthly work the same as the above, except over months
>>>
>>> Updating the indexes is incredibly slow, and I have no idea where to
>>> begin looking. I suspect my maps are "expensive", but since this is my
>>> first shot I'll keep quiet and listen to any advice. By "slow" I mean
>>> that on my local development VM (gentoo, couch 0.9, erlang R12B-5,
>>> js 1.7) processing 150,000 docs is closing in on 24 hours... On a
>>> production site I have 3,300,000 docs and over about 18 hours it has
>>> only indexed 264,091 documents (7%). I built the views using only a
>>> couple of hundred docs, probably less than 1,000, and didn't expect
>>> this to happen...
>>>
>>> From reading other posts in the archives I know the initial index can
>>> take a while, but somehow this just seems a bit ridiculous.
>>>
>>> Any advice would be greatly appreciated.
>>>
>>> Thanks in advance, and thanks for the awesome tool you guys have built.
>>>
>>> Best
>>>
>>> --
>>> Kenneth Kalmer
>>> [email protected]
>>> http://opensourcery.co.za
>>>
>
>
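
As a starting point for such a wiki section, Paul's suggestion (emit a [user_id, date] pair as the key, then query the view with group=true) might look roughly like the sketch below. The field names come from the sample document in the thread. Everything else is an assumption made here for illustration and is not the code from the Gist: counting only LOGOUT records, the [count, retr] value shape, and the small emit/queryGrouped harness that stands in for CouchDB so the example runs under Node.

```javascript
// Sketch only. Field names (user, time, action, retr) are taken from the
// sample document in the thread; the rest is hypothetical.

// Map: one row per LOGOUT record, keyed by [user, "YYYY/MM/DD"].
function map(doc) {
  if (doc.action === "LOGOUT") {
    var day = doc.time.substring(0, 10); // "2009/03/13 05:47:08 +0000" -> "2009/03/13"
    emit([doc.user, day], [1, parseInt(doc.retr, 10) || 0]);
  }
}

// Reduce: always returns a two-element array [count, retr total], so the
// reduce value does not grow with the number of rows processed. The same
// summing is valid for rereduce, because the inputs are then earlier
// [count, retr total] pairs.
function reduce(keys, values, rereduce) {
  var count = 0;
  var retr = 0;
  for (var i = 0; i < values.length; i++) {
    count += values[i][0];
    retr += values[i][1];
  }
  return [count, retr];
}

// --- Local stand-in for the view engine (not part of CouchDB) ---
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Rough imitation of querying the view with group=true: rows with an
// identical key are reduced together. Ordering here is plain string order
// of the JSON-encoded key, not CouchDB's collation.
function queryGrouped(docs) {
  rows = [];
  docs.forEach(map);
  var groups = {};
  rows.forEach(function (row) {
    var k = JSON.stringify(row.key);
    (groups[k] = groups[k] || []).push(row.value);
  });
  return Object.keys(groups).sort().map(function (k) {
    return { key: JSON.parse(k), value: reduce(null, groups[k], false) };
  });
}
```

In a real design document only the map and reduce functions would be stored, and the per-user, per-day totals would come from requesting the view with group=true. The point of the design is that the grouping lives in the key, so each reduce value stays a constant size instead of accumulating a per-user structure that grows linearly with the input.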
