On Apr 9, 2009, at 11:17 AM, Paul Davis wrote:

Kenneth,

I'm pretty sure your issue is in the reduce steps for the daily and
monthly views. The general rule of thumb is that you shouldn't be
returning data that grows faster than log(#keys processed), whereas I
believe your data is growing linearly with the input.

This particular limitation is a result of the implementation of
incremental reductions. Basically, each key/pointer pair stores the
re-reduced value for all [re-]reduce values in its child nodes. So
as your reduction moves up the tree the data starts exploding, which
kills btree performance, not to mention the extra file I/O.
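
To make that concrete, here's a sketch (mine, not Kenneth's actual
code) of the kind of reduce that triggers it, accumulating one entry
per user so the output grows linearly instead of shrinking:

function(keys, values, rereduce) {
  var acc = {};
  if (rereduce) {
    // values are the per-user objects returned by earlier passes
    for (var i = 0; i < values.length; i++) {
      for (var user in values[i]) {
        acc[user] = (acc[user] || 0) + values[i][user];
      }
    }
  } else {
    // keys are [emitted_key, doc_id] pairs; assume the key is the user
    for (var j = 0; j < keys.length; j++) {
      acc[keys[j][0]] = (acc[keys[j][0]] || 0) + values[j];
    }
  }
  return acc; // one entry per user, stored at every inner btree node
}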

The basic moral of the story is that if you want per-user reduce views
like this, you should emit a [user_id, date] pair as the key and
then query your reduce views with group=true.
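
Something like this, sketched against the field names in Kenneth's
sample document (the split point in the date string is my assumption):

// Map: key on [user, day] so each reduce group stays small.
function(doc) {
  if (doc.service == "pop3d" && doc.action == "LOGOUT") {
    // "2009/03/13 05:47:08 +0000" -> "2009/03/13"
    emit([doc.user, doc.time.substr(0, 10)], parseInt(doc.retr, 10));
  }
}

// Reduce: a plain sum; the output is a single number and never grows.
function(keys, values, rereduce) {
  return sum(values);
}

Query with group=true to get one row per [user, day], or group_level=1
for per-user totals across all days; either way each reduce value is
just a number.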

+1 Paul.

New users hit this problem a lot, and since it manifests as a performance problem, they spend more time than necessary trying to figure out what's wrong. I wonder if there is something we can do to make it more obvious when reduce is used incorrectly? Perhaps a limit (say 1k) on the size of the reduce value, and when it's exceeded a "reduce value too large" error is generated. In the process of investigating the error they'll be more likely to find the documentation that explains what they're doing wrong.
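
Roughly, something like this in the JavaScript view server (the limit,
error name, and hook point are all just placeholders):

// Hypothetical guard run on each (re)reduce result: if the serialized
// value is over ~1k, fail loudly instead of letting the btree quietly
// degrade.
function checkReduceOutput(reduced) {
  var size = JSON.stringify(reduced).length;
  if (size > 1024) {
    throw({
      error: "reduce_value_too_large",
      reason: "reduce output is " + size + " bytes; reduce output " +
              "should shrink, not grow, relative to its input"
    });
  }
  return reduced;
}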

Moving this discussion to d...@. Anyone else have any thoughts or ideas?

-Damien


HTH,
Paul Davis

On Thu, Apr 9, 2009 at 10:25 AM, Kenneth Kalmer
<[email protected]> wrote:
Hi everyone

After months of lurking and reading up on couch I finally got the time to start using it for an internal mail log analyzer. I parse the logs from our Courier-IMAP installation and convert the different lines into documents, and
this has proven to work quite well.

My first task is to extract some metrics from these docs regarding how
often people "pop" their mail, and the returned sizes of each "pop".
The documents in question look like this:

{
  "_id": "0000f68e73f3521f3ee8b3b51e0101d7",
  "_rev": "1-3732031452",
  "user": "[email protected]",
  "host": "pop-5",
  "time": "2009/03/13 05:47:08 +0000",
  "action": "LOGOUT",
  "service": "pop3d",
  "ip": "[10.0.0.1]",
  "top": "0",
  "retr": "0"
}

I've got one design document with 4 views in it. All of them have reduce steps
as well. I've placed all the code in a Gist to keep the mail clean:
http://gist.github.com/92476

Basically I get the following from the different views:

* days - Days and number of activities, used as a key lookup for...
* daily - Total aggregate usage for each user on the day
* months & monthly work the same as the above, except over months

Updating the indexes is incredibly slow, and I have no idea where to begin looking. I suspect my maps are "expensive", but since this is my first shot I'll keep quiet and listen to any advice. By "slow" I mean that on my local development VM (gentoo, couch 0.9, erlang R12B-5, js 1.7) processing
150,000 docs is closing in on 24 hours... On a production site I have
3,300,000 docs and in about 18 hours it has only indexed 264,091 documents (7%). I built the views using only a couple of hundred docs, probably less
than 1,000, and didn't expect this to happen...

From reading other posts in the archives I know the initial index can take a
while, but somehow this just seems a bit ridiculous.

Any advice would be greatly appreciated.

Thanks in advance, and thanks for the awesome tool you guys have built.

Best

--
Kenneth Kalmer
[email protected]
http://opensourcery.co.za

