On Apr 9, 2009, at 11:17 AM, Paul Davis wrote:

Kenneth,

I'm pretty sure your issue is in the reduce steps for the daily and
monthly views. The general rule of thumb is that you shouldn't be
returning data that grows faster than log(#keys processed), whereas I
believe your data is growing linearly with the input.

This particular limitation is a result of the implementation of
incremental reductions. Basically, each key/pointer pair stores the
re-reduced value for all [re-]reduce values in its child nodes. So
as your reduction moves up the tree the data starts exploding, which
kills btree performance, not to mention the extra file I/O.
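
To make that concrete, here's a sketch (mine, not Kenneth's actual
code) of the kind of reduce that triggers it, accumulating one entry
per user so the output grows linearly instead of shrinking:

function(keys, values, rereduce) {
  var acc = {};
  if (rereduce) {
    // values are the per-user objects returned by earlier passes
    for (var i = 0; i < values.length; i++) {
      for (var user in values[i]) {
        acc[user] = (acc[user] || 0) + values[i][user];
      }
    }
  } else {
    // keys are [emitted_key, doc_id] pairs; assume the key is the user
    for (var j = 0; j < keys.length; j++) {
      acc[keys[j][0]] = (acc[keys[j][0]] || 0) + values[j];
    }
  }
  return acc; // one entry per user, stored at every inner btree node
}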

The basic moral of the story is that if you want per-user reduce views
like this, you should emit a [user_id, date] pair as the key and
then query your reduce views with group=true.
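
Something like this, sketched against the field names in Kenneth's
sample document (the split point in the date string is my assumption):

// Map: key on [user, day] so each reduce group stays small.
function(doc) {
  if (doc.service == "pop3d" && doc.action == "LOGOUT") {
    // "2009/03/13 05:47:08 +0000" -> "2009/03/13"
    emit([doc.user, doc.time.substr(0, 10)], parseInt(doc.retr, 10));
  }
}

// Reduce: a plain sum; the output is a single number and never grows.
function(keys, values, rereduce) {
  return sum(values);
}

Query with group=true to get one row per [user, day], or group_level=1
for per-user totals across all days; either way each reduce value is
just a number.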

+1 Paul.

New users hit this problem a lot, and since it manifests as a performance problem, they spend more time than necessary trying to figure out what's wrong. I wonder if there is something we can do to make it more obvious when reduce is used incorrectly? Perhaps a limit (say 1k) on the size of the reduce value, and when it's exceeded a "reduce value too large" error is generated. In the process of investigating the error they'll be more likely to find the documentation that explains what they're doing wrong.
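
Roughly, something like this in the JavaScript view server (the limit,
error name, and hook point are all just placeholders):

// Hypothetical guard run on each (re)reduce result: if the serialized
// value is over ~1k, fail loudly instead of letting the btree quietly
// degrade.
function checkReduceOutput(reduced) {
  var size = JSON.stringify(reduced).length;
  if (size > 1024) {
    throw({
      error: "reduce_value_too_large",
      reason: "reduce output is " + size + " bytes; reduce output " +
              "should shrink, not grow, relative to its input"
    });
  }
  return reduced;
}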

Moving this discussion to d...@. Anyone else have any thoughts or ideas?

-Damien


HTH,
Paul Davis

On Thu, Apr 9, 2009 at 10:25 AM, Kenneth Kalmer
<[email protected]> wrote:
Hi everyone

After months of lurking and reading up on couch I finally got the time to start using it for an internal mail log analyzer. I parse the logs from our Courier-IMAP installation and convert the different lines into documents, and
this has proven to work quite well.

My first task is to extract some metrics from these docs regarding how
often people "pop" their mail, and the returned sizes of each "pop".
The documents in question look like this:

{
  "_id": "0000f68e73f3521f3ee8b3b51e0101d7",
  "_rev": "1-3732031452",
  "user": "[email protected]",
  "host": "pop-5",
  "time": "2009/03/13 05:47:08 +0000",
  "action": "LOGOUT",
  "service": "pop3d",
  "ip": "[10.0.0.1]",
  "top": "0",
  "retr": "0"
}

I've got one design document with 4 views in it. All of them have reduce steps
as well. I've placed all the code in a Gist to keep the mail clean:
http://gist.github.com/92476

Basically I get the following from the different views:

* days - Days and number of activities, used as a key lookup for...
* daily - Total aggregate usage for each user on the day
* months & monthly work the same as the above, except over months

Updating the indexes is incredibly slow, and I have no idea where to begin looking. I suspect my maps are "expensive", but since this is my first shot I'll keep quiet and listen to any advice. By "slow" I mean that on my local development VM (gentoo, couch 0.9, erlang R12B-5, js 1.7) processing
150,000 docs is closing in on 24 hours... On a production site I have
3,300,000 docs and in about 18 hours it has only indexed 264,091 documents (7%). I built the views using only a couple of hundred docs, probably less
than 1,000, and didn't expect this to happen...

From reading other posts in the archives I know the initial index can take a
while, but somehow this just seems a bit ridiculous.

Any advice would be greatly appreciated.

Thanks in advance, and thanks for the awesome tool you guys have built.

Best

--
Kenneth Kalmer
[email protected]
http://opensourcery.co.za

