On Apr 9, 2009, at 11:17 AM, Paul Davis wrote:
Kenneth,
I'm pretty sure your issue is in the reduce steps for the daily and
monthly views. The general rule of thumb is that you shouldn't be
returning data that grows faster than log(#keys processed), whereas I
believe your data is growing linearly with input.
This particular limitation is a result of the implementation of
incremental reductions. Basically, each key/pointer pair stores the
re-reduced value for all [re-]reduce values in its child nodes. So
as your reduction moves up the tree the data starts exploding, which
kills btree performance, not to mention the extra file I/O.
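The difference is easy to see in a toy comparison (a hypothetical illustration, not code from this thread): a reduce whose output retains its inputs grows linearly, while one that folds them into a scalar stays constant-size no matter how many rows it covers.

```javascript
// BAD: output keeps every underlying value, so it grows linearly with
// the number of rows reduced -- this is what bloats the inner btree nodes.
function badReduce(keys, values, rereduce) {
  // on rereduce, values are previous reduce outputs (arrays); flatten them
  return [].concat.apply([], values);
}

// GOOD: output is a single number regardless of input size, so the
// value stored at each btree node stays small.
function goodReduce(keys, values, rereduce) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}
```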
The basic moral of the story is that if you want per-user reduce views
like this, you should emit a [user_id, date] pair as the key and
then query your reduce views with group=true.
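A minimal sketch of that approach, based on the sample document Kenneth posted below (the guard conditions and the use of retr as the value are my assumptions, not the gist's actual code):

```javascript
// Map: one row per POP3 session, keyed by [user, day].
function map(doc) {
  if (doc.service === "pop3d" && doc.action === "LOGOUT") {
    var day = doc.time.substr(0, 10);              // e.g. "2009/03/13"
    emit([doc.user, day], parseInt(doc.retr, 10)); // retr: size retrieved (assumption)
  }
}

// Reduce: folds the values into one number, so the stored reduction
// stays constant-size as it moves up the btree.
function reduce(keys, values, rereduce) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}
```

Queried with ?group=true, this yields one row per [user, day] key.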
+1 Paul.
New users hit this problem a lot, and since it manifests as a
performance problem, users spend more time than necessary trying to
figure out what's wrong. I wonder if there is something we can do to
make it more obvious when reduce is used incorrectly? Perhaps a limit
(say 1k) on the size of the reduce value, and when it's exceeded a
"reduce value too large" error is generated. In the process of
investigating the error they'll be more likely to find the
documentation that explains what they're doing wrong.
Moving this discussion to d...@. Anyone else have any thoughts or ideas?
-Damien
HTH,
Paul Davis
On Thu, Apr 9, 2009 at 10:25 AM, Kenneth Kalmer
<[email protected]> wrote:
Hi everyone
After months of lurking and reading up on couch I finally got the time
to start using it for an internal mail log analyzer. I parse the logs
from our Courier-IMAP installation and convert the different lines into
documents, and this has proven to work quite well.
My first task is to extract some metrics from these docs regarding how
often people "pop" their mail, and the returned sizes of each "pop".
The documents in question look like this:
{
  "_id": "0000f68e73f3521f3ee8b3b51e0101d7",
  "_rev": "1-3732031452",
  "user": "[email protected]",
  "host": "pop-5",
  "time": "2009/03/13 05:47:08 +0000",
  "action": "LOGOUT",
  "service": "pop3d",
  "ip": "[10.0.0.1]",
  "top": "0",
  "retr": "0"
}
I've got one design document with four views in it, all of which have
reduce steps as well. I've placed all the code in a Gist to keep the
mail clean: http://gist.github.com/92476
Basically I get the following from the different views:
* days - Days and number of activities, used as a key lookup for...
* daily - Total aggregate usage for each user on the day
* months & monthly work the same as the above, except over months
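A possible simplification (my suggestion, not something from the gist): keying a single view on [user, year, month, day] lets one index answer both granularities, since group_level=4 collapses rows to per-user days and group_level=3 to per-user months.

```javascript
// One view serving both daily and monthly rollups (assumed field names).
function map(doc) {
  if (doc.action === "LOGOUT") {
    var parts = doc.time.split(" ")[0].split("/"); // ["2009", "03", "13"]
    emit([doc.user, parts[0], parts[1], parts[2]], parseInt(doc.retr, 10));
  }
}
// Query with ?group_level=4 for daily totals per user,
// ?group_level=3 for monthly totals per user.
```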
Updating the indexes is incredibly slow, and I have no idea where to
begin looking. I suspect my maps are "expensive", but since this is my
first shot I'll keep quiet and listen to any advice. By "slow" I mean
that on my local development VM (gentoo, couch 0.9, erlang R12B-5,
js 1.7), processing 150,000 docs is closing in on 24 hours... On a
production site I have 3,300,000 docs, and over about 18 hours it has
only indexed 264,091 documents (7%). I built the views using only a
couple of hundred docs, probably fewer than 1,000, and didn't expect
this to happen...
From reading other posts in the archives I know the initial index can
take a while, but somehow this just seems a bit ridiculous.
Any advice would be greatly appreciated.
Thanks in advance, and thanks for the awesome tool you guys have built.
Best
--
Kenneth Kalmer
[email protected]
http://opensourcery.co.za