Re: Some guidance with extremely slow indexing

Zachary Zolton Thu, 09 Apr 2009 13:06:46 -0700

Perhaps we need a section of the wiki devoted to these kinds of best practices.


Would someone with a strong understanding of this issue care to elucidate?

On Thu, Apr 9, 2009 at 2:45 PM, Damien Katz <[email protected]> wrote:
>
> On Apr 9, 2009, at 11:17 AM, Paul Davis wrote:
>
>> Kenneth,
>>
>> I'm pretty sure your issue is in the reduce steps for the daily and
>> monthly views. The general rule of thumb is that you shouldn't be
>> returning data that grows faster than log(#keys processed), whereas I
>> believe your data is growing linearly with input.
>>
>> This particular limitation is a result of the implementation of
>> incremental reductions. Basically, each key/pointer pair stores the
>> re-reduced value for all [re-]reduce values in its child nodes. So
>> as your reduction moves up the tree, the data starts exploding, which
>> kills btree performance, not to mention the extra file I/O.
>>
>> The basic moral of the story is that if you want reduce views like
>> this per user you should emit a [user_id, date] pair as the key and
>> then call your reduce views with group=true.
>
> +1 Paul.
>
> New users hit this problem a lot, and since it manifests as a performance
> problem, users spend more time than necessary trying to figure out what's
> wrong. I wonder if there is something we can do to make it more obvious when
> reduce is used incorrectly? Perhaps a limit (say 1k) on the size of the
> reduce value, and when it's exceeded a "reduce value too large" error is
> generated. In the process of investigating the error, they'll be more likely
> to find the documentation that explains what they're doing wrong.
>
> Moving this discussion to d...@. Anyone else have any thoughts or ideas?
>
> -Damien
>
>>
>> HTH,
>> Paul Davis
>>
>> On Thu, Apr 9, 2009 at 10:25 AM, Kenneth Kalmer
>> <[email protected]> wrote:
>>>
>>> Hi everyone
>>>
>>> After months of lurking and reading up on couch I finally got the time to
>>> start using it for an internal mail log analyzer. I parse the logs from our
>>> Courier-IMAP installation and convert the different lines into documents,
>>> and this has proven to work quite well.
>>>
>>> My first task is to extract some metrics from these docs regarding how
>>> often people "pop" their mail, and the returned sizes of each "pop".
>>> Documents in question look like this:
>>>
>>> {
>>>  "_id": "0000f68e73f3521f3ee8b3b51e0101d7",
>>>  "_rev": "1-3732031452",
>>>  "user": "[email protected]",
>>>  "host": "pop-5",
>>>  "time": "2009/03/13 05:47:08 +0000",
>>>  "action": "LOGOUT",
>>>  "service": "pop3d",
>>>  "ip": "[10.0.0.1]",
>>>  "top": "0",
>>>  "retr": "0"
>>> }
>>>
>>> I've got one design document with 4 views in it. All of them have reduce
>>> steps as well. I've placed all the code in a Gist to keep the mail clean:
>>> http://gist.github.com/92476
>>>
>>> Basically I get the following from the different views:
>>>
>>> * days - Days and number of activities, used as a key lookup for...
>>> * daily - Total aggregate usage for each user on the day
>>> * months & monthly work the same as the above, except over months
>>>
>>> Updating the indexes is incredibly slow, and I have no idea where to begin
>>> looking. I suspect my maps are "expensive", but since this is my first shot
>>> I'll keep quiet and listen to any advice. By "slow" I mean that on my
>>> local development VM (gentoo, couch 0.9, erlang R12B-5, js 1.7) processing
>>> 150,000 docs is closing in on 24 hours... On a production site I have
>>> 3,300,000 docs and over about 18 hours it has only indexed 264,091 documents
>>> (7%). I built the views using only a couple of hundred docs, probably less
>>> than 1,000, and didn't expect this to happen...
>>>
>>> From reading other posts in the archives I know the initial index can take a
>>> while, but somehow this just seems a bit ridiculous.
>>>
>>> Any advice would be greatly appreciated.
>>>
>>> Thanks in advance, and thanks for the awesome tool you guys have built.
>>>
>>> Best
>>>
>>> --
>>> Kenneth Kalmer
>>> [email protected]
>>> http://opensourcery.co.za
>>>
>
>
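
For reference, here is a minimal sketch of the approach Paul describes above:
emit a [user, date] key from the map, keep the reduce value a single number,
and query with group=true. The field choices (totalling "retr", filtering on
LOGOUT rows from pop3d) are assumptions based on the sample document, not the
views from the Gist, and the emit/queryGrouped harness and the example.com
users are hypothetical stand-ins so the functions can be run outside CouchDB.

```javascript
// Local stand-in for CouchDB's emit(); inside CouchDB the view server
// provides emit and this harness is not needed.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map: one row per logout document, keyed by [user, day].
// Assumption: "retr" is the number to total. It is stored as a string in
// the sample document, so it is converted here.
function mapFn(doc) {
  if (doc.service === "pop3d" && doc.action === "LOGOUT") {
    var day = doc.time.slice(0, 10); // e.g. "2009/03/13"
    emit([doc.user, day], parseInt(doc.retr, 10) || 0);
  }
}

// Reduce: a plain sum. The result is a single number no matter how many
// rows were processed, and the same code is valid for the rereduce pass.
function reduceFn(keys, values, rereduce) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}

// Hypothetical helper imitating a group=true query: one reduced row per
// distinct key, sorted by the serialized key.
function queryGrouped(docs) {
  rows = [];
  docs.forEach(function (doc) { mapFn(doc); });
  var groups = {};
  rows.forEach(function (row) {
    var k = JSON.stringify(row.key);
    (groups[k] = groups[k] || []).push(row.value);
  });
  return Object.keys(groups).sort().map(function (k) {
    return { key: JSON.parse(k), value: reduceFn(null, groups[k], false) };
  });
}

// Made-up sample data in the shape of the document quoted above.
var sampleDocs = [
  { user: "alice@example.com", time: "2009/03/13 05:47:08 +0000",
    action: "LOGOUT", service: "pop3d", retr: "3" },
  { user: "alice@example.com", time: "2009/03/13 09:10:00 +0000",
    action: "LOGOUT", service: "pop3d", retr: "2" },
  { user: "alice@example.com", time: "2009/03/14 07:00:00 +0000",
    action: "LOGOUT", service: "pop3d", retr: "1" },
  { user: "bob@example.com", time: "2009/03/13 06:00:00 +0000",
    action: "LOGOUT", service: "pop3d", retr: "4" }
];

console.log(JSON.stringify(queryGrouped(sampleDocs)));
```

In CouchDB itself the two functions would live in a design document and the
view would be queried with group=true, so each [user, day] key gets its own
small total instead of one reduce value that grows with the input.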
