On Jul 3, 2009, at 6:37 PM, Chris Anderson wrote:
2009/7/3 Göran Krampe <[email protected]>:
Hi folks!
We are writing an app using CouchDB where we tried to do some map/reduce to calculate "period sums" for about 1000 different "accounts". This is fiscal data, btw; the system is meant to store detailed fiscal data for about 50,000 companies, for starters. :)
The map function is trivial: it just emits a bunch of "accountNo, amount" pairs with "month" as the key.

The reduce/rereduce take these and build a dictionary (JSON object) with "month-accountNo" as the key (like "2009/10-2335") and the sum as the value. This works fine; yes, it builds up a bit, but there is a maximum number of account numbers and months, so it doesn't grow out of control. That is NOT the issue.
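Roughly, the pattern being described looks like this (a sketch only; doc.month, doc.accountNo and doc.amount are guessed field names, not the actual functions):

function map(doc) {
  // one emit per posting: month as the key, account and amount as the value
  emit(doc.month, {accountNo: doc.accountNo, amount: doc.amount});
}

function reduce(keys, values, rereduce) {
  // builds one dictionary keyed "month-accountNo" holding running sums
  var sums = {};
  if (rereduce) {
    for (var i = 0; i < values.length; i++) {
      var partial = values[i];
      for (var k in partial) sums[k] = (sums[k] || 0) + partial[k];
    }
  } else {
    for (var j = 0; j < values.length; j++) {
      var key = keys[j][0] + "-" + values[j].accountNo;  // e.g. "2009/10-2335"
      sums[key] = (sums[key] || 0) + values[j].amount;
    }
  }
  return sums;
}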
There is *no reason ever* to build up a dictionary with more than a small handful of items in it. E.g., it's OK if your dictionary has this fixed set of keys: count, total, stddev, avg.

It's not OK to do what you are doing. This is what group_level is for. Rewrite your map/reduce to be correct and then we can start talking about performance.
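Concretely, the usual CouchDB shape for this kind of per-period sum is something like the following (again a sketch with the same guessed field names; the key change is a composite key plus a scalar reduce value):

function map(doc) {
  // composite key: group_level=1 gives per-month totals,
  // group=true gives per-month-per-account totals
  emit([doc.month, doc.accountNo], doc.amount);
}

function reduce(keys, values, rereduce) {
  // values are either emitted amounts or partial sums; either way, add them up
  var total = 0;
  for (var i = 0; i < values.length; i++) total += values[i];
  return total;
}

Queried with ?group=true (or ?group_level=1 for per-month totals across all accounts), each row comes back as one small pair, e.g. {"key": ["2009/10", 2335], "value": 1234.56}, so no single reduce value ever holds the whole dictionary.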
I don't mean to be harsh, but suggesting you have a performance problem here is like me complaining that my Ferrari makes a bad boat.
Cheers,
Chris
Wow, that was unusually harsh coming from you, Chris. Taking a closer look at Göran's map and reduce functions I agree that they should be reworked to make use of group=true, but nevertheless I wonder if we do have something to work on here.
I believe Göran's problem was that the second pass was causing the view updater process to use a significant amount of memory and trigger should_flush() immediately. As a result, view KVs were being written to disk after every document (triggering the reduce/rereduce step).

This is fantastically inefficient. If the penalty for flushing results to disk during indexing is so severe, perhaps we want to be a little more flexible in imposing it. There could be very legitimate cases where users with large documents and/or sophisticated workflows are hung out to dry during indexing because the view updater wants a measly 11MB of memory to do its job.
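To make the mechanics concrete, here is a rough JavaScript sketch of the idea (this is not the actual Erlang in the view updater, and whether the check is on process memory or buffered bytes is an internal detail; the point is only that every flush modifies the btree and therefore runs reduce/rereduce):

function updateView(docs, mapFun, writeToBtree, thresholdBytes) {
  // Buffer emitted KVs and write them into the view btree whenever the
  // buffer crosses a size threshold. Each writeToBtree() call is where
  // reduce/rereduce runs for every node on the modified path.
  var buffer = [];
  var bufferedBytes = 0;
  docs.forEach(function (doc) {
    mapFun(doc).forEach(function (kv) {
      buffer.push(kv);
      bufferedBytes += JSON.stringify(kv).length;
    });
    if (bufferedBytes > thresholdBytes) {   // the should_flush() decision
      writeToBtree(buffer);                 // reduce/rereduce happens here
      buffer = [];
      bufferedBytes = 0;
    }
  });
  if (buffer.length > 0) writeToBtree(buffer);
}

With a huge reduce value the threshold is crossed after almost every document, so you pay the rereduce cost per document instead of once per batch.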
Adam
Ok, here comes the punchline. When we dump the first 1000 docs using bulk - which typically will amount to, say, 5000 emits - and we "touch" the view to trigger it (the HTTP calls are sketched below), it is rather fast and behaves like this:

- a single Erlang process runs and emits all values, then it does a bunch of reduces on those values, and finally it switches into rereduce mode and does those; you can see the dictionary "growing" a bit but never too much. It is pretty fast, a second or two all in all.
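For reference, the load-and-touch cycle over HTTP is roughly this (database, design doc and view names are placeholders; fetch is only used to keep the sketch runnable):

const base = "http://localhost:5984/fiscal";

async function loadBatchAndTouchView(docs) {
  // 1. bulk-insert a batch of documents
  await fetch(base + "/_bulk_docs", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ docs: docs })
  });

  // 2. "touch" the view; indexes are built lazily, so this GET is what
  //    actually kicks off the map/reduce work for the newly inserted docs
  const resp = await fetch(base + "/_design/app/_view/period_sums?group=true");
  return resp.json();
}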
Fine. Then we dump the *next* 1000 docs into Couch and trigger the view again. This time it behaves like this (believe it or not):

- two Erlang processes get into play. It seems the same process as above continues with emits (IIRC), but a second one starts doing reduce/rereduce *while the first one is emitting*.
This is actually by design.
Ouch. And to make it worse - the second one seems to gradually "take over" until we only see 2-3 emits followed by tons of rereduces (all the way up, I guess, for each emit).
This is not.
Sooo... evidently Couch decides to do stuff in parallel and starts doing reduce/rereduce while emitting here. AFAIK this is not the behavior described.
Not sure if it's described, but it is by design. The reduce function executes when the btree is modified. We can't afford to cache KVs from an index update in memory regardless of size; we have to set some threshold at which we flush them to disk.

I think the fundamental question is why the flush operations were occurring so frequently the second time around. Is it because you were building up a largish hash for the reduce value? Probably. Nevertheless, I'd like to have a better handle on that.
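(Back of the envelope, with only illustrative numbers: ~1000 accounts times, say, 12 months gives up to ~12,000 keys in the dictionary. At a few tens of bytes per entry that is a few hundred KB per reduce value, and a copy of it sits at every inner btree node touched on each flush, so a memory budget on the order of 10MB is exhausted almost immediately.)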
Adam
The net effect is that the view update that took 1-2 seconds suddenly takes 400 seconds, or goes to a total crawl and never seems to end.

By looking at the log it obviously processes ONE doc at a time - giving us 2-5 emits typically - and then tries to reduce that all the way up to the root before processing the next doc. So the rereduces for the internal nodes will, in this case, be run typically 1000x more often than needed.
Phew. :) Ok, so we are basically hosed with this behavior in this situation.

I can only presume this has gone unnoticed because:

a) The updates most of us do are small. But we dump thousands of new docs using bulk (a full new fiscal year of data for a given company), so we definitely notice it.

b) Most reduce/rereduce functions are very, very fast, so it goes unnoticed.
Our functions are NOT that fast - but if they were only run as often as they should be (well, presuming they *should* only be run after all the emits for all doc changes in a given view update), it would indeed be fast anyway. We can see that since the first 1000 docs work fine.
...and thanks to the people on #couchdb for discussing this with me earlier today and looking at the Erlang code to try to figure it out. I think Adam Kocoloski and Robert Newson had some idea about it.

regards, Göran
PS. I am on vacation now for 4 weeks, so I will not be answering much email. I wanted to get this posted though, since it is in some sense a rather ... serious performance bottleneck.