Intro
=====
How do you sort by reduce value? How do you join views? How do you
get
unique view results? How do you cache group key reduces?
I think that with the below proposed solution all the above and
more are
possible. The general idea is to store view results and run map/
reduce on
them. There's been some discussions about this but they went
nowhere. I've
been thinking about this issue a bit and I think it can be done.
I'd like to call this feature a Review DB.
Use cases
=========
- Suppose you want to know what tags are most popular on your
blog. Simply
get:
http://couchdb/db/_design/myblog/_review/tags_by_count/_view/sort_by_value
Where tags_by_count is a Review DB that gets input from the
tagcount view
and then runs the sort_by_value view on it, a map() function that
simply
emits (value,key).
Likewise, show pages in order of popularity, whereby user can vote
up (+1)
or down (-1):
http://couchdb/db/_design/mywiki/_review/pagevotes/_view/sort_by_value
- Given documents with attributes title, date and tags. You'd like
to know
the minimum value of date and a breakdown by count for tags, for
every
title. Normally you'd use 2 map+reduce views,
minimum_date_by_title and
tagcount_by_title, which you would then query separately. With a
Review DB,
you can let both views insert their results in the database and
then run a
view that combines the results in one view:
http://couchdb/db/_design/mybookstore/_review/mybooks/_view/aggregate_book_data
- This is not a way to run an on-the-fly map/reduce on a subset of
a view,
like if you want to find the median popularity score of
restaurants with
"Tony" in their name that are close to you.
Implementation
==============
A Review DB is a hidden database maintained by CouchDB with these
fields:
- _id of document is the string representation of the key
- "key" is the key of the incoming view row (unique)
- "value" is the value of the incoming view row
I hope that this is sufficiently like a normal view that it can be
stored
as a normal view. _id is just there to make it doc-compliant, it
would be
much better if "key" were the actual key.
A Review DB is defined in a design document like normal views.
Each review
is an entry in the "reviews" hash, and has a "incoming_views"
array that
lists all the views that should insert results in the review db
plus the
group level, as well as a normal "views" hash for further map/
reduce of the
review db (and perhaps another "reviews" hash for further result
processing?).
Maintaining a database of results means that results have to be
updated or
even removed when documents change. I tried to make this work (in
theory)
for map-only views, but the resulting requirements are quite
messy. You
either need to cache the previous results of a view for each
document, or
you have to have an old version of the document available to
regenerate
those results.
Therefore, a Review DB only accepts results from one or more map
+reduce
views. You define beforehand what the group_level of the keys is
that will
be inserted.
Furthermore, a Review DB disallows (but doesn't enforce) having 2
views
that generate the same keys. Otherwise, refcounting would need to
be used
and while that's not difficult, I think there's limited value in
allowing
this.
The Review DB needs updating every time the reduction for a group
key of
one of the participating views gets updated. Even though a map
+reduce view
has unique keys, we need a refcount since we have multiple views.
Whoever
got to insert its value last wins.
There is a slight complication: group key values are calculated on-
the-fly
from the view result b-tree. So whenever a reduce call results in
a new
value for a b-tree node, AND that node is the upper node of a
subtree that
is completely part of a group key, that group key needs to be
marked for
recalculation.
Likewise, if deletion/addition of a b-tree node results in the
removal/creation of the sole upper node of a group key subtree,
that group
key needs to be marked for removal/addition.
This is the algorithm:
- When a reducing view gets updated, and it is part of a Review
DB, use
the 2 paragraphs above to keep a list of group keys that need
handling
- After updating the reduce() results, for each of the marked
group keys:
- If a group key gets removed:
- look up doc with key=group key in review db. If exists:
- delete doc
- If a group key gets added:
- look up doc with key=group key in review db. If exists:
- set doc.value to the row value
- else
- create doc with id=group key in string form, key=group key,
value=value
- If a group key gets updated:
- look up doc with key=group key in review db. If exists:
- set doc.value to the row value
- else
- create doc with id=group key in string form, key=group key,
value=value
As you can see, this is something CouchDB should do since it knows
when
it's updating group key reduction values and it knows if this was
an delete,
update or addition.
View updates are done when the view is called; Review updates are
done at
this time as well. Views on Review DBs are done when they are
called.
Summary
=======
Review DBs are a sort of view index that CouchDB can maintain with
little
overhead. It caches group key results and allows chained map+reduce
calculations using mostly existing frameworks.
I think this would be a very useful feature for CouchDB to have.
There are
regularly requests for storing view results in a database for
post-processing on the mailing lists.
I'm not saying this is a trivial change but it doesn't seem
technically
impossible to me either. (unless I missed something again; this is
the 5th
iteration of this proposal. Anyway I know *I* wouldn't be able to
code this
:-) )
What do you think, oh dear devs?
Wout.