Zachary,
Awesome. For non-incremental updates, the basic algorithm is simply to
watch for updates to the view and, on each update, delete the review DB,
create a new one, and dump the new data into it. I wouldn't try too hard
to optimize updates at this point.
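
The delete/create/dump cycle described above can be sketched as follows. This is only an illustration in Python: `InMemoryServer` and `rebuild_review_db` are hypothetical names, and the in-memory dictionary stands in for a real CouchDB server so the example runs on its own.

```python
class InMemoryServer:
    """Hypothetical stand-in for a CouchDB server: database name -> list of docs."""

    def __init__(self):
        self.dbs = {}

    def delete_db(self, name):
        # Deleting a database that does not exist is treated as a no-op here.
        self.dbs.pop(name, None)

    def create_db(self, name):
        self.dbs[name] = []

    def bulk_insert(self, name, docs):
        self.dbs[name].extend(docs)


def rebuild_review_db(server, review_name, view_rows):
    """Discard the derived database and refill it from the current view rows."""
    server.delete_db(review_name)
    server.create_db(review_name)
    docs = [{"key": row["key"], "value": row["value"]} for row in view_rows]
    server.bulk_insert(review_name, docs)
    return len(docs)


server = InMemoryServer()
rebuild_review_db(server, "review", [{"key": "a", "value": 1}, {"key": "b", "value": 2}])
# The view changed: the old contents are thrown away rather than patched.
rebuild_review_db(server, "review", [{"key": "a", "value": 5}])
```

Nothing from the previous run survives a rebuild, which is what makes this approach simple and also what makes it non-incremental.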
Getting a ruby script out to show the basics should probably be the
first step. Beyond that we'll have to take it a step at a time.
HTH,
Paul Davis
Zachary Zolton wrote:
@jchris et al,
if you have any pointers on how to implement this, i have a strong
motivation to try my hand at it.
i have a janky ruby script running as an update notifier that looks
for certain criteria, specific to my data, and puts matching docs into
a derived database. but i'm not terribly happy with my current
implementation...
is there a general-purpose algorithm for dealing with updates?
cheers,
zach
On Sun, Apr 26, 2009 at 10:20 PM, Chris Anderson wrote:
On Apr 26, 2009, at 2:26 PM, Wout Mertens wrote:
Hi Adam,
On Apr 22, 2009, at 4:48 PM, Adam Kocoloski wrote:
Hi Wout, thanks for writing this up.
One comment about the map-only views: I think you'll find that Couch has
already done a good bit of the work needed to support them, too. Couch
maintains a btree for each design doc keyed on docid that stores all the
view keys emitted by the maps over each document. When a document is
updated and then analyzed, Couch has to consult that btree, purge all the
KVs associated with the old version of the doc from each view, and then
insert the new KVs. So the tracking information correlating docids and view
keys is already available.
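
The bookkeeping described above can be modelled with a small sketch. It is a toy in Python, not CouchDB's actual Erlang implementation: `ViewIndex`, `keys_by_docid`, and `by_tag` are hypothetical names, and plain dictionaries stand in for the btrees.

```python
class ViewIndex:
    """Toy model of one view index plus the docid -> emitted-keys tracking data."""

    def __init__(self, map_fn):
        self.map_fn = map_fn      # yields (key, value) pairs for a document
        self.rows = {}            # (key, docid) -> value; stands in for the view btree
        self.keys_by_docid = {}   # docid -> emitted keys; stands in for the by-docid btree

    def update_doc(self, docid, doc):
        # Purge every row that the previous version of this document emitted.
        for key in self.keys_by_docid.pop(docid, []):
            self.rows.pop((key, docid), None)
        # Run the map over the new version and insert its rows.
        emitted = list(self.map_fn(doc))
        for key, value in emitted:
            self.rows[(key, docid)] = value
        self.keys_by_docid[docid] = [key for key, _ in emitted]


def by_tag(doc):
    """Example map function: emit one row per tag."""
    for tag in doc.get("tags", []):
        yield tag, 1


index = ViewIndex(by_tag)
index.update_doc("doc1", {"tags": ["couch", "views"]})
# The second version drops "views" and adds "erlang".
index.update_doc("doc1", {"tags": ["couch", "erlang"]})
```

After the second update the stale `views` row is gone, without scanning the whole view, because the per-docid record says exactly which keys to purge.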
See, I did not know that :-) Although I should have guessed.
However, in the mail before this one I argued that it doesn't make sense
to combine or chain map-only views since you can always write a map function
that does it in one step. Do you agree?
You might also know the answer to this: is it possible to make the Review
DB a sort of view index on the current database? All it needs are JSON
keys and values, no other fields.
You'd still be left with the problem of generating unique docids for the
documents in the Review DB, but I think that's a problem that needs to be
solved. The restriction to only MR views with no duplicate keys across
views seems too strong to me.
Well, since the Review DB is a local(*) hidden database that's handled a
bit specially, I think the easiest approach is to assign _id a sequence
number and create a default view that indexes the documents by doc.key (for
updating the value for that key). There will never be contention and we're
only interested in the key index.
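
A minimal sketch of that scheme, assuming an in-memory store: `ReviewDB`, `put`, and `by_key` are hypothetical names, and keys are assumed hashable (a JSON array key would have to be represented as a tuple in this Python model).

```python
import itertools


class ReviewDB:
    """Toy Review DB: sequence-number _ids plus a default index on doc.key."""

    def __init__(self):
        self._seq = itertools.count(1)
        self.docs = {}     # _id -> {"_id": ..., "key": ..., "value": ...}
        self.by_key = {}   # key -> _id; plays the role of the default view on doc.key

    def put(self, key, value):
        # Look the key up in the index so an existing document is updated in place.
        doc_id = self.by_key.get(key)
        if doc_id is None:
            doc_id = next(self._seq)
            self.by_key[key] = doc_id
        self.docs[doc_id] = {"_id": doc_id, "key": key, "value": value}
        return doc_id


review = ReviewDB()
first = review.put("apples", 3)
second = review.put("pears", 7)
again = review.put("apples", 4)   # same key: reuses _id 1 and replaces the value
```

Because every write goes through the key index, the `_id` itself carries no meaning and a plain counter is enough.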
We discussed this a little at CouchHack and I argued that the simplest
solution is actually good for a few reasons.
The simple solution: provide a mechanism to copy the rows of a grouped
reduce function to a new database.
Good because it is most like Hadoop/Google style map reduce. In that
paradigm, the output of a map/reduce job is not incremental, and it is
persisted in a way that allows for multiple later reduce stages to be run on
it. It's common in Hadoop to chain many m/r stages, and to try a few
iterations of each stage while developing code.
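
As a rough illustration of the idea, the sketch below runs a grouped reduce, copies its rows into a new "database", and then runs a second stage over that. Everything is in-memory Python with hypothetical names (`grouped_reduce`, `copy_rows_to_db`); it does not use any CouchDB API.

```python
from collections import defaultdict


def grouped_reduce(docs, map_fn, reduce_fn):
    """Run map over every doc, group the emitted values by key, reduce each group."""
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return [{"key": key, "value": reduce_fn(values)}
            for key, values in sorted(groups.items())]


def copy_rows_to_db(rows):
    """Persist each reduce row as a document in a new (here: in-memory) database."""
    return [dict(row) for row in rows]


source_db = [{"word": "couch"}, {"word": "view"}, {"word": "couch"},
             {"word": "couch"}, {"word": "view"}, {"word": "db"}]

# Stage 1: count occurrences of each word.
counts = grouped_reduce(source_db, lambda doc: [(doc["word"], 1)], sum)
derived_db = copy_rows_to_db(counts)

# Stage 2: an ordinary map-only view over the derived database, keyed by the
# count, which gives the rows sorted by value.
by_count = sorted((doc["value"], doc["key"]) for doc in derived_db)
```

The second stage only sees plain documents, so any further stage can be chained the same way, at the cost of recopying the rows whenever the first stage changes.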
I like this also because it provides the needed functionality without adding
any new primitives to CouchDB.
The only downside of this approach is that it is not incremental. I'm not
sure that incremental chainability has much promise, as the index management
could be a pain, especially if you have branching chains.
Another upside is that by reducing to a db, you give the user power to do
things like use replication to merge multiple data sets before applying more
views.
I don't want to discourage anyone from experimenting with code, just want to
point out this simple solution which would be Very Easy to implement.
(*)local: I'm assuming that views are not replicated and need to be
recalculated for each CouchDB node. If they are replicated somehow, I think
it would still work but we'd have to look at it a little more.
With that said, I'd prefer to spend my time extending the view engine to
handle chainable MR workflows in a single shot. Especially in the simple
sort_by_value case it just seems like a cleaner way to go about things.
Yes, that seems to be the gist of all the replies and I agree :-)
In a nutshell, I'm hoping that:
* A review is a new sort of view that has an "inputs" array in its
definition.
* Only MR views are allowed as inputs, no KV duplication allowed.
* It builds a persistent index of the incoming views when those get
updated.
* That index is then used to build the view index for the review when the
review gets updated.
* I think I covered the most important algorithms needed to implement this
in my original proposal.
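
To make the list above concrete, here is one way the "inputs" array and the no-duplicate-keys rule might look. This is purely a sketch in Python: the shape of `review_definition`, the view names, and `build_input_index` are all hypothetical, not an existing CouchDB feature.

```python
# Hypothetical review definition: only the "inputs" array comes from the proposal.
review_definition = {
    "inputs": ["by_author_count", "by_tag_count"],  # made-up MR view names
}


def build_input_index(definition, view_rows):
    """Merge the reduced rows of all input views into one intermediate index.

    Raises ValueError if two rows share a key, mirroring the proposed
    restriction that no key may be duplicated across the input views.
    """
    index = {}
    for view_name in definition["inputs"]:
        for key, value in view_rows[view_name]:
            if key in index:
                raise ValueError(f"duplicate key {key!r} from view {view_name!r}")
            index[key] = value
    return index


# Made-up reduced output of the two input views.
view_rows = {
    "by_author_count": [("author:wout", 4), ("author:adam", 2)],
    "by_tag_count": [("tag:views", 6)],
}
input_index = build_input_index(review_definition, view_rows)
```

The review's own map and reduce would then run over `input_index` rather than over the original documents.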
Does this sound feasible? If so I'll update my proposal accordingly.
Wout.