On Apr 28, 2009, at 11:42 AM, Brian Candler wrote:

On Sun, Apr 26, 2009 at 11:26:33PM +0200, Wout Mertens wrote:
In a nutshell, I'm hoping that:
* A review is a new sort of view that has an "inputs" array in its
definition.
* Only MR views are allowed as inputs, no KV duplication allowed.
* It builds a persistent index of the incoming views when those get
updated.
* That index is then used to build the view index for the review when
the review gets updated.
* I think I covered the most important algorithms needed to implement
this in my original proposal.

Does this sound feasible? If so I'll update my proposal accordingly.

Could you define a bit more where the "inputs" array comes from?

Oh right, I didn't explain that well, and I never said how to add the group_level. Here's an example for counting the number of unique names in a database:

{
  "_id": "_design/foo",
  "views": {
      "unique_names": {
          "map": "function(doc) {emit(doc.name, 1);}",
"reduce": "function(keys, values, rereduce) {return sum(values);}"
     }
      "unique_names_count": {
          "inputs": {"unique_names":1},
          "map": "function(doc) {emit(doc.key, 1);}",
"reduce": "function(keys, values, rereduce) {return sum(values);}"
     }
  }
}

In this case, unique_names_count is the review db. It gets input from the unique_names view with group_level set to 1. Note that it's a hash since each view only gets added once and each view has a unique name.

(BTW, I can't figure out what to set group_level to if I want the equivalent of group=true. group_level=exact doesn't work although group_level=1000 might be good enough. How about allowing group_level=-1?)

(Also, in this case just "unique_names_count": {"inputs": {"unique_names":1}} would suffice since getting that view with limit=0 will give you the total_rows number which is what was needed)

Furthermore, if I understand rightly, these grouped reduce values are *not* persisted anywhere (*). That is: every time you do a reduce=true&group=true query then the entire map index is scanned for distinct keys and then each
set of values with equal keys is passed afresh to the reduce function.

This means that it won't be possible to do incremental changes to the review DB, since these grouped keys and reduce values aren't associated with the docids they came from. You'd have to calculate it all from scratch every time, in which case you might as well just get the client to do it. AFAICS, CouchDB could cache the result only when the database is not receiving any
updates.

You're correct in that group reduce values are not persisted, however the algorithm I proposed can do incremental updates nonetheless:
=========
Group key values are calculated on-the-fly from the view result b- tree. So whenever a reduce call results in a new value for a b-tree node, AND that node is the upper node of a subtree that is completely part of a group key, that group key needs to be marked for recalculation.

Likewise, if deletion/addition of a b-tree node results in the removal/ creation of the sole upper node of a group key subtree, that group key needs to be marked for removal/addition.

This is the algorithm:
- When a reducing view gets updated, and it is part of a Review DB, use the 2 paragraphs above to keep a list of group keys that need handling - After updating the reduce() results, for each of the marked group keys:
- If a group key gets removed:
  - look up doc with key=group key in review db. If exists:
    - delete doc
- If a group key gets added:
  - look up doc with key=group key in review db. If exists:
    - set doc.value to the row value
  - else
- create doc with id=group key in string form, key=group key, value=value
- If a group key gets updated:
  - look up doc with key=group key in review db. If exists:
    - set doc.value to the row value
  - else
- create doc with id=group key in string form, key=group key, value=value
========
Is that explanation clear?

I'm imagining that at any point on the b-tree where a reduce value is received, CouchDB can see if that node is one of the top nodes for a group key and can mark that group key somehow. I think that should be feasible (although I admit I've never tried to grok the code).

Cheers,

Wout.

Reply via email to