Re: chaining map reduce in hovercraft

Chris Anderson Fri, 05 Jun 2009 12:42:56 -0700

On Fri, Jun 5, 2009 at 7:13 AM, Zachary Zolton <[email protected]> wrote:
> So, Chris, it sounds like you're saying that POSTing to that URL will
> place the entire results of querying the view with group=true into
> another database. Sounds great!
>
> Will it work with 0.9? Would you suggest automating this using _changes?
>

I doubt this will get backported to the 0.9.x branch.

However, this is possible with 0.9 if you do it in a client. There are
examples in my CouchRest client of running a Ruby function over the
unique keys in a map view, but the pattern of just dumping the results
of a group reduce query into another DB is simple and effective.
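
For what it's worth, that client-side pattern might look roughly like
this in Python. This is only a sketch: it assumes the usual view
response shape ({"rows": [{"key": ..., "value": ...}]}) and the
_bulk_docs endpoint, and the helper names and URL arguments are made up
for illustration.

```python
import json
import urllib.request


def rows_to_docs(view_result):
    """Turn group-reduce rows into snapshot docs shaped like {"key": ..., "value": ...}."""
    return [{"key": row["key"], "value": row["value"]} for row in view_result["rows"]]


def snapshot_view(view_url, target_db_url):
    """Query a reduce view with group=true and bulk-save the rows into the target db.

    Both URLs are hypothetical, e.g.
    http://localhost:5984/srcdb/_design/app/_view/reduce_count
    and http://localhost:5984/targetdb
    """
    with urllib.request.urlopen(view_url + "?group=true") as resp:
        view_result = json.load(resp)
    body = json.dumps({"docs": rows_to_docs(view_result)}).encode("utf-8")
    req = urllib.request.Request(
        target_db_url + "/_bulk_docs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Since the docs carry no _id, each run would add a fresh set of docs, so
you'd want to point it at an empty target db each time.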

What I'm adding is simply a shortcut so that people can more
effectively play around with chaining map reduce queries. For now the
snapshot dbs will not update incrementally. However, they are just
documents so you can do in-place transformations on them (if you
want).

--- Actually I'm having second thoughts about putting this into
CouchDB. It's still a worthwhile technique, but I think we should
encourage you to use HTTP tools to run it. Here's why:

So, on a single node, this would be all well and good - you'd be able
to get a sorted list of tags by popularity, by running a simple
map-by-group-reduce-value view on the snapshot database.
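
To make the single-node case concrete, here is a small Python
emulation of such a view. A real CouchDB map function would be written
in JavaScript and would emit the doc's value as the view key; the
function names and sample docs below are invented for illustration.

```python
def map_by_value(doc):
    """Emulate a map function that emits (value, key) for each snapshot doc."""
    yield doc["value"], doc["key"]


def query_view(docs, descending=False):
    """Run the map over all docs and return the rows sorted by emitted key."""
    rows = [row for doc in docs for row in map_by_value(doc)]
    return sorted(rows, key=lambda row: row[0], reverse=descending)


# Hypothetical snapshot docs: one per tag, value is the tag count.
snapshot_docs = [
    {"key": "couchdb", "value": 511},
    {"key": "erlang", "value": 42},
    {"key": "ruby", "value": 97},
]

most_popular_first = query_view(snapshot_docs, descending=True)
```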

On a clustered setup, like couchdb-lounge provides, you'd end up with
problems, as each snapshot db would only reflect reductions run
locally (on the single shard). This is because the Erlang API used by
Hovercraft is not a multi-node API. Eventually we could give CouchDB
an internal Erlang proxy - but for now, multi-node clusters must be
built on HTTP.

So, since these Hovercraft chain snapshots are built against a single
node, the fully merged sort-by-value map query across the cluster
could have incorrect ordering.
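
A toy illustration of the problem in Python, with made-up per-shard
tag counts (nothing here talks to CouchDB; it only shows why sorting
partial reductions differs from sorting the rereduced totals):

```python
from collections import Counter

# Hypothetical group-reduce results, each seeing only its own shard's docs.
shard_a = {"couchdb": 5, "erlang": 4}
shard_b = {"erlang": 4, "ruby": 6}


def sort_by_value(rows):
    """Sort (tag, count) pairs by count, most popular first."""
    return sorted(rows, key=lambda kv: kv[1], reverse=True)


# Snapshots built per shard, then merged: partial counts are ranked as-is.
merged_snapshots = sort_by_value(list(shard_a.items()) + list(shard_b.items()))

# Rereduce across the shards first, then sort: one total per tag.
totals = Counter(shard_a)
totals.update(shard_b)
global_order = sort_by_value(totals.items())
```

Here the merged snapshots rank ruby (6) first and list erlang twice
with a count of 4, while the rereduced totals correctly put erlang (8)
on top.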

To guarantee correct ordering of tags by popularity in a clustered
deployment, you'd have to run the global reduce function against the
entire cluster rather than against a single local node, via something
like couchdb-lounge's Twisted Python rereducing proxy.

Ergo, a group-reduce chaining library is better off not written via
Hovercraft, because it should use the HTTP API. Anyone have a Python
version of this?

Performance freaks don't worry - in this application of HTTP there are
just a handful of long running connections and you should be able to
get disk IO bound even with the HTTP overhead.

Chris

> Cheers,
> Zach
>
> On Fri, Jun 5, 2009 at 6:17 AM, Viacheslav Seledkin
> <[email protected]> wrote:
>> Chris Anderson wrote:
>>>
>>> I finally got around to writing my map reduce copier. It's still
>>> basic, but what do you think?
>>>
>>> I want to put it into trunk as an http call, like:
>>>
>>> POST /_snapshot_view
>>>
>>> with JSON
>>>
>>> {"src":"/srcdb/_design/app/_view/reduce_count", "group_level":2,
>>> "target":"/targetdb"}
>>>
>>> Chainable map reduce seems to be one of the most popular requests on
>>> the survey we took, so hopefully this will make the heavy-data crew
>>> happy.
>>>
>>> There is an implementation here:
>>>
>>>
>>> http://github.com/jchris/hovercraft/commit/34b44527b660a740858cc71aa2c8326747465e31#L0R290
>>>
>>> What this does is take the results you'd get from querying your reduce
>>> view with group=true, and copy them to a new database. Basically you
>>> end up with a database full of docs that look like:
>>>
>>> {
>>> "key":[2009,2,14],
>>> "value": 511
>>> }
>>>
>>> Since they are docs sitting in another CouchDB, you can use more
>>> ordinary CouchDB Map Reduce views on that database to do things like
>>> sort by value, so you can for instance sort tags by popularity, or
>>> days by user activity, etc.
>>>
>>> Chris
>>>
>>>
>>> --
>>> Chris Anderson
>>> http://jchrisa.net
>>> http://couch.io
>>>
>>> .
>>>
>>>
>>
>> Will the process of updating the snapshot db be incremental?
>>
>

-- 
Chris Anderson
http://jchrisa.net
http://couch.io
