Hey Chris,
The main issue here is how good your view server is; can it take getting
1000's of responses at once? An HTTP view response would be nice... I'm
pretty sure that CouchDB could handle getting all the requests from workers.
I think this could also allow for view-of-view processing, without going
through/maintaining an intermediate database.
The reason we haven't implemented something like this yet is that it
assumes that your bottleneck is CPU-time, and that it's worth it to
move docs across a cluster to be processed, then return the rows to
the original Couch for indexing.
This might help a little bit in cases where your map function is very
CPU intensive, but you aren't going to get 8x faster by using 8 boxes,
because the bottleneck will quickly become updating the view index
file on the original Couch.
Partitioning (a cluster of, say, 8 Couches where each Couch has 1/8th of
the data) will speed your view generation up by roughly 8x (at the
expense of slightly higher http-query overhead). In this approach, the
map/reduce model on any individual one of those Couches isn't any
different than it is today. The functions run close to the data
(no map/reduce network overhead), and the rows are stored in one index
per Couch, which is what allows the 8x speedup.
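As a rough sketch of what the client side of that could look like (the
partition hosts, database name, and view name below are made up; the only
CouchDB API assumed is the usual GET /db/_design/ddoc/_view/name query),
each Couch runs its map/reduce locally and the client just merges the
already-sorted rows:

# Minimal sketch: query the same view on N partitioned Couches and
# merge the sorted rows client-side. Hosts, db, and view names are
# hypothetical placeholders.
import json
import heapq
from urllib.request import urlopen

PARTITIONS = ["http://couch%d:5984" % i for i in range(8)]  # hypothetical hosts
VIEW_PATH = "/mydb/_design/stats/_view/by_key"              # hypothetical view

def partition_rows(base_url):
    # Each Couch has already run map/reduce over its 1/8th of the data;
    # this is just an ordinary view query against one partition.
    with urlopen(base_url + VIEW_PATH) as resp:
        return json.load(resp)["rows"]

def merged_view():
    # Every partition returns rows sorted by key, so a k-way merge
    # is enough to reassemble one key-ordered result stream.
    per_partition = [partition_rows(url) for url in PARTITIONS]
    return heapq.merge(*per_partition, key=lambda row: row["key"])

if __name__ == "__main__":
    for row in merged_view():
        print(row["key"], row["value"])

The merge itself is cheap; the extra cost is just the N HTTP round trips,
which is the "slightly higher http-query overhead" mentioned above.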
Does this make sense?
Sure does. However, say I have a cluster of a few thousand machines
(which I do); I'm not sure that will work. The map/reduce would be very
CPU-intensive (a map would be a simulation, lasting minutes per
document, seeded from values in a Couch document). The machines are in
a traditional batch system, so setting up long-lived services on them
(e.g. an instance of CouchDB) isn't going to fly. The documents and
results are small but numerous (millions). I agree I won't get linear
scaling with the number of machines, but it might make possible
something that isn't possible in a standard Couch cluster (where the
map alone would take millions of minutes).
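To make the batch-worker idea concrete, here is a rough sketch of what
each short-lived job could do (the database names, document IDs, and
run_simulation() below are placeholders; the only CouchDB API assumed is
plain GET /db/{id} and PUT /db/{id}), so nothing long-lived has to run
on the worker nodes:

# Minimal sketch: a batch job pulls one seed document over HTTP, runs
# the long simulation, and writes the result back as a new document.
# Names and the simulation body are hypothetical placeholders.
import json
import sys
from urllib.request import Request, urlopen

COUCH = "http://central-couch:5984"   # hypothetical central Couch
SOURCE_DB = "parameters"              # hypothetical input database
RESULT_DB = "results"                 # hypothetical output database

def run_simulation(doc):
    # Placeholder for the minutes-long simulation seeded from the doc.
    return {"input_id": doc["_id"], "score": 0.0}

def process(doc_id):
    # Fetch the seed document from the central Couch.
    with urlopen("%s/%s/%s" % (COUCH, SOURCE_DB, doc_id)) as resp:
        doc = json.load(resp)

    result = run_simulation(doc)

    # Write the result as its own document; the worker only needs
    # outbound HTTP, not a local CouchDB instance.
    req = Request(
        "%s/%s/sim-%s" % (COUCH, RESULT_DB, doc_id),
        data=json.dumps(result).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urlopen(req) as resp:
        print(json.load(resp))

if __name__ == "__main__":
    process(sys.argv[1])  # doc id handed out by the batch scheduler

The batch scheduler would just hand each job a document id; turning the
result documents into views back on the central Couch is then ordinary
CouchDB indexing.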
Maybe Couch isn't a good fit for this and I should use Hadoop for
everything, but it's fun to find out :)
Cheers
Simon