Hi,

This is OT for the original discussion, IMHO.

On 25 Jan 2010, at 01:16, Glenn Rempe wrote:

I'd be interested if anyone with partitioned CouchDB query experience
(Lounger or otherwise) can comment on view generation time when
parallelized across multiple machines.


I would also be interested in seeing any architectures that make use of this to parallelize view generation. I'm not sure your examples of Hadoop or Google M/R are really valid, because they provide file system abstractions (e.g. Hadoop FS) that automatically stream a single copy of the data to wherever it needs to be mapped/reduced, and CouchDB has nothing similar.

IMHO something like HDFS isn't needed, since there's already a simple, scalable way of getting at the data. What I'd like (to have time to work on...) is the following:

1. be able to configure a pipeline of documents that are sent to the view server
1a. be able to set the size of that pipeline to 0, which just sends a sane header (there are N documents in the database)
2. view server spawns off child processes (I'm thinking Disco, but Hadoop would be able to do the same) on the various worker nodes
3. each worker is given a range of documents to process, pulls these in from _all_docs
4. worker processes its portion of the database
5. worker returns its results to the view server which aggregates them up into the final view
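
To make steps 2-5 a bit more concrete, here is a rough Python sketch of the worker side. It is only an illustration under assumptions: the function names (partition, all_docs_url, run_worker, aggregate), the server address and the database name are all made up for the example, and nothing here is an existing CouchDB view server API. The only CouchDB pieces it relies on are the skip, limit and include_docs query parameters of _all_docs. No HTTP request is actually made; the sketch just shows how the ranges and the final merge could fit together.

```python
from urllib.parse import urlencode


def partition(total_docs, n_workers):
    """Split total_docs into contiguous (skip, limit) ranges, one per worker."""
    base, extra = divmod(total_docs, n_workers)
    ranges, skip = [], 0
    for i in range(n_workers):
        # The first `extra` workers take one additional document each.
        limit = base + (1 if i < extra else 0)
        if limit:
            ranges.append((skip, limit))
        skip += limit
    return ranges


def all_docs_url(server, db, skip, limit):
    """Build the _all_docs URL a worker would fetch for its range."""
    query = urlencode({"include_docs": "true", "skip": skip, "limit": limit})
    return f"{server}/{db}/_all_docs?{query}"


def run_worker(docs, map_fn):
    """Apply map_fn to each document; map_fn returns a list of (key, value)."""
    emitted = []
    for doc in docs:
        emitted.extend(map_fn(doc))
    return emitted


def aggregate(worker_results):
    """Merge the per-worker results into one list sorted by key."""
    merged = [kv for result in worker_results for kv in result]
    return sorted(merged, key=lambda kv: kv[0])
```

With the size-0 pipeline from 1a, the view server would only need N from the header to call partition(N, n_workers) and hand each worker its URL. In a real setup, key ranges (startkey/endkey) might be a cheaper way to split the database than skip, but that is a guess on my part rather than something measured.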

The main issue here is how good your view server is: can it handle receiving thousands of responses at once? An HTTP view response would be nice... I'm pretty sure that CouchDB could handle the requests from all the workers. I think this could also allow view-of-view processing without going through, or maintaining, an intermediate database.

Cheers
Simon
