Hi,
This is OT for the original discussion imho
On 25 Jan 2010, at 01:16, Glenn Rempe wrote:
I'd be interested if anyone with partitioned CouchDB query experience
(Lounger or otherwise) can comment on view generation time when
parallelized across multiple machines.
I would also be interested in seeing any architectures that make use of this to parallelize view generation. I'm not sure your examples of Hadoop or Google M/R are really valid, because they provide file system abstractions (e.g. Hadoop FS) for automatically streaming a single copy of the data to where it needs to be Mapped/Reduced, and CouchDB has nothing similar.
IMHO something like HDFS isn't needed, since CouchDB's HTTP API already gives you a simple, scalable way of getting at the data. What I'd like (to have time to work on...) is the following:
1. be able to configure a pipeline of documents that are sent to the view server
1a. be able to set the size of that pipeline to 0, which just sends a sane header (there are N documents in the database)
2. the view server spawns off child processes (I'm thinking Disco, but Hadoop would be able to do the same) on the various worker nodes
3. each worker is given a range of documents to process and pulls these in from _all_docs (see the sketch after this list)
4. each worker processes its portion of the database
5. each worker returns its results to the view server, which aggregates them up into the final view
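
To make steps 3 and 4 concrete, here's a minimal sketch of what one worker could look like, assuming a CouchDB database at http://localhost:5984/mydb. Only the _all_docs query parameters are real CouchDB API; the database name, the map function, and the (type, 1) rows are hypothetical illustrations:

import json
from urllib.parse import urlencode
from urllib.request import urlopen

COUCH = "http://localhost:5984/mydb"   # assumed database URL

def fetch_range(start_key, end_key, limit=1000):
    """Step 3: pull one worker's slice of documents via _all_docs."""
    params = urlencode({
        "include_docs": "true",
        "startkey": json.dumps(start_key),   # _all_docs expects JSON-encoded keys
        "endkey": json.dumps(end_key),
        "limit": limit,
    })
    with urlopen(f"{COUCH}/_all_docs?{params}") as resp:
        return [row["doc"] for row in json.load(resp)["rows"]]

def map_doc(doc):
    """Hypothetical map function: emit one (type, 1) row per document."""
    if "type" in doc:
        yield doc["type"], 1

def process_range(start_key, end_key):
    """Step 4: a worker maps its portion of the database."""
    rows = []
    for doc in fetch_range(start_key, end_key):
        rows.extend(map_doc(doc))
    return rows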
The main issue here is how good your view server is; can it handle getting thousands of responses at once? An HTTP interface for returning view results would be nice... I'm pretty sure that CouchDB itself could handle getting all the requests from workers. I think this could also allow for view-of-view processing, without going through/maintaining an intermediate database.
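
And a rough sketch of the aggregating side (step 5), reusing process_range from the sketch above. The thread pool and the per-key sum are stand-ins for whatever Disco or Hadoop would actually do, not an existing CouchDB interface, and the range boundaries in the example are chosen arbitrarily:

from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def build_view(ranges, workers=4):
    """Step 5: fan key ranges out to workers and merge the emitted rows."""
    totals = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for rows in pool.map(lambda r: process_range(*r), ranges):
            for key, value in rows:
                totals[key] += value   # "reduce": sum the values per key
    return dict(totals)

# e.g. split the ID space into two slices:
# view = build_view([("", "m"), ("m", "\ufff0")])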
Cheers
Simon