Hi,
This is OT for the original discussion imho
On 25 Jan 2010, at 01:16, Glenn Rempe wrote:
I'd be interested if anyone with partitioned CouchDB query experience
(Lounger or otherwise) can comment on view generation time when
parallelized across multiple machines.
I would also be interested in seeing any architectures that make use of this to parallelize view generation. I'm not sure your examples of Hadoop or Google M/R are really valid, because they provide file system abstractions (e.g. Hadoop FS) for automatically streaming a single copy of the data to where it needs to be Mapped/Reduced, and CouchDB has nothing similar.
IMHO something like HDFS isn't needed, since CouchDB's HTTP API already gives you a simple, scalable way of getting at the data. What I'd like (to have time to work on...) is the following:
1. be able to configure a pipeline of documents that are sent to the view server
1a. be able to set the size of that pipeline to 0, which just sends a sane header (there are N documents in the database)
2. the view server spawns off child processes (I'm thinking Disco, but Hadoop would be able to do the same) on the various worker nodes
3. each worker is given a range of documents to process and pulls these in from _all_docs (see the sketch after this list)
4. each worker processes its portion of the database
5. each worker returns its results to the view server, which aggregates them up into the final view
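
To make steps 3 and 4 concrete, here's a minimal sketch of what one worker could look like, assuming a CouchDB database at http://localhost:5984/mydb. Only the _all_docs query parameters are real CouchDB API; the database name, the map function, and the (type, 1) rows are hypothetical illustrations:

import json
from urllib.parse import urlencode
from urllib.request import urlopen

COUCH = "http://localhost:5984/mydb"   # assumed database URL

def fetch_range(start_key, end_key, limit=1000):
    """Step 3: pull one worker's slice of documents via _all_docs."""
    params = urlencode({
        "include_docs": "true",
        "startkey": json.dumps(start_key),   # _all_docs expects JSON-encoded keys
        "endkey": json.dumps(end_key),
        "limit": limit,
    })
    with urlopen(f"{COUCH}/_all_docs?{params}") as resp:
        return [row["doc"] for row in json.load(resp)["rows"]]

def map_doc(doc):
    """Hypothetical map function: emit one (type, 1) row per document."""
    if "type" in doc:
        yield doc["type"], 1

def process_range(start_key, end_key):
    """Step 4: a worker maps its portion of the database."""
    rows = []
    for doc in fetch_range(start_key, end_key):
        rows.extend(map_doc(doc))
    return rows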
The main issue here is how good your view server is; can it handle getting thousands of responses at once? An HTTP interface for returning view results would be nice... I'm pretty sure that CouchDB itself could handle getting all the requests from workers. I think this could also allow for view-of-view processing, without going through/maintaining an intermediate database.
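
And a rough sketch of the aggregating side (step 5), reusing process_range from the sketch above. The thread pool and the per-key sum are stand-ins for whatever Disco or Hadoop would actually do, not an existing CouchDB interface, and the range boundaries in the example are chosen arbitrarily:

from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def build_view(ranges, workers=4):
    """Step 5: fan key ranges out to workers and merge the emitted rows."""
    totals = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for rows in pool.map(lambda r: process_range(*r), ranges):
            for key, value in rows:
                totals[key] += value   # "reduce": sum the values per key
    return dict(totals)

# e.g. split the ID space into two slices:
# view = build_view([("", "m"), ("m", "\ufff0")])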
Cheers
Simon