On Sun, Jan 24, 2010 at 5:16 PM, Glenn Rempe <[email protected]> wrote:
> On Sun, Jan 24, 2010 at 2:11 PM, Chris Anderson <[email protected]> wrote:
>
>> On Sun, Jan 24, 2010 at 2:04 PM, Glenn Rempe <[email protected]> wrote:
>> > On Sun, Jan 24, 2010 at 12:09 AM, Chris Anderson <[email protected]>
>> > wrote:
>> >
>> >> Devs,
>> >>
>> >> I've been thinking there are a few simple options that would magnify
>> >> the power of the replicator a lot.
>> >>
>> >> ...
>> >> The fun one is chained map reduce. It occurred to me the other
>> >> night that the simplest way to present a chainable map reduce
>> >> abstraction to users is through the replicator. The action "copy
>> >> these view rows to a new db" is a natural fit for the replicator.
>> >> I imagine this would be super useful to people doing big messy
>> >> data munging, and it wouldn't be too hard for the replicator to
>> >> handle.
>> >>
>> >>
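(Aside: you can already approximate this chain by hand over the HTTP
API. Below is a minimal Python sketch; the db names, design doc, and
view are made up for illustration, and a real version would fetch
existing _revs and handle conflicts rather than assuming a fresh
target db.

    import json
    from urllib.request import Request, urlopen

    COUCH = "http://127.0.0.1:5984"

    def get_json(path):
        with urlopen(COUCH + path) as resp:
            return json.load(resp)

    def post_json(path, body):
        req = Request(COUCH + path, data=json.dumps(body).encode(),
                      headers={"Content-Type": "application/json"})
        with urlopen(req) as resp:
            return json.load(resp)

    # 1. Pull the grouped (reduced) rows of the first-pass view.
    rows = get_json(
        "/source_db/_design/pass1/_view/by_tag?group=true")["rows"]

    # 2. Copy each view row into a second db as a plain doc, keyed by
    #    the view key. Naive: assumes chain_db is empty, so no _revs.
    docs = [{"_id": json.dumps(row["key"]), "value": row["value"]}
            for row in rows]
    post_json("/chain_db/_bulk_docs", {"docs": docs})

    # 3. A second design doc on chain_db can now map/reduce over these
    #    docs, which gives you stage two of the chain.

A built-in replicator mode would essentially do the above
incrementally instead of in one shot.)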
>> > I like this idea as well; chainable map/reduce is something I
>> > think a lot of people would like to use. The thing I am concerned
>> > about, and which is related to another ongoing thread, is the size
>> > of views on disk and the slowness of generating them. I fear that
>> > chaining would balloon views on disk to an unmanageable size. I
>> > have an app in production with 50m rows whose DB has grown
>> > to >100GB, and the views take up approx. 800GB (!). I don't think
>> > I could afford the disk space to even consider using this,
>> > especially since compacting a DB or view needs roughly 2x the disk
>> > space of the files on disk.
>> >
>> > I also worry about the time to generate chained views, when view
>> > generation is already a major weak point of CouchDB (generating my
>> > views took more than a week).
>> >
>> > In practice, I think only those with relatively small DBs would be
>> > able to take advantage of this feature.
>> >
>>
>> For large data, you'll want a cluster. The same holds true for other
>> Map Reduce frameworks like Hadoop or Google's stuff.
>>
>>
>
> That would not resolve the issue I mentioned, where views can be
> several times the size of the original DB. I have about 9 views in a
> design doc, and my resultant view files on disk are about 9x the size
> of the original DB data.
>
> How would sharding this across multiple DBs in a cluster resolve
> that? You would still end up with views that are some multiple of the
> size of their original shard, compounded by however many replicas of
> that view data you keep for chained M/R.
>
>
>> I'd be interested if anyone with partitioned CouchDB query experience
>> (Lounger or otherwise) can comment on view generation time when
>> parallelized across multiple machines.
>>
>>
> I would also be interested in seeing any architectures that use this
> to parallelize view generation. I'm not sure your examples of Hadoop
> or Google M/R are really valid, because they provide file system
> abstractions (e.g. HDFS) for automatically streaming a single copy of
> the data to wherever it needs to be mapped/reduced, and CouchDB has
> nothing similar.
>
> http://hadoop.apache.org/common/docs/current/hdfs_design.html
>
> Don't get me wrong, I would love to see these things happen; I just
> wonder if there are other issues that need to be resolved before this
> is practical for anything but a small dataset.
>

I know Hadoop and Couch are dissimilar, but the way to parallelize
CouchDB view generation is with a partitioned cluster like
CouchDB-Lounge or the Cloudant stuff.

It doesn't help much with the size inefficiency, but it does help with
generation time.
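
To make that concrete, here is a rough sketch of the scatter/gather a
Lounge-style proxy does; the shard URLs and view path are
hypothetical. Each shard indexes only its own slice of the docs, so
the expensive view builds run on all the nodes at once:

    import heapq
    import json
    from concurrent.futures import ThreadPoolExecutor
    from urllib.request import urlopen

    SHARDS = ["http://node%d:5984/db_shard%d" % (i, i)
              for i in range(4)]
    VIEW = "/_design/pass1/_view/by_tag"

    def shard_rows(base):
        # Each shard has already built a view over its slice only.
        with urlopen(base + VIEW) as resp:
            return json.load(resp)["rows"]

    # Query all shards in parallel.
    with ThreadPoolExecutor(len(SHARDS)) as pool:
        per_shard = list(pool.map(shard_rows, SHARDS))

    # Each shard returns rows sorted by key, so a k-way merge restores
    # global view order. (Real CouchDB keys need collation-aware
    # comparison; sorting on the JSON text is a simplification.)
    for row in heapq.merge(*per_shard,
                           key=lambda r: json.dumps(r["key"])):
        print(row)

The merge itself is cheap; the parallelism is in the per-shard index
builds.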

Chris


-- 
Chris Anderson
http://jchrisa.net
http://couch.io
