Having the replicator handle chaining views would really help people who are already hacking this together with scripts, so I'd definitely +1 the idea. Aren't view size and indexing time a separate problem from designing this replicator API?
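For reference, here is a toy in-memory Python sketch of what those scripts effectively do today (function and variable names are illustrative, not CouchDB's API): run a map/reduce view, materialize its rows as documents in a second "database", then run the next view over that.

```python
# A minimal sketch of chained map/reduce via "copy view rows to a new db".
# Purely illustrative; real scripts would go through CouchDB's HTTP API.

def run_view(db, map_fn, reduce_fn=None):
    """Map every doc in `db`, group by key, and optionally reduce."""
    rows = []
    for doc in db:
        rows.extend(map_fn(doc))  # map_fn yields (key, value) pairs
    if reduce_fn is None:
        return rows
    grouped = {}
    for key, value in rows:
        grouped.setdefault(key, []).append(value)
    return [(key, reduce_fn(values)) for key, values in grouped.items()]

def replicate_rows(rows):
    """'Copy these view rows to a new db': each row becomes a document."""
    return [{"key": k, "value": v} for k, v in rows]

# Stage 1: sum word counts per tag.
docs = [{"tag": "a", "words": 3}, {"tag": "b", "words": 5}, {"tag": "a", "words": 2}]
stage1 = run_view(docs, lambda d: [(d["tag"], d["words"])], reduce_fn=sum)

# Replicate stage-1 rows into a fresh "database" ...
db2 = replicate_rows(stage1)

# Stage 2: ... and chain a second map/reduce over it (grand total).
stage2 = run_view(db2, lambda d: [("total", d["value"])], reduce_fn=sum)

print(dict(stage1))  # {'a': 5, 'b': 5}
print(dict(stage2))  # {'total': 10}
```

The point is only that the "replicate" step between the two views is mechanical, which is why it seems like a natural fit for the replicator rather than user scripts.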
On Sun, Jan 24, 2010 at 9:47 PM, Chris Anderson <[email protected]> wrote:
> On Sun, Jan 24, 2010 at 5:16 PM, Glenn Rempe <[email protected]> wrote:
>> On Sun, Jan 24, 2010 at 2:11 PM, Chris Anderson <[email protected]> wrote:
>>
>>> On Sun, Jan 24, 2010 at 2:04 PM, Glenn Rempe <[email protected]> wrote:
>>> > On Sun, Jan 24, 2010 at 12:09 AM, Chris Anderson <[email protected]> wrote:
>>> >
>>> >> Devs,
>>> >>
>>> >> I've been thinking there are a few simple options that would magnify
>>> >> the power of the replicator a lot.
>>> >>
>>> >> ...
>>> >> The fun one is chained map reduce. It occurred to me the other night
>>> >> that the simplest way to present a chainable map reduce abstraction to
>>> >> users is through the replicator. The action "copy these view rows to a
>>> >> new db" is a natural fit for the replicator. I imagine this would be
>>> >> super useful to people doing big messy data munging, and it wouldn't
>>> >> be too hard for the replicator to handle.
>>> >>
>>> >>
>>> > I like this idea as well, as chainable map/reduce has been something I
>>> > think a lot of people would like to use. The thing I am concerned about,
>>> > and which is related to another ongoing thread, is the size of views on
>>> > disk and the slowness of generating them. I fear that we would end up
>>> > ballooning views on disk to a size that is unmanageable if we chained
>>> > them. I have an app in production with 50M rows, whose DB has grown to
>>> > >100GB, and the views take up approx. 800GB (!). I don't think I could
>>> > afford the disk space to even consider using this, especially when you
>>> > consider that in order to compact a DB or view you need roughly 2x the
>>> > disk space of the files on disk.
>>> >
>>> > I also worry about the time to generate chained views, when the time
>>> > needed for generating views currently is already a major weak point of
>>> > CouchDB (generating my views took more than a week).
>>> >
>>> > In practice, I think only those with relatively small DBs would be able
>>> > to take advantage of this feature.
>>> >
>>>
>>> For large data, you'll want a cluster. The same holds true for other
>>> map/reduce frameworks like Hadoop or Google's stuff.
>>>
>>
>> That would not resolve the issue I mentioned, where views can be a
>> multiple of the size of the original DB. I have about 9 views in a design
>> doc, and my resultant view files on disk are about 9x the size of the
>> original DB data.
>>
>> How would sharding this across multiple DBs in a cluster resolve this? You
>> would still end up with views that are some multiple of the size of their
>> original sharded DB, compounded by how many replicas you have of that view
>> data for chained M/R.
>>
>>> I'd be interested if anyone with partitioned CouchDB query experience
>>> (Lounge or otherwise) can comment on view generation time when
>>> parallelized across multiple machines.
>>>
>> I would also be interested in seeing any architectures that make use of
>> this to parallelize view generation. I'm not sure your examples of Hadoop
>> or Google M/R are really valid, because they provide file system
>> abstractions (e.g. HDFS) for automatically streaming a single copy of the
>> data to where it needs to be mapped/reduced, and CouchDB has nothing
>> similar.
>>
>> http://hadoop.apache.org/common/docs/current/hdfs_design.html
>>
>> Don't get me wrong, I would love to see these things happen; I just wonder
>> if there are other issues that need to be resolved first before this is
>> practical for anything but a small dataset.
>>
>
> I know Hadoop and Couch are dissimilar, but the way to parallelize
> CouchDB view generation is with a partitioned cluster like
> CouchDB-Lounge or the Cloudant stuff.
>
> It doesn't help much with the size inefficiencies, but it will help with
> generation time.
>
> Chris
>
>
> --
> Chris Anderson
> http://jchrisa.net
> http://couch.io
>
