On Sun, Jan 24, 2010 at 8:16 PM, Glenn Rempe <[email protected]> wrote:
> On Sun, Jan 24, 2010 at 2:11 PM, Chris Anderson <[email protected]> wrote:
>
>> On Sun, Jan 24, 2010 at 2:04 PM, Glenn Rempe <[email protected]> wrote:
>> > On Sun, Jan 24, 2010 at 12:09 AM, Chris Anderson <[email protected]> wrote:
>> >
>> >> Devs,
>> >>
>> >> I've been thinking there are a few simple options that would magnify
>> >> the power of the replicator a lot.
>> >>
>> >> ...
>> >>
>> >> The fun one is chained map reduce. It occurred to me the other night
>> >> that the simplest way to present a chainable map reduce abstraction
>> >> to users is through the replicator. The action "copy these view rows
>> >> to a new db" is a natural fit for the replicator. I imagine this
>> >> would be super useful to people doing big messy data munging, and it
>> >> wouldn't be too hard for the replicator to handle.
>> >>
>> > I like this idea as well, as chainable map/reduce has been something
>> > I think a lot of people would like to use. The thing I am concerned
>> > about, and which is related to another ongoing thread, is the size of
>> > views on disk and the slowness of generating them. I fear that we
>> > would end up ballooning views on disk to an unmanageable size if we
>> > chained them. I have an app in production with 50m rows, whose DB has
>> > grown to >100GB, and the views take up approx 800GB (!). I don't
>> > think I could afford the disk space to even consider using this,
>> > especially when you consider that compacting a DB or view needs
>> > roughly 2x the disk space of the files on disk.
>> >
>> > I also worry about the time to generate chained views, when the time
>> > needed to generate views is already a major weak point of CouchDB
>> > (generating my views took more than a week).
>> >
>> > In practice, I think only those with relatively small DBs would be
>> > able to take advantage of this feature.
>> >
>>
>> For large data, you'll want a cluster. The same holds true for other
>> Map Reduce frameworks like Hadoop or Google's stuff.
>>
> That would not resolve the issue I mentioned, where views can be a
> multiple of the size of the original DB. I have about 9 views in a
> design doc, and my resultant view files on disk are about 9x the size
> of the original DB data.
>
> How would sharding this across multiple DBs in a cluster resolve that?
> You would still end up with views that are some multiple of the size
> of their original sharded DB, compounded by however many replicas of
> that view data you keep for chained M/R.
>
>> I'd be interested if anyone with partitioned CouchDB query experience
>> (Lounger or otherwise) can comment on view generation time when
>> parallelized across multiple machines.
>>
> I would also be interested in seeing any architectures that use this
> to parallelize view generation. I'm not sure your examples of Hadoop
> and Google M/R are really valid, because they provide file system
> abstractions (e.g. HDFS) for automatically streaming a single copy of
> the data to wherever it needs to be mapped/reduced, and CouchDB has
> nothing similar.
>
> http://hadoop.apache.org/common/docs/current/hdfs_design.html
>
> Don't get me wrong, I would love to see these things happen; I just
> wonder if there are other issues that need to be resolved before this
> is practical for anything but a small dataset.
>
Theoretically, the bigger your view is, the more inflated it'll become. Those would be some interesting numbers to see; seems like they'd be helpful in choosing shard sizes and so on.
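
If anyone wants to experiment with the idea before the replicator
learns it, below is a rough sketch of the "copy these view rows to a
new db" step done by hand against the HTTP API. It's Python + requests;
the server, db, design doc, and view names are all made up, it assumes
the first-stage view has a reduce function, and it skips pagination,
batching, and conflict handling entirely:

    import json
    import requests

    COUCH = "http://127.0.0.1:5984"  # hypothetical server
    SOURCE_VIEW = COUCH + "/source_db/_design/stage1/_view/by_key"
    TARGET_DB = COUCH + "/stage2_db"

    # Create the target db; a 412 response just means it already exists.
    requests.put(TARGET_DB)

    # Pull the reduced rows out of the first-stage view.
    rows = requests.get(SOURCE_VIEW, params={"group": "true"}).json()["rows"]

    # Turn each view row into an ordinary doc. Deriving _id from the row
    # key makes the copy deterministic: a rerun conflicts instead of
    # duplicating (a real implementation would fetch and carry _revs).
    docs = [{"_id": json.dumps(row["key"]),
             "key": row["key"],
             "value": row["value"]}
            for row in rows]

    # Bulk-load into the target db, where a second design doc can then
    # map/reduce the first stage's output -- chained map/reduce by hand.
    requests.post(TARGET_DB + "/_bulk_docs", json={"docs": docs})

Once the rows land in the target db as ordinary docs, a second design
doc there gives you the next map/reduce stage -- and the copy step in
the middle is exactly the part the replicator could automate.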
