On Fri, Feb 20, 2009 at 2:45 PM, Damien Katz <[email protected]> wrote:
>
> On Feb 20, 2009, at 4:37 PM, Stefan Karpinski wrote:
>
>>> Trees would be overkill except for very large clusters.
>>>
>>> With CouchDB map views, you need to combine results from every node in a
>>> big merge sort. If you combine all results at a single node, that single
>>> client's ability to simultaneously pull and sort data from all other
>>> nodes may become the bottleneck. So to parallelize, you have multiple
>>> nodes doing a merge sort of sub-nodes, then sending those results to
>>> another node to be combined further, etc. The same goes for the reduce
>>> views, but instead of a merge sort it's just rereducing results. The
>>> natural "shape" of that computation is a tree, with only the final root
>>> node at the top being the bottleneck, but now it has to maintain
>>> connections and merge the sorted values from far fewer nodes.
>>>
>>> -Damien
>>
>> That makes sense and it clarifies one of my questions about this topic.
>> Is the goal of partitioned clustering to increase performance for very
>> large data sets, or to increase reliability? It would seem from this
>> answer that the goal is to increase query performance by distributing the
>> query processing, and not to increase reliability.
>
> I see partitioning and clustering as 2 different things. Partitioning is
> data partitioning: spreading the data out across nodes, with no node
> having the complete database. Clustering is nodes having the same, or
> nearly the same, data (they might be behind on replicating changes, but
> otherwise they have the same data).
>
> Partitioning would primarily increase write performance (updates happening
> concurrently on many nodes) and the size of the data set. Partitioning
> helps with client read scalability, but only for document reads, not view
> queries.
> Partitioning alone could reduce reliability, depending on how tolerant
> you are of missing portions of the database.
>
> Clustering would primarily address database reliability (failover) and
> client read scalability for docs and views. Clustering doesn't help much
> with write performance, because even if you spread out the update load,
> the replication as the cluster syncs up means every node gets the update
> anyway. It might be useful for dealing with update spikes, where you get
> a bunch of updates at once and can wait for the replication delay to get
> everyone synced back up.
>
> For a really big, really reliable database, I'd have clusters of
> partitions, where the database is partitioned N ways, with each partition
> having at least M identical cluster members. Increase N for larger
> databases and update load, M for higher availability and read load.
>
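To check my understanding of the merge-sort tree for view queries: here's a
rough Python sketch. The node result sets, the two-level fan-in, and the
helper name are my own illustration, not CouchDB internals.

```python
import heapq

def merge_sorted(*streams):
    # Merge any number of already-sorted per-node result streams --
    # the operation each intermediate tree node performs.
    return list(heapq.merge(*streams))

# Sorted map-view keys from six hypothetical partition nodes.
node_results = [[1, 5, 9], [2, 6], [3, 7], [4, 8], [0, 10], [11]]

# Level 1: two intermediate nodes each merge three sub-nodes in parallel.
left = merge_sorted(*node_results[:3])
right = merge_sorted(*node_results[3:])

# Root: maintains only two connections and merges two pre-sorted streams,
# instead of pulling from all six nodes itself.
final = merge_sorted(left, right)
print(final)  # -> [0, 1, 2, ..., 11]
```

So the root does strictly less work than a single-node merge would, which
matches the "far fewer nodes" point above.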
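And the N-ways-partitioned, M-members-per-partition layout, as I read it,
would route roughly like this. A hedged sketch: the hash scheme, host names,
and function names are all hypothetical, not anything CouchDB does today.

```python
import hashlib

N, M = 4, 3  # N partitions for size/update load, M members for availability

# cluster[p] lists the (hypothetical) hosts holding copies of partition p.
cluster = [[f"node-{p}-{m}" for m in range(M)] for p in range(N)]

def partition_for(doc_id: str) -> int:
    # Stable hash, so a given doc id always maps to the same partition.
    digest = hashlib.sha1(doc_id.encode()).hexdigest()
    return int(digest, 16) % N

def write_targets(doc_id: str):
    # A write lands on one partition; replication then syncs its M members.
    return cluster[partition_for(doc_id)]

def read_target(doc_id: str, attempt: int = 0):
    # Any of the M members can serve a doc read; rotating through them
    # gives read scalability and failover.
    members = cluster[partition_for(doc_id)]
    return members[attempt % len(members)]
```

Growing N is what makes rebalancing interesting, since the hash mapping
changes -- hence my question below.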
Thanks for the clarification. Can you say anything about how you see
rebalancing working?

--
Chris Anderson
http://jchris.mfdz.com
