On Feb 20, 2009, at 1:55 PM, Stefan Karpinski wrote:
Hi, I thought I'd introduce myself since I'm new here on the couchdb
list. I'm Stefan Karpinski. I've worked in the Monitoring Group at
Akamai, Operations R&D at Citrix Online, and I'm nearly done with a
PhD in computer networking at the moment. So I guess I've thought
about this kind of stuff a bit ;-)
I'm curious what the motivation behind a tree topology is. Not that
it's not a viable approach, just why that and not a load-balancer in
front of a bunch of "leaves" with lateral propagation between the
leaves? Why should the load-balancing/proxying/caching node even be
running couchdb?
One reason I can see for a tree topology would be the hierarchical
cache effect. But that would likely only make sense in certain
circumstances. Being able to configure the topology to meet various
needs, rather than enforcing one particular topology, makes more sense
to me overall.
Trees would be overkill except in very large clusters.
With CouchDB map views, you need to combine results from every node in
a big merge sort. If you combine all results at a single node, the
single client's ability to simultaneously pull and sort data from
all other nodes may become the bottleneck. So to parallelize, you have
multiple nodes doing a merge sort of sub-nodes, then sending those
results to another node to be combined further, etc. The same goes
for reduce views, but instead of a merge sort it's just
rereducing results. The natural "shape" of that computation is a tree,
with only the final root node at the top being the bottleneck, but now
it has to maintain connections and merge the sorted values from far
fewer nodes.
-Damien
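
To illustrate the tree-shaped combine described above, here is a minimal
sketch in Python. It is not CouchDB code: the function names, the
`fanout` parameter, and the in-memory lists standing in for per-node view
results are all assumptions made for the example. Each pass of the loop
plays the role of one tier of intermediate nodes, so no single merge
step ever handles more than `fanout` inputs.

```python
import heapq


def merge_sorted(streams):
    """One merge node: combine already-sorted row lists into one sorted list."""
    return list(heapq.merge(*streams))


def tree_merge(leaves, fanout=2):
    """Merge sorted per-node results tier by tier (hypothetical helper).

    `leaves` is a list of sorted lists of (key, value) rows, one per node.
    Each tier merges groups of at most `fanout` children into one parent.
    """
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [list(rows) for rows in leaves]
    if not level:
        return []
    while len(level) > 1:
        level = [merge_sorted(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]


def tree_rereduce(partials, rereduce, fanout=2):
    """Same shape for reduce views: rereduce partial results tier by tier.

    `rereduce` takes a list of partial values and returns one value, and is
    assumed to be safe to apply repeatedly (as with sum, min, or max).
    """
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = list(partials)
    if not level:
        raise ValueError("no partial results to rereduce")
    while len(level) > 1:
        level = [rereduce(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]
```

With four leaves and a fanout of 2, for instance, the root merges two
streams instead of four; with a thousand leaves and a fanout of 10, it
merges ten instead of a thousand, at the cost of two extra tiers.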