Considering that we're opting for a WAN deployment that is not going to use groups, weights, etc., and that we are going to implement an ensemble-to-ensemble sync mechanism... what version of ZooKeeper do you recommend?
> -----Original Message-----
> From: Todd Greenwood
> Sent: Wednesday, August 05, 2009 2:21 PM
> To: 'zookeeper-dev@hadoop.apache.org'
> Subject: RE: Optimized WAN ZooKeeper Config : Multi-Ensemble configuration
>
> Mahadev, comments inline:
>
> > -----Original Message-----
> > From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
> > Sent: Wednesday, August 05, 2009 1:47 PM
> > To: zookeeper-dev@hadoop.apache.org
> > Subject: Re: Optimized WAN ZooKeeper Config : Multi-Ensemble configuration
> >
> > Todd,
> > Comments inline:
> >
> > On 8/5/09 12:10 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
> >
> > > Flavio/Patrick/Mahadev -
> > >
> > > Thanks for your support to date. As I understand it, the sticky points
> > > w/ respect to WAN deployments are:
> > >
> > > 1. Leader Election:
> > >
> > > Leader election in the WAN config (pod zk server weight = 0) is a bit
> > > troublesome (ZOOKEEPER-498).
> >
> > Yes, until ZOOKEEPER-498 is fixed, you won't be able to use it with
> > groups and zero weight.
> >
> > > 2. Network Connectivity Required:
> > >
> > > ZooKeeper clients cannot read/write to ZK servers if the server does
> > > not have network connectivity to the quorum. In short, there is a hard
> > > requirement to have network connectivity in order for the clients to
> > > access the shared memory graph in ZK.
> >
> > Yes.
> >
> > > Alternative
> > > -----------
> > >
> > > I have seen some discussion in the past about multi-ensemble
> > > solutions. Essentially, put one ensemble in each physical location
> > > (POD), and another in your DC, and have a fairly simple process
> > > coordinate synchronizing the various ensembles. If the POD writes can
> > > be confined to a sub-tree in the master graph, then this should be
> > > fairly simple. I'm imagining the following:
> > >
> > > DC (master) graph:
> > > /root/pods/1/data/item1
> > > /root/pods/1/data/item2
> > > /root/pods/1/data/item3
> > > /root/pods/2
> > > /root/pods/3
> > > ...etc
> > > /root/shared/allpods/readonly/data/item1
> > > /root/shared/allpods/readonly/data/item2
> > > ...etc
> > >
> > > This has the advantage of minimizing cross-pod traffic, which could be
> > > a real perf killer in a WAN. It also provides transacted writes in the
> > > PODs, even in the disconnected state. Clearly, another portion of the
> > > business logic has to reconcile the DC (master) graph such that each
> > > of the pods' data items are processed, etc.
> > >
> > > Does anyone have any experience with this (pitfalls, suggestions, etc.)?
> >
> > As far as I understand, you mean having a master cluster with another
> > cluster in a different data center syncing with the master (just a
> > subtree)? Is that correct?
> >
> > If yes, this is what one of our users in Yahoo! Search does. They have a
> > master cluster and a smaller cluster in a different datacenter, and a
> > bridge that copies data from the master cluster (only a subtree) to the
> > smaller one and keeps them in sync.
>
> Yes, this is exactly what I'm proposing, with the addition that I'll sync
> subtrees in both directions and have a separate process reconcile data
> from the various pods, like so:
>
> # pod1 ensemble
> /root/a/b
>
> # pod2 ensemble
> /root/a/b
>
> # dc ensemble
> /root/shared/foo/bar
>
> # Mapping (modeled after perforce client config)
> # [ensemble]:[path] [ensemble]:[path]
> # sync pods to dc
> [POD1]:/root/... [DC]:/root/pods/POD1/...
> [POD2]:/root/... [DC]:/root/pods/POD2/...
> # sync dc to pods
> [DC]:/root/shared/... [POD1]:/shared/...
> [DC]:/root/shared/... [POD2]:/shared/...
> [DC]:/root/shared/... [POD3]:/shared/...
>
> Now, for our needs, we'd like the DC data aggregated, so I'll have another
> process handle aggregating the pod-specific data like so:
>
> POD Data Aggregator: aggregate data in [DC]:/root/pods/POD(N) to
> [DC]:/root/aggregated/data.
>
> This is just off the top of my head.
>
> -Todd
>
> > Thanks
> > mahadev
> >
> > > -Todd
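For context, a very rough sketch of what the pod-to-DC half of that bridge could look like against the standard ZooKeeper Java client API follows. The connection strings, the /root -> /root/pods/POD1 mapping, and the one-shot recursive copy (rather than a watch-driven continuous sync) are all illustrative assumptions, not something specified in the thread above:

    import java.util.List;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    /**
     * Minimal one-shot bridge sketch: copies a subtree from a source ensemble
     * (e.g. a POD) into a mapped path on a destination ensemble (e.g. the DC).
     * Assumes the destination parent path (e.g. /root/pods) already exists.
     */
    public class SubtreeBridge {

        private final ZooKeeper src;
        private final ZooKeeper dst;

        public SubtreeBridge(String srcConnect, String dstConnect) throws Exception {
            Watcher noop = new Watcher() {
                public void process(WatchedEvent event) { /* ignore for this sketch */ }
            };
            src = new ZooKeeper(srcConnect, 30000, noop);
            dst = new ZooKeeper(dstConnect, 30000, noop);
        }

        /** Recursively copy srcPath (source ensemble) to dstPath (destination ensemble). */
        public void copySubtree(String srcPath, String dstPath)
                throws KeeperException, InterruptedException {
            byte[] data = src.getData(srcPath, false, new Stat());

            if (dst.exists(dstPath, false) == null) {
                dst.create(dstPath, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } else {
                // -1 = overwrite regardless of version; a real bridge would track
                // versions/zxids so it doesn't clobber concurrent writes.
                dst.setData(dstPath, data, -1);
            }

            List<String> children = src.getChildren(srcPath, false);
            for (String child : children) {
                copySubtree(srcPath + "/" + child, dstPath + "/" + child);
            }
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical mapping: [POD1]:/root/... -> [DC]:/root/pods/POD1/...
            SubtreeBridge bridge = new SubtreeBridge("pod1-zk:2181", "dc-zk:2181");
            bridge.copySubtree("/root", "/root/pods/POD1");
        }
    }

A production bridge would presumably register watches (or poll) to pick up changes, handle deletes, and record versions so the copy stays idempotent when run in both directions, but the above is the basic shape of the per-subtree mapping idea.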