Mahadev, comments inline:

> -----Original Message-----
> From: Mahadev Konar [mailto:maha...@yahoo-inc.com]
> Sent: Wednesday, August 05, 2009 1:47 PM
> To: zookeeper-dev@hadoop.apache.org
> Subject: Re: Optimized WAN ZooKeeper Config : Multi-Ensemble configuration
>
> Todd,
> Comments inline:
>
>
> On 8/5/09 12:10 PM, "Todd Greenwood" <to...@audiencescience.com> wrote:
>
> > Flavio/Patrick/Mahadev -
> >
> > Thanks for your support to date. As I understand it, the sticky points
> > w/ respect to WAN deployments are:
> >
> > 1. Leader Election:
> >
> > Leader elections in the WAN config (pod zk server weight = 0) are a bit
> > troublesome (ZOOKEEPER-498).
> Yes, until ZOOKEEPER-498 is fixed, you won't be able to use it with groups
> and zero weight.
> >
> >
> > 2. Network Connectivity Required:
> >
> > ZooKeeper clients cannot read/write to ZK servers if the server does not
> > have network connectivity to the quorum. In short, there is a hard
> > requirement for network connectivity in order for clients to access the
> > shared memory graph in ZK.
> Yes.
> >
> >
> > Alternative
> > -----------
> >
> > I have seen some discussion in the past about multi-ensemble solutions.
> > Essentially, put one ensemble in each physical location (POD), and
> > another in your DC, and have a fairly simple process coordinate
> > synchronizing the various ensembles. If the POD writes can be confined
> > to a sub-tree in the master graph, then this should be fairly simple.
> > I'm imagining the following:
> >
> > DC (master) graph:
> > /root/pods/1/data/item1
> > /root/pods/1/data/item2
> > /root/pods/1/data/item3
> > /root/pods/2
> > /root/pods/3
> > ...etc
> > /root/shared/allpods/readonly/data/item1
> > /root/shared/allpods/readonly/data/item2
> > ...etc
> >
> > This has the advantage of minimizing cross-pod traffic, which could be a
> > real perf killer in a WAN. It also provides transacted writes in the
> > PODs, even in the disconnected state. Clearly, another portion of the
> > business logic has to reconcile the DC (master) graph such that each of
> > the pods' data items is processed, etc.
> >
> > Does anyone have any experience with this (pitfalls, suggestions, etc.)?
> As far as I understand, you mean having a master cluster, with another
> cluster in a different data center syncing with the master (just a
> subtree)? Is that correct?
>
> If yes, this is what one of our users in Yahoo! Search does. They have a
> master cluster and a smaller cluster in a different datacenter, and a
> bridge that copies data from the master cluster (only a subtree) to the
> smaller one and keeps them in sync.
>
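An aside on point 1 before I get to the multi-ensemble question: the
"groups and zero weight" setup referred to above is ZooKeeper's
hierarchical quorum configuration. A rough zoo.cfg fragment, with
placeholder hostnames and server ids, would look something like:

# DC ensemble members, full voting weight
server.1=dc-zk1:2888:3888
server.2=dc-zk2:2888:3888
server.3=dc-zk3:2888:3888
# POD server, zero weight so its vote never counts toward quorum
server.4=pod1-zk1:2888:3888

group.1=1:2:3
group.2=4

weight.1=1
weight.2=1
weight.3=1
weight.4=0

That is the shape of config that currently trips over ZOOKEEPER-498.

On the multi-ensemble question: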
Yes, this is exactly what I'm proposing. With the addition that I'll sync
subtrees in both directions, and have a separate process reconcile data
from the various pods, like so:

#pod1 ensemble
/root/a/b

#pod2 ensemble
/root/a/b

#dc ensemble
/root/shared/foo/bar

# Mapping (modeled after perforce client config)
# [ensemble]:[path]        [ensemble]:[path]

# sync pods to dc
[POD1]:/root/...           [DC]:/root/pods/POD1/...
[POD2]:/root/...           [DC]:/root/pods/POD2/...

# sync dc to pods
[DC]:/root/shared/...      [POD1]:/shared/...
[DC]:/root/shared/...      [POD2]:/shared/...
[DC]:/root/shared/...      [POD3]:/shared/...

Now, for our needs, we'd like the DC data aggregated, so I'll have another
process handle aggregating the pod specific data like so:

POD Data Aggregator: aggregate data in [DC]:/root/pods/POD(N) to
[DC]:/root/aggregated/data.

This is just off the top of my head.

-Todd

>
> Thanks
> mahadev
>
>
> > -Todd
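P.S. For what it's worth, here is roughly the shape of the one-way bridge
step I have in mind, against the plain Java ZooKeeper client. Connect
strings, paths, and the session timeout below are placeholders, and it is
a one-shot recursive copy only -- no watches, incremental sync, or delete
propagation, all of which a real bridge would need:

// SubtreeBridge.java -- rough sketch of one sync step (POD subtree -> DC).
import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class SubtreeBridge {

    // Block until the client session is actually connected.
    static ZooKeeper connect(String hosts) throws Exception {
        final CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper(hosts, 30000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();
        return zk;
    }

    // Recursively copy the znode at srcPath (on src) to dstPath (on dst).
    // Assumes dstPath's parent already exists on the destination ensemble.
    static void copySubtree(ZooKeeper src, String srcPath,
                            ZooKeeper dst, String dstPath)
            throws KeeperException, InterruptedException {
        byte[] data = src.getData(srcPath, false, null);
        if (dst.exists(dstPath, false) == null) {
            dst.create(dstPath, data, Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } else {
            dst.setData(dstPath, data, -1);  // -1 = ignore the version check
        }
        List<String> children = src.getChildren(srcPath, false);
        for (String child : children) {
            copySubtree(src, srcPath + "/" + child, dst, dstPath + "/" + child);
        }
    }

    public static void main(String[] args) throws Exception {
        ZooKeeper pod = connect("pod1-zk1:2181");  // POD1 ensemble
        ZooKeeper dc  = connect("dc-zk1:2181");    // DC ensemble
        // [POD1]:/root/...  ->  [DC]:/root/pods/POD1/...
        copySubtree(pod, "/root", dc, "/root/pods/POD1");
        pod.close();
        dc.close();
    }
}

Each line in the mapping above would translate into one copySubtree() call
like the one in main(); the DC-to-POD "shared" mappings are the same call
with the source and destination ensembles swapped.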