This is a preliminary proposal, so everything is still open. Still, I think it has many advantages over the previous namespace partitioning proposal (http://wiki.apache.org/hadoop/ZooKeeper/PartitionedZookeeper), which AFAIK was never implemented. The idea here is to make much smaller and more intuitive changes. For example, the previous proposal did not offer any ordering guarantees across partitions. Also, with a Linux mount you don't need to specify for each new file which mount point it belongs to; similarly, we can exploit the tree structure to infer the mount point instead of creating and maintaining an additional hierarchy as in the previous proposal.
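To make the "infer the mount point from the tree structure" idea a bit more concrete, here is a rough sketch (class and names are mine, purely illustrative, not part of the proposal): the cluster serving a znode is found by longest-prefix match of its path against the registered mount points, so a create never has to say which mount it belongs to.

    import java.util.Map;
    import java.util.TreeMap;

    public class MountResolver {
        // mount point path -> identifier of the remote cluster serving that subtree
        private final Map<String, String> mounts = new TreeMap<>();

        public void addMount(String mountPath, String clusterId) {
            mounts.put(mountPath, clusterId);
        }

        // Pick the cluster owning the longest mount-point prefix of 'path';
        // fall back to the local cluster when no mount point covers it.
        public String resolve(String path) {
            String owner = "local";
            int bestLen = -1;
            for (Map.Entry<String, String> e : mounts.entrySet()) {
                String mp = e.getKey();
                boolean covers = path.equals(mp) || path.startsWith(mp + "/");
                if (covers && mp.length() > bestLen) {
                    owner = e.getValue();
                    bestLen = mp.length();
                }
            }
            return owner;
        }

        public static void main(String[] args) {
            MountResolver r = new MountResolver();
            r.addMount("/apps/shared", "remote-clusterB");
            System.out.println(r.resolve("/apps/shared/locks/l-1")); // remote-clusterB
            System.out.println(r.resolve("/apps/localonly/config")); // local
        }
    }

A server-side create or delete could consult such a table, since the znode path already carries all the routing information.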
> what happens when a client does a read on the remote ZK cluster. does the
> read always get forwarded to the remote cluster?

No. The idea is to identify when inter-cluster communication is necessary to
maintain sequential consistency, and to avoid it otherwise. In the twiki we
propose one possible rule. For example, if you read from a remote partition
that didn't mount any part of your local namespace, it is OK to return an old
value. In any case, the read is never forwarded to the remote cluster - even
when inter-cluster communication is necessary, we sync the observer with the
remote leader and then read from the observer.

> in your proposal, what happens if a client creates an ephemeral
> node on the remote ZK cluster. who does the failure detection and clean up?

You're right, we should definitely address that in the twiki. I think that in
any case a cluster should only monitor the clients connected to that cluster,
not clients connected to remote clusters. So if we support creating remote
ephemeral nodes, I think failure detection should be done locally, and the
remote cluster should subscribe to the relevant local failure events and be
notified.

> what happens if the request to the remote cluster hangs?

The user can choose the behavior in this case. If they want all subsequent
requests to fail as well, the remote request blocks every request that follows
it. Otherwise, the remote request can fail while subsequent local requests
still succeed.

Thanks,
Alex

> -----Original Message-----
> From: Benjamin Reed [mailto:[email protected]]
> Sent: Thursday, June 09, 2011 4:05 PM
> To: [email protected]
> Subject: Re: Mounting a remote Zookeeper
>
> this is a small nit, but i think the partition proposal works a bit
> more like a mount point than your proposal. when you mount a file
> system, the mount isn't transparent. two mounted file systems can have
> files with the same inode number, for example. you also can't do some
> things like a rename across file system boundaries.
>
> in your proposal, what happens if a client creates an ephemeral
> node on the remote ZK cluster. who does the failure detection and
> clean up? it also wasn't clear what happens when a client does a read
> on the remote ZK cluster. does the read always get forwarded to the
> remote cluster? also what happens if the request to the remote cluster
> hangs?
>
> thanx
> ben
>
> On Thu, Jun 9, 2011 at 11:41 AM, Alexander Shraer <[email protected]> wrote:
> > Hi,
> >
> > We're considering working on a new feature that will allow "mounting"
> > part of the namespace of one ZK cluster into another ZK cluster. The
> > goal is essentially to be able to partition a ZK namespace while
> > preserving current ZK semantics as much as possible.
> > More details are here:
> > http://wiki.apache.org/hadoop/ZooKeeper/MountRemoteZookeeper
> >
> > It would be great to get your feedback and especially please let us
> > know if you think your application can benefit from this feature.
> >
> > Thanks,
> > Alex Shraer and Eddie Bortnikov
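Regarding the "sync the observer with the remote leader and then read from the observer" answer above: from a client's point of view the effect is similar to ZooKeeper's existing sync-then-read pattern. A minimal client-side sketch of that pattern follows, for illustration only; in the proposal the equivalent step would presumably happen server-side, between the local observer and the remote leader.

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.AsyncCallback;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class SyncThenRead {
        // Ask the server this client is connected to (e.g. an observer) to
        // catch up with the leader, then read locally; the read itself is
        // never forwarded.
        public static byte[] readUpToDate(ZooKeeper zk, String path)
                throws KeeperException, InterruptedException {
            final CountDownLatch synced = new CountDownLatch(1);
            zk.sync(path, new AsyncCallback.VoidCallback() {
                public void processResult(int rc, String p, Object ctx) {
                    synced.countDown();
                }
            }, null);
            synced.await();
            Stat stat = new Stat();
            return zk.getData(path, false, stat);
        }
    }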

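On the ephemeral-node question, one possible shape (entirely hypothetical, names invented here) for "the remote cluster subscribes to relevant local failure events" is a small listener that the remote cluster registers with the cluster owning the client's session:

    // Hypothetical sketch: the cluster that owns a client's session detects the
    // failure and notifies remote clusters holding ephemerals created by that session.
    public interface SessionFailureListener {
        // Called by the local cluster when a session it monitors expires;
        // the remote cluster then deletes the ephemerals owned by that session.
        void sessionExpired(long sessionId);
    }

The local cluster keeps doing the heartbeat-based failure detection it already does; the remote cluster only reacts to the expiry notification by cleaning up the ephemerals that the session created there.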