This is a preliminary proposal, so everything is still open. Still, I think it has many advantages over the previous namespace partitioning proposal (http://wiki.apache.org/hadoop/ZooKeeper/PartitionedZookeeper), which AFAIK was never implemented. The idea here is to make much smaller and more intuitive changes. For example, the previous proposal did not offer any ordering guarantees across partitions. Also, with a Linux mount you don't need to specify for each new file which mount point it belongs to; similarly, we can exploit the tree structure to infer the mount point instead of creating and maintaining an additional hierarchy as in the previous proposal.
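To make the "infer the mount point from the tree structure" idea a bit more concrete, here is a rough sketch (class and names are mine, purely illustrative, not part of the proposal): the cluster serving a znode is found by longest-prefix match of its path against the registered mount points, so a create never has to say which mount it belongs to.

    import java.util.Map;
    import java.util.TreeMap;

    public class MountResolver {
        // mount point path -> identifier of the remote cluster serving that subtree
        private final Map<String, String> mounts = new TreeMap<>();

        public void addMount(String mountPath, String clusterId) {
            mounts.put(mountPath, clusterId);
        }

        // Pick the cluster owning the longest mount-point prefix of 'path';
        // fall back to the local cluster when no mount point covers it.
        public String resolve(String path) {
            String owner = "local";
            int bestLen = -1;
            for (Map.Entry<String, String> e : mounts.entrySet()) {
                String mp = e.getKey();
                boolean covers = path.equals(mp) || path.startsWith(mp + "/");
                if (covers && mp.length() > bestLen) {
                    owner = e.getValue();
                    bestLen = mp.length();
                }
            }
            return owner;
        }

        public static void main(String[] args) {
            MountResolver r = new MountResolver();
            r.addMount("/apps/shared", "remote-clusterB");
            System.out.println(r.resolve("/apps/shared/locks/l-1")); // remote-clusterB
            System.out.println(r.resolve("/apps/localonly/config")); // local
        }
    }

A server-side create or delete could consult such a table, since the znode path already carries all the routing information.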
> what happens when a client does a read on the remote ZK cluster. does the
> read always get forwarded to the remote cluster?

No. The idea is to identify when inter-cluster communication is necessary to
maintain sequential consistency, and to avoid it otherwise. In the twiki we
propose one possible rule. For example, if you read from a remote partition
that didn't mount any part of your local namespace, it is OK to return an old
value. In any case, the read is never forwarded to the remote cluster - even
when inter-cluster communication is necessary, we sync the observer with the
remote leader and then read from the observer.

> in your proposal, what happens if a client creates an ephemeral
> node on the remote ZK cluster. who does the failure detection and clean up?

You're right, we should definitely address that in the twiki. I think that in
any case a cluster should only monitor the clients connected to that cluster,
not clients connected to remote clusters. So if we support creating remote
ephemeral nodes, I think failure detection should be done locally, and the
remote cluster should subscribe to the relevant local failure events and be
notified.

> what happens if the request to the remote cluster hangs?

The user can choose the behavior in this case. If they want all subsequent
requests to fail as well, the remote request blocks every request that follows
it. Otherwise, the remote request can fail while subsequent local requests
still succeed.

Thanks,
Alex

> -----Original Message-----
> From: Benjamin Reed [mailto:[email protected]]
> Sent: Thursday, June 09, 2011 4:05 PM
> To: [email protected]
> Subject: Re: Mounting a remote Zookeeper
>
> this is a small nit, but i think the partition proposal works a bit
> more like a mount point than your proposal. when you mount a file
> system, the mount isn't transparent. two mounted file systems can have
> files with the same inode number, for example. you also can't do some
> things like a rename across file system boundaries.
>
> in your proposal, what happens if a client creates an ephemeral
> node on the remote ZK cluster. who does the failure detection and
> clean up? it also wasn't clear what happens when a client does a read
> on the remote ZK cluster. does the read always get forwarded to the
> remote cluster? also what happens if the request to the remote cluster
> hangs?
>
> thanx
> ben
>
> On Thu, Jun 9, 2011 at 11:41 AM, Alexander Shraer <[email protected]> wrote:
> > Hi,
> >
> > We're considering working on a new feature that will allow "mounting"
> > part of the namespace of one ZK cluster into another ZK cluster. The
> > goal is essentially to be able to partition a ZK namespace while
> > preserving current ZK semantics as much as possible.
> > More details are here:
> > http://wiki.apache.org/hadoop/ZooKeeper/MountRemoteZookeeper
> >
> > It would be great to get your feedback and especially please let us
> > know if you think your application can benefit from this feature.
> >
> > Thanks,
> > Alex Shraer and Eddie Bortnikov
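Regarding the "sync the observer with the remote leader and then read from the observer" answer above: from a client's point of view the effect is similar to ZooKeeper's existing sync-then-read pattern. A minimal client-side sketch of that pattern follows, for illustration only; in the proposal the equivalent step would presumably happen server-side, between the local observer and the remote leader.

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.AsyncCallback;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class SyncThenRead {
        // Ask the server this client is connected to (e.g. an observer) to
        // catch up with the leader, then read locally; the read itself is
        // never forwarded.
        public static byte[] readUpToDate(ZooKeeper zk, String path)
                throws KeeperException, InterruptedException {
            final CountDownLatch synced = new CountDownLatch(1);
            zk.sync(path, new AsyncCallback.VoidCallback() {
                public void processResult(int rc, String p, Object ctx) {
                    synced.countDown();
                }
            }, null);
            synced.await();
            Stat stat = new Stat();
            return zk.getData(path, false, stat);
        }
    }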

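On the ephemeral-node question, one possible shape (entirely hypothetical, names invented here) for "the remote cluster subscribes to relevant local failure events" is a small listener that the remote cluster registers with the cluster owning the client's session:

    // Hypothetical sketch: the cluster that owns a client's session detects the
    // failure and notifies remote clusters holding ephemerals created by that session.
    public interface SessionFailureListener {
        // Called by the local cluster when a session it monitors expires;
        // the remote cluster then deletes the ephemerals owned by that session.
        void sessionExpired(long sessionId);
    }

The local cluster keeps doing the heartbeat-based failure detection it already does; the remote cluster only reacts to the expiry notification by cleaning up the ephemerals that the session created there.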