Better get comfortable, everyone, because I might ramble on for a bit.

Over the last few days, I've been looking into the issue of how to manage our 
own instances of etcd (or something similar) as part of our 4.0 configuration 
store.  This is highly relevant for GlusterD 2.0, which would be both a 
consumer of the service and (possibly) a manager for the daemons that provide 
it.  It's also relevant for NSR, which needs a similar kind of highly available, 
strongly consistent store for information about terms.  Just about any other 
component could take good advantage of such a facility if it were available 
(DHT 2.0 might use it for layout information, for example), and I encourage 
anyone working on 4.0 to think about how it could make their components simpler.  
(BTW, Shyam, that's just a hypothetical example.  Don't take it any more 
seriously than you want to.)

This is not the first time I've looked into this.  During the previous round of 
NSR development, I implemented some code to manage etcd daemons from within 
GlusterD:

    http://review.gluster.org/#/c/8887/

That code's junk.  We shouldn't use anything more than small pieces of it.  
Among other problems, it nukes the etcd information when a new node joins.  
That was fine for what we were doing with NSR at the time, but clearly can't 
work in real life.  I've also been looking at the new-ish etcd interfaces for 
cluster management:

    https://github.com/coreos/etcd/blob/master/Documentation/other_apis.md

I'm pretty sure these didn't exist when I was last looking at this stuff, but I 
could be wrong.  In any case, they look pretty nice.  Much like our own "probe" 
mechanism, they appear to let us start a single-node cluster and then add other 
nodes to it by talking to one of the current members.  In fact, that 
similarity suggests how we might manage our instances of etcd.

(1) Each GlusterD *initially* starts its own private instance of etcd.

(2) When we probe from a node X to a node Y, the probe message includes 
information about X's etcd server(s).

(3) Upon receipt of a probe, Y can (depending on a flag) either *use* X's etcd 
cluster or *join* it.  Either way, it has to shut down its own one-node 
cluster.  In the JOIN case, X must also send the appropriate etcd command 
(adding Y as a member) to its local instance, which will propagate it to the 
others.

(4) Therefore, the CLI/REST interfaces to initiate a probe need an option to 
control this join/use flag.  The default should be JOIN, which suits small 
clusters where it's no problem for every node to be an etcd server as well.

(5) For larger clusters, the administrator might switch from JOIN to USE once 
enough nodes are already etcd servers, since a larger etcd cluster needs a 
larger quorum for every write.  There might also need to be separate CLI/REST 
interfaces to toggle this state without any probe involved.

(6) For detach/deprobe, we simply undo the things we did in (3).
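
A minimal in-memory sketch of steps (1) through (6) may help make the JOIN/USE 
distinction concrete.  Everything in it is hypothetical: the `GlusterD` and 
`EtcdCluster` classes, `probe()`, `detach()` and the addresses are illustrations, 
not GlusterD code.  The only piece modelled on a real interface is the 
member-add request, which assumes the etcd v2 members API (`POST /v2/members` 
with a `peerURLs` list) described at the URL above; the request is built but 
never sent.

```python
import json
import urllib.request

JOIN, USE = "join", "use"


class EtcdCluster:
    """In-memory stand-in for an etcd cluster: member name -> peer URL."""

    def __init__(self, name, peer_url):
        self.members = {name: peer_url}


class GlusterD:
    """Hypothetical model of one GlusterD and the etcd cluster it relies on."""

    def __init__(self, name, client_url, peer_url):
        self.name = name
        self.client_url = client_url  # this node's etcd client endpoint
        self.peer_url = peer_url      # this node's etcd peer endpoint
        # Step (1): each GlusterD initially runs a private one-node cluster.
        self.cluster = EtcdCluster(name, peer_url)

    def is_etcd_server(self):
        return self.name in self.cluster.members


def member_add_request(client_url, new_peer_url):
    """Build (but do not send) an etcd v2 members-API add-member request."""
    body = json.dumps({"peerURLs": [new_peer_url]}).encode()
    return urllib.request.Request(
        client_url + "/v2/members",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def probe(x, y, mode=JOIN):
    """Steps (2)-(3): X probes Y, and the probe carries X's etcd servers.

    Returns the member-add request X would send in the JOIN case, else None.
    """
    # Either way, Y shuts down its own one-node cluster; in this model it
    # simply adopts X's cluster object instead of a list of servers.
    y.cluster = x.cluster
    if mode != JOIN:
        return None
    # JOIN: X asks its local etcd to add Y; etcd propagates it to the others.
    request = member_add_request(x.client_url, y.peer_url)
    x.cluster.members[y.name] = y.peer_url
    return request


def detach(y):
    """Step (6): undo step (3); Y returns to a private one-node cluster."""
    # If Y had joined, drop it from the shared membership first (real etcd
    # would need a separate member-remove call, not modelled here).
    y.cluster.members.pop(y.name, None)
    y.cluster = EtcdCluster(y.name, y.peer_url)
```

With `probe(a, b, JOIN)`, B becomes a second etcd server in A's cluster; with 
`probe(a, c, USE)`, C only talks to that cluster as a client.  In a real 
deployment, the joining node's etcd would then be started with 
`--initial-cluster-state existing` so that it joins the cluster rather than 
bootstrapping a new one.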

With all of this in place, probes would become one-time exchanges.  There's no 
need for GlusterD daemons to keep probing each other when they can just "check 
in" with etcd (which is doing something very similar internally).  Instead of 
constantly sending its own probe/heartbeat messages and keeping track of which 
other nodes' messages have been missed, each GlusterD would simply use its 
node UUID to create a time-limited key in etcd, refresh that key periodically, 
and issue watches on other nodes' keys.  If a node stops refreshing, its key 
expires and everyone watching it is notified.  This is not quite as convenient 
as ZooKeeper's ephemeral nodes, which vanish automatically when the owning 
session ends, but it's still a lot better than what we're doing now.
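
The check-in scheme above can be sketched with a small in-memory stand-in for 
etcd's TTL keys and watches.  `TTLStore`, `heartbeat()`, and the 
`/gluster/nodes/<uuid>` key layout are all hypothetical.  Time is passed in 
explicitly so the behaviour is deterministic, and the "expire" event name 
loosely mirrors what etcd's v2 watches report when a TTL key lapses.

```python
from collections import defaultdict


def node_key(uuid):
    """Hypothetical key layout: one time-limited key per GlusterD node UUID."""
    return "/gluster/nodes/" + uuid


class TTLStore:
    """In-memory stand-in for etcd keys with a TTL, plus per-key watches."""

    def __init__(self):
        self._expiry = {}                   # key -> absolute expiry time
        self._watchers = defaultdict(list)  # key -> callbacks(key, event)

    def put_ttl(self, key, ttl, now):
        """Create or refresh a key that disappears ttl seconds after now."""
        self._expiry[key] = now + ttl

    def watch(self, key, callback):
        """Register callback(key, event) to run when key expires."""
        self._watchers[key].append(callback)

    def tick(self, now):
        """Expire every overdue key and notify its watchers."""
        expired = sorted(k for k, t in self._expiry.items() if t <= now)
        for key in expired:
            del self._expiry[key]
            for callback in self._watchers.get(key, ()):
                callback(key, "expire")

    def live_keys(self):
        return sorted(self._expiry)


def heartbeat(store, uuid, now, ttl=10):
    """What each GlusterD would do periodically, well inside the TTL."""
    store.put_ttl(node_key(uuid), ttl, now)
```

The refresh interval has to be comfortably shorter than the TTL, so that a 
single delayed heartbeat doesn't make a healthy node look dead; the TTL itself 
then bounds how long it takes the others to notice a node that really is gone.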

I'd be tempted to implement this myself, but for now it's probably more 
important to work on NSR itself, and for that I can just use an external etcd 
cluster instead.  Maybe later in the 4.0 integration phase, if nobody else has 
beaten me to it, I'll take a swing at it.  Until then, does anyone else have 
any thoughts on the proposal?
_______________________________________________
Gluster-devel mailing list
[email protected]
http://www.gluster.org/mailman/listinfo/gluster-devel