Better get comfortable, everyone, because I might ramble on for a bit.
Over the last few days, I've been looking into the issue of how to manage our
own instances of etcd (or something similar) as part of our 4.0 configuration
store. This is highly relevant for GlusterD 2.0, which would be both a
consumer of the service and (possibly) a manager for the daemons that provide
it. It's also relevant for NSR, which needs a similar kind of highly-available,
highly-consistent store for information about terms. Just about any other
component might be able to take good advantage of such a facility if it were
available, such as DHT 2.0 using it for layout information, and I encourage
anyone working on 4.0 to think about how it can make other components simpler.
(BTW, Shyam, that's just a hypothetical example. Don't take it any more
seriously than you want to.)
This is not the first time I've looked into this. During the previous round of
NSR development, I implemented some code to manage etcd daemons from within
GlusterD:
http://review.gluster.org/#/c/8887/
That code's junk. We shouldn't use anything more than small pieces of it.
Among other problems, it nukes the etcd information when a new node joins.
That was fine for what we were doing with NSR at the time, but clearly can't
work in real life. I've also been looking at the new-ish etcd interfaces for
cluster management:
https://github.com/coreos/etcd/blob/master/Documentation/other_apis.md
I'm pretty sure these didn't exist when I was last looking at this stuff, but I
could be wrong. In any case, they look pretty nice. Much like our own "probe"
mechanism, it looks like we can start a single-node cluster and then add others
into that cluster by talking to one of the current members. In fact, that
similarity suggests how we might manage our instances of etcd.
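
For anyone who hasn't looked at that document: adding a member basically comes
down to POSTing the new node's peer URL to any current member. Here's a minimal
sketch of that call in Go; the host names, ports, and the addEtcdMember helper
are all made up for illustration, and only the /v2/members route comes from the
etcd docs.

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

// addEtcdMember asks any current member of the cluster (via its client URL)
// to admit a new member that will handle peer traffic on newPeerURL.
func addEtcdMember(currentMemberURL, newPeerURL string) error {
    body, err := json.Marshal(map[string][]string{"peerURLs": {newPeerURL}})
    if err != nil {
        return err
    }
    resp, err := http.Post(currentMemberURL+"/v2/members",
        "application/json", bytes.NewReader(body))
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusCreated {
        return fmt.Errorf("member add failed: %s", resp.Status)
    }
    return nil
}

func main() {
    // Assuming etcd's default ports: 2379 for clients, 2380 for peers.
    if err := addEtcdMember("http://node-x:2379", "http://node-y:2380"); err != nil {
        fmt.Println(err)
    }
}

If I'm reading the runtime-reconfiguration docs right, the new node then has to
start its own etcd with --initial-cluster-state=existing (and the current
member list) instead of bootstrapping a fresh cluster.
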
(1) Each GlusterD *initially* starts its own private instance of etcd.
(2) When we probe from a node X to a node Y, the probe message includes
information about X's etcd server(s).
(3) Upon receipt of a probe, Y can (depending on a flag) either *use* X's etcd
cluster or *join* it. Either way, it has to shut down its own one-node
cluster. In the JOIN case, this implies that X will send the appropriate etcd
command to its local instance (from whence it will be propagated to the
others); there's a rough sketch of this after the list.
(4) Therefore, the CLI/REST interfaces to initiate a probe need an option to
control this join/use flag. Default should be JOIN for small clusters, where
it's not a problem for all nodes to be etcd servers as well.
(5) For larger clusters, the administrator might start to specify USE instead
of JOIN after a while. There might also need to be separate CLI/REST
interfaces to toggle this state without any probe involved.
(6) For detach/deprobe, we simply undo the things we did in (3).
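
To make step (3) a bit more concrete, here's a rough sketch of what Y's side of
the probe might look like. None of these types or helpers exist anywhere; the
stubs just stand in for "stop/start the local etcd daemon" and "repoint our
config-store client", so the only thing really shown here is the shape of the
join/use decision.

package probe

import "fmt"

type ProbeMode int

const (
    ProbeJoin ProbeMode = iota // become another etcd server in X's cluster
    ProbeUse                   // only become a client of X's cluster
)

// ProbeRequest is what X sends to Y, carrying X's etcd endpoints.
type ProbeRequest struct {
    Mode          ProbeMode
    EtcdEndpoints []string // client URLs of X's existing etcd cluster
}

// Placeholder helpers; real code would manage the etcd daemon itself,
// e.g. restarting it with --initial-cluster-state=existing for a join.
func stopLocalEtcd() error                         { return nil }
func repointConfigStore(endpoints []string) error  { return nil }
func joinExistingCluster(endpoints []string) error { return nil }

func handleProbe(req ProbeRequest) error {
    // Either way, Y's private one-node etcd cluster goes away.
    if err := stopLocalEtcd(); err != nil {
        return err
    }
    switch req.Mode {
    case ProbeUse:
        // Just talk to X's cluster as a client from now on.
        return repointConfigStore(req.EtcdEndpoints)
    case ProbeJoin:
        // Once the member-add call (shown earlier) has gone to one of the
        // current members, restart our etcd as part of the existing cluster.
        return joinExistingCluster(req.EtcdEndpoints)
    default:
        return fmt.Errorf("unknown probe mode %d", req.Mode)
    }
}

Detach in (6) would presumably be the mirror image: a member-remove call
through the same API, plus restarting the node's own private one-node etcd.
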
With all of this in place, probes would become one-time exchanges. There's no
need for GlusterD daemons to keep probing each other when they can just "check
in" with etcd (which is doing something very similar internally). Instead of
constantly sending its own probe/heartbeat messages and keeping track of which
other nodes' messages have been missed, each GlusterD would simply use its
node UUID to create a time-limited key in etcd, and issue watches on other
nodes' keys. This is not quite as convenient as ZooKeeper's ephemerals, but
it's still a lot better than what we're doing now.
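
For what it's worth, the etcd side of that check-in is pretty small. Here's a
rough sketch against the v2 keys API over plain HTTP; the key layout, TTL
values, and endpoint are all invented, and a real implementation would
presumably use a client library, loop on the watch, and track wait indices
rather than just printing something.

package main

import (
    "fmt"
    "net/http"
    "net/url"
    "strings"
    "time"
)

const etcdURL = "http://127.0.0.1:2379" // wherever our etcd servers live

// heartbeat keeps /gluster/peers/<uuid> alive with a short TTL; if this
// node dies, the key just expires and anyone watching it finds out.
func heartbeat(nodeUUID string) {
    key := etcdURL + "/v2/keys/gluster/peers/" + nodeUUID
    body := url.Values{"value": {"alive"}, "ttl": {"10"}}.Encode()
    for {
        req, _ := http.NewRequest("PUT", key, strings.NewReader(body))
        req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
        if resp, err := http.DefaultClient.Do(req); err != nil {
            fmt.Println("heartbeat failed:", err)
        } else {
            resp.Body.Close()
        }
        time.Sleep(5 * time.Second) // refresh well inside the 10s TTL
    }
}

// watchPeer blocks until the next change to another node's key (a refresh
// or an expiry), using etcd's long-polling watch.
func watchPeer(peerUUID string) {
    resp, err := http.Get(etcdURL + "/v2/keys/gluster/peers/" + peerUUID + "?wait=true")
    if err != nil {
        fmt.Println("watch failed:", err)
        return
    }
    resp.Body.Close()
    fmt.Println("peer", peerUUID, "changed state")
}

func main() {
    go watchPeer("some-other-node-uuid")
    heartbeat("this-node-uuid")
}
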
I'd be tempted to implement this myself, but for now it's probably more
important to work on NSR itself and for that I can just use an external etcd
cluster instead. Maybe later in the 4.0 integration phase, if nobody else has
beaten me to it, I'll take a swing at it. Until then, does anyone else have
any thoughts on the proposal?