On Wed, Aug 7, 2013 at 1:50 PM, Guido Trotter <[email protected]> wrote:
> On Tue, Aug 6, 2013 at 6:29 PM, Iustin Pop <[email protected]> wrote:
> > On Tue, Aug 06, 2013 at 03:56:24PM +0000, Michele Tartara wrote:
> >> This describes the future planned structure of Ganeti daemons.
> >>
> >> Signed-off-by: Michele Tartara <[email protected]>
> >> ---
> >>  Makefile.am            |   1 +
> >>  doc/design-daemons.rst | 236 ++++++++++++++++++++++++++++++++++++++++++++++++
> >>  doc/design-draft.rst   |   1 +
> >>  3 files changed, 238 insertions(+)
> >>  create mode 100644 doc/design-daemons.rst
> >>
> >> diff --git a/Makefile.am b/Makefile.am
> >> index 531197c..7714052 100644
> >> --- a/Makefile.am
> >> +++ b/Makefile.am
> >> @@ -422,6 +422,7 @@ docinput = \
> >>   doc/design-cpu-pinning.rst \
> >>   doc/design-device-uuid-name.rst \
> >>   doc/design-draft.rst \
> >> + doc/design-daemons.rst \
> >>   doc/design-htools-2.3.rst \
> >>   doc/design-http-server.rst \
> >>   doc/design-impexp2.rst \
> >> diff --git a/doc/design-daemons.rst b/doc/design-daemons.rst
> >> new file mode 100644
> >> index 0000000..e10c942
> >> --- /dev/null
> >> +++ b/doc/design-daemons.rst
> >> @@ -0,0 +1,236 @@
> >> +==========================
> >> +Ganeti daemons refactoring
> >> +==========================
> >> +
> >> +.. contents:: :depth: 2
> >> +
> >> +This is a design document detailing the plan for refactoring the internal
> >> +structure of Ganeti, and particularly the set of daemons it is divided into.
> >> +
> >> +
> >> +Current state and shortcomings
> >> +==============================
> >> +
> >> +Ganeti is comprised of a growing number of daemons, each dealing with part of
> >> +the tasks the cluster has to face, and communicating with the other daemons
> >> +using a variety of protocols.
> >> +
> >> +Specifically, as of Ganeti 2.8, the situation is as follows:
> >> +
> >> +``Master daemon (MonD)``
> >
> > MonD→typo?
> >
>

Yes, of course.

> >> +  It is responsible for managing the entire cluster, and it's written in Python.
> >> +  It is executed on a single node (the master node). It receives the commands
> >> +  given by the cluster administrator (through the remote API daemon or the
> >> +  command line tools) over the LUXI protocol. The master daemon is responsible
> >> +  for creating and managing the jobs that will execute such commands, and for
> >> +  managing the locks that ensure the cluster will not incur in race conditions.
> >> +
> >> +  Each job is managed by a separate Python thread, that interacts with the node
> >> +  daemons via RPC calls.
> >> +
> >> +  The master daemon is also responsible for managing the configuration of the
> >> +  cluster, changing it when required by some job. It is also responsible for
> >> +  copying the configuration to the other master candidates after updating it.
> >> +
> >> +``RAPI daemon (RapiD)``
> >> +  It is written in Python and runs on the master node only. It waits for
> >> +  requests issued remotely through the remote API protocol. Then, it forwards
> >> +  them, using the LUXI protocol, to the master daemon (if they are commands) or
> >> +  to the query daemon if they are queries about the configuration (including
> >> +  live status) of the cluster.
> >> +
> >> +``Node daemon (NodeD)``
> >> +  It is written in Python. It runs on the VM-capable nodes. It is responsible
> >> +  for receiving the master requests over RPC and execute them, using the
> >> +  appropriate backend (hypervisors, DRBD, LVM, etc.). It also receives requests
> >> +  over RPC for the execution of queries gathering live data on behalf of the
> >> +  query daemon.
> >> +
> >> +``Configuration daemon (ConfD)``
> >> +  It is written in Haskell. It runs on all the master candidates. Since the
> >> +  configuration is replicated only on the master node, this daemon exists in
> >> +  order to provide information about the configuration to nodes needing them.
> >> +  The requests are done through ConfD's own protocol, HMAC signed,
> >> +  implemented over UDP, and meant to be used by querying in parallel all the
> >> +  master candidates (or a subset thereof) and getting the most up-to-date
> >> +  answer. This is meant as a way to provide a robust service even in case master
> >> +  is temporarily unavailable.
> >> +
> >> +``Query daemon (QueryD)``
> >> +  It is written in Haskell. It runs on all the master candidates. It replies
> >> +  to Luxi queries about the current status of the system, including live data it
> >> +  obtains by querying the node daemons through RPCs.
> >> +
> >> +``Monitoring daemon (MonD)``
> >> +  It is written in Haskell. It runs on all nodes, including the ones that are
> >> +  not vm-capable. It is meant to provide information on the status of the
> >> +  system. Such information is related only to the specific node the daemon is
> >> +  running on, and it is provided as JSON encoded data over HTTP, to be easily
> >> +  readable by external tools.
> >> +  The monitoring daemon communicates with ConfD to get information about the
> >> +  configuration of the cluster. The choice of communicating with ConfD instead
> >> +  of MasterD allows it to obtain configuration information even when the cluster
> >> +  is heavily degraded (e.g.: when master and some, but not all, of the master
> >> +  candidates are unreachable).
> >> +
> >> +The current structure of the Ganeti daemons is inefficient because there are
> >> +many different protocols involved, and each daemon needs to be able to use
> >> +multiple ones, and has to deal with doing different things, thus making it
> >> +sometimes unclear which daemon is responsible for performing a specific task.
> >> +
> >> +Also, with the current configuration, jobs are managed by the master daemon
> >> +using python threads. This makes terminating a job after it has started a
> >> +difficult operation, and it is the main reason why this is not possible yet.
> >> +
> >> +The master daemon currently has too many different tasks, that could be handled
> >> +better if split among different daemons.
> >> +
> >> +
> >> +Proposed changes
> >> +================
> >> +
> >> +In order to improve on the current situation, a new daemon subdivision is
> >> +proposed, and presented hereafter.
> >> +
> >> +.. digraph:: "new-daemons-structure"
> >> +
> >> +  {rank=same; ConfDR LuxiD;}
> >> +  node [shape=box]
> >> +  RapiD [label="RapiD [M]"]
> >> +  LuxiD [label="LuxiD [M]"]
> >> +  ConfDW [label="ConfDW [M]"]
> >> +  Jobs [label="Jobs [M]"]
> >> +  ConfDR [label="ConfDR [MC]"]
> >> +  MonD [label="MonD [All]"]
> >> +  NodeD [label="NodeD [VM-capable]"]
> >> +  p1 [shape=none, label=""]
> >> +  p2 [shape=none, label=""]
> >> +  p3 [shape=none, label=""]
> >> +  p4 [shape=none, label=""]
> >> +  configdata [shape=none, label="config.data"]
> >> +  locksdata [shape=none, label="locks.data"]
> >> +
> >> +  RapiD -> LuxiD [label="LUXI"]
> >> +  LuxiD -> ConfDW [label="unix\nsockets"]
> >> +  LuxiD -> Jobs [label="fork/exec"]
> >> +  Jobs -> ConfDW
> >> +  Jobs -> NodeD [label="RPC"]
> >> +  LuxiD -> NodeD [label="RPC"]
> >> +  ConfDW -> ConfDR [label="push\nconfig\ndata"]
> >> +  ConfDW -> configdata
> >> +  ConfDW -> locksdata
> >> +  MonD -> ConfDR [label="ConfD proto"]
> >> +  p1 -> MonD [label="MonD proto"]
> >> +  p2 -> RapiD [label="RAPI"]
> >> +  p3 -> LuxiD [label="gnt-*\nclients"]
> >> +  p4 -> ConfDR [label="ConfD proto"]
> >> +
> >> +``LUXI daemon (LuxiD)``
> >> +  It will be written in Haskell. It will run on the master node and it will be
> >> +  the only LUXI server, replying to all the LUXI queries. These include both
> >> +  the queries about the live configuration of the cluster, previously served by
> >> +  QueryD, and the commands actually changing the status of the cluster by
> >> +  submitting jobs. Therefore, this daemon will also be the one responsible for
> >> +  managing the job queue. When a job needs to be executed, the LuxiD will spawn
> >> +  a separate process tasked with the execution of that specific job, thus making
> >> +  it easier to terminate the job itself, if needed. When a job requires locks,
> >> +  LuxiD will request them from ConfDW.
> >> +
> >> +``Configuration management daemon (ConfDW)``
> >> +  It will run on the master node and it will be responsible for the management
> >> +  of the authoritative copy of the cluster configuration (that is, it will be
> >> +  the daemon actually modifying the ``config.data`` file). All the requests of
> >> +  configuration changes will have to pass through this daemon. Having a single
> >> +  point of configuration management will also allow Ganeti to get rid of
> >> +  possible race conditions due to concurrent modifications of the configuration.
> >> +  When the configuration is updated, it will have to push the received changes
> >> +  to the ConfDR daemons, to keep them up to date.
> >> +  This daemon will also be the one responsible for managing the locks, granting
> >> +  them to the jobs requesting them, and taking care of freeing them up if the
> >> +  jobs holding them crash or are terminated before releasing them.
> >> +  Also, it should hold a serialized list of the locks and their owners in a file
> >> +  (``locks.data``), so that it can keep track of their status in case it crashes
> >> +  and needs to be restarted.
> >> +  Interaction with this daemon will be performed using Unix sockets.
> >> +
> >> +``Configuration query daemon (ConfDR)``
> >> +  It is written in Haskell, and it corresponds to the old ConfD. It will run on
> >> +  all the master candidates and it will serve information about the static
> >> +  configuration of the cluster (the one contained in ``config.data``). The
> >> +  provided information will be highly available (as in: a response will be
> >> +  available as long as a stable-enough connection between the client and at
> >> +  least one working master candidate is available) and its freshness will be
> >> +  best effort (the most recent reply from any of the master candidates will be
> >> +  returned, but it might still be older than the one available through ConfDW).
> >> +  The information will be served through the ConfD protocol.
> >
> > This new split means that master candidates will lose the (current)
> > capability of actually responding to queries (as in gnt-* list) about
> > current cluster state.
> >
> > If this is an intended change, I would suggest documenting it as such.
> >
>
> I believe we broke this capability already in 2.8, as we split luxid
> (initially queryd, now luxid in light of this design) out of confd to
> avoid problems with the RPC certificate access being available on a
> network-accessible daemon (which was a known issue).
>
> So the status currently is:
> - 2.7 MC queries work, but only the non-rpc ones (which seems quite a
>   random set, and not a good usable functionality)
> - 2.8 MC queries are broken altogether
>
> If we want this functionality we should explicitly design for it, have
> luxid&confdW available (read only) on MCs, and use them.
> Then RAPI (also read only) would be useful too, I guess. Sorry I
> hadn't noticed that queries were supposed to work on MCs by design:
> we can definitely discuss that, but given the current stable releases
> status the breakage at least is not there.
>
> >> +``Rapi daemon (RapiD)``
> >> +  It remains basically unchanged, with the only difference that all of its LUXI
> >> +  queries are directed towards LuxiD instead of being split between MasterD and
> >> +  QueryD.
> >> +
> >> +``Monitoring daemon (MonD)``
> >> +  It remains unaffected by the changes in this design document. It will just get
> >> +  some of the data it needs from ConfDR instead of the old ConfD, but the
> >> +  interfaces of the two are identical.
> >> +
> >> +``Node daemon (NodeD)``
> >> +  It remains unaffected by the changes proposed in the design document. The only
> >> +  difference being that it will receive its RPCs from LuxiD instead of MasterD.
> >> +
> >> +This restructuring will allow us to reorganize and improve the codebase,
> >> +introducing cleaner interfaces and giving well defined and more restricted tasks
> >> +to each daemon.
> >> +
> >> +Furthermore, having more well-defined interfaces will allow us to have easier
> >> +upgrade procedures, and to work towards the possibility of upgrading single
> >> +components of a cluster one at a time, without the need for immediately
> >> +upgrading the entire cluster in a single step.
> >> +
> >> +
> >> +Implementation
> >> +==============
> >> +
> >> +While performing this refactoring, we aim to increase the amount of
> >> +Haskell code, thus benefiting from the additional type safety provided by its
> >> +wide compile-time checks. In particular, all the job queue management and the
> >> +configuration management daemon will be written in Haskell, taking over the role
> >> +currently fulfilled by Python code executed as part of MasterD.
> >> +
> >> +The changes described by this design document are quite extensive, therefore they
> >> +will not be implemented all at the same time, but through a sequence of steps,
> >> +leaving the codebase in a consistent and usable state.
> >> +
> >> +#. Rename QueryD to LuxiD.
> >> +   A part of LuxiD, the one replying to configuration
> >> +   queries including live information about the system, already exists in the
> >> +   form of QueryD. This is being renamed to LuxiD, and will form the first part
> >> +   of the new daemon. NB: this is happening in Ganeti 2.8.
> >> +
> >> +#. Let LuxiD be the interface for the queries and MasterD be their executor.
> >> +   Currently, MasterD is the only one responsible for receiving and executing LUXI
> >> +   queries, and for managing the jobs they create.
> >> +   Receiving the queries and managing the job queue will be extracted from
> >> +   MasterD into LuxiD.
> >> +   Actually executing jobs will still be done by MasterD, that contains all the
> >> +   logic for doing that and for properly managing locks and the configuration.
> >> +   MasterD still has to ask back for cancellations.
> >> +
> >> +#. Extract ConfDW from MasterD.
> >> +   The logic for managing the configuration file is factored out to the
> >> +   dedicated ConfDW daemon.
> >> +
> >> +#. Extract locking management from MasterD.
> >> +   The logic for managing and granting locks is extracted to ConfDW as well.
> >> +   This step can be executed on its own or at the same time as the previous one.
> >> +
> >> +#. Jobs are executed as processes.
> >> +   The logic for running jobs and for sending RPCs to NodeD is rewritten in
> >> +   Haskell, so that each job can be managed by an independent process.
> >
> > From just reading this design, it's not clear what happens with the LUs.
> > Will they remain written in Python? Will they be rewritten? If
> > remaining in Python, how will they interact with NodeD?
> >
>
> We can and should indeed clarify that: right now the plan is to keep
> them written in python, and execute them in processes forked by jobD.
> Interaction with NodeD would be via RPC as of today. What would change
> is the interaction with locks and the config, which of course must be
> detailed further, before proceeding.
>

I'll rewrite this part making it more clear.

>
> > I would suggest expanding this last paragraph; to an external reader,
> > it's not obvious what the planned changes are in this particular area.
> >
>
> Thanks for the feedback!!
>
> Guido
>

Thanks to both of you.

Michele

--
Google Germany GmbH
Dienerstr. 12
80331 München

Registration court and number: Hamburg, HRB 86891
Registered office: Hamburg
Managing directors: Graham Law, Christine Elizabeth Flores
