On Wed, Aug 7, 2013 at 1:56 PM, Guido Trotter <[email protected]> wrote:
> On Wed, Aug 7, 2013 at 9:36 AM, Thomas Thrainer <[email protected]> wrote:
> >
> > On Tue, Aug 6, 2013 at 5:56 PM, Michele Tartara <[email protected]> wrote:
> >>
> >> This describes the future planned structure of Ganeti daemons.
> >>
> >> Signed-off-by: Michele Tartara <[email protected]>
> >> ---
> >>  Makefile.am            |   1 +
> >>  doc/design-daemons.rst | 236 ++++++++++++++++++++++++++++++++++++++++++++++++
> >>  doc/design-draft.rst   |   1 +
> >>  3 files changed, 238 insertions(+)
> >>  create mode 100644 doc/design-daemons.rst
> >>
> >> diff --git a/Makefile.am b/Makefile.am
> >> index 531197c..7714052 100644
> >> --- a/Makefile.am
> >> +++ b/Makefile.am
> >> @@ -422,6 +422,7 @@ docinput = \
> >>   doc/design-cpu-pinning.rst \
> >>   doc/design-device-uuid-name.rst \
> >>   doc/design-draft.rst \
> >> + doc/design-daemons.rst \
> >>   doc/design-htools-2.3.rst \
> >>   doc/design-http-server.rst \
> >>   doc/design-impexp2.rst \
> >> diff --git a/doc/design-daemons.rst b/doc/design-daemons.rst
> >> new file mode 100644
> >> index 0000000..e10c942
> >> --- /dev/null
> >> +++ b/doc/design-daemons.rst
> >> @@ -0,0 +1,236 @@
> >> +==========================
> >> +Ganeti daemons refactoring
> >> +==========================
> >> +
> >> +.. contents:: :depth: 2
> >> +
> >> +This is a design document detailing the plan for refactoring the internal
> >> +structure of Ganeti, and particularly the set of daemons it is divided into.
> >> +
> >> +
> >> +Current state and shortcomings
> >> +==============================
> >> +
> >> +Ganeti is comprised of a growing number of daemons, each dealing with part of
> >> +the tasks the cluster has to face, and communicating with the other daemons
> >> +using a variety of protocol.
> >
> > s/protocol/protocols/

Ok.

> >>
> >> +
> >> +Specifically, as of Ganeti 2.8, the situation is as follows:
> >> +
> >> +``Master daemon (MonD)``
> >> +  It is responsible for managing the entire cluster, and it's written in Python.
> >> +  It is executed on a single node (the master node). It receives the commands
> >> +  given by the cluster administrator (through the remote API daemon or the
> >> +  command line tools) over the LUXI protocol. The master daemon is responsible
> >> +  for creating and managing the jobs that will execute such commands, and for
> >> +  managing the locks that ensure the cluster will not incur in race conditions.
> >> +
> >> +  Each job is managed by a separate Python thread, that interacts with the node
> >> +  daemons via RPC calls.
> >> +
> >> +  The master daemon is also responsible for managing the configuration of the
> >> +  cluster, changing it when required by some job. It is also responsible for
> >> +  copying the configuration to the other master candidates after updating it.
> >> +
> >> +``RAPI daemon (RapiD)``
> >> +  It is written in Python and runs on the master node only. It waits for
> >> +  requests issued remotely through the remote API protocol. Then, it forwards
> >> +  them, using the LUXI protocol, to the master daemon (if they are commands) or
> >> +  to the query daemon if they are queries about the configuration (including
> >> +  live status) of the cluster.
> >> +
> >> +``Node daemon (NodeD)``
> >> +  It is written in Python. It runs on the VM-capable nodes. It is responsible
> >> +  for receiving the master requests over RPC and execute them, using the
> >> +  appropriate backend (hypervisors, DRBD, LVM, etc.). It also receives requests
> >> +  over RPC for the execution of queries gathering live data on behalf of the
> >> +  query daemon.
> >> +
> >> +``Configuration daemon (ConfD)``
> >> +  It is written in Haskell. It runs on all the master candidates.
> >> +  Since the
> >> +  configuration is replicated only on the master node, this daemon exists in
> >> +  order to provide information about the configuration to nodes needing them.
> >> +  The requests are done through ConfD's own protocol, HMAC signed,
> >> +  implemented over UDP, and meant to be used by parallely querying all the
> >> +  master candidates (or a subset thereof) and getting the more up to date
> >
> > s/more/most/

Ok.

> >>
> >> +  answer. This is meant as a way to provide a robust service even in case master
> >> +  is temporarily unavailable.
> >> +
> >> +``Query daemon (QueryD)``
> >
> > It's actually called LUXI daemon, even in 2.8.

It was called QueryD for a really short time, and it was renamed LuxiD
after the meeting where the core of this design doc was discussed. So,
even if it is, as a matter of fact, already called LuxiD, it was
logically meant to be QueryD. We brought the renaming forward just to
make the following merges easier. And, if you have a look at the
"Implementation" section, you'll notice that it already specifies that
the renaming to LuxiD is performed in 2.8.

> >>
> >> +  It is written in Haskell. It runs on all the master candidates. It replies
> >> +  to Luxi queries about the current status of the system, including live data it
> >> +  obtains by querying the node daemons through RPCs.
> >> +
> >> +``Monitoring daemon (MonD)``
> >> +  It is written in Haskell. It runs on all nodes, including the ones that are
> >> +  not vm-capable. It is meant to provide information on the status of the
> >> +  system. Such information is related only to the specific node the daemon is
> >> +  running on, and it is provided as JSON encoded data over HTTP, to be easily
> >> +  readable by external tools.
> >> +  The monitoring daemon communicates with ConfD to get information about the
> >> +  configuration of the cluster.
> >> +  The choice of communicating with ConfD instead
> >> +  of MasterD allows it to obtain configuration information even when the cluster
> >> +  is heavily degraded (e.g.: when master and some, but not all, of the master
> >> +  candidates are unreachable).
> >> +
> >> +The current structure of the Ganeti daemons is inefficient because there are
> >> +many different protocols involved, and each daemon needs to be able to use
> >> +multiple ones, and has to deal with doing different things, thus making
> >> +sometimes unclear which daemon is responsible for performing a specific task.
> >> +
> >> +Also, with the current configuration, jobs are managed by the master daemon
> >> +using python threads. This makes terminating a job after it has started a
> >> +difficult operation, and it is the main reason why this is not possible yet.
> >> +
> >> +The master daemon currently has too many different tasks, that could be handled
> >> +better if split among different daemons.
> >> +
> >> +
> >> +Proposed changes
> >> +================
> >> +
> >> +In order to improve on the current situation, a new daemon subdivision is
> >> +proposed, and presented hereafter.
> >> +
> >> +.. digraph:: "new-daemons-structure"
> >> +
> >> +  {rank=same; ConfDR LuxiD;}
> >> +  node [shape=box]
> >> +  RapiD [label="RapiD [M]"]
> >> +  LuxiD [label="LuxiD [M]"]
> >> +  ConfDW [label="ConfDW [M]"]
> >> +  Jobs [label="Jobs [M]"]
> >> +  ConfDR [label="ConfDR [MC]"]
> >> +  MonD [label="MonD [All]"]
> >> +  NodeD [label="NodeD [VM-capable]"]
> >> +  p1 [shape=none, label=""]
> >> +  p2 [shape=none, label=""]
> >> +  p3 [shape=none, label=""]
> >> +  p4 [shape=none, label=""]
> >> +  configdata [shape=none, label="config.data"]
> >> +  locksdata [shape=none, label="locks.data"]
> >> +
> >> +  RapiD -> LuxiD [label="LUXI"]
> >> +  LuxiD -> ConfDW [label="unix\nsockets"]
> >> +  LuxiD -> Jobs [label="fork/exec"]
> >> +  Jobs -> ConfDW
> >> +  Jobs -> NodeD [label="RPC"]
> >> +  LuxiD -> NodeD [label="RPC"]
> >> +  ConfDW -> ConfDR [label="push\nconfig\ndata"]
> >> +  ConfDW -> configdata
> >> +  ConfDW -> locksdata
> >> +  MonD -> ConfDR [label="ConfD proto"]
> >> +  p1 -> MonD [label="MonD proto"]
> >> +  p2 -> RapiD [label="RAPI"]
> >> +  p3 -> LuxiD [label="gnt-*\nclients"]
> >> +  p4 -> ConfDR [label="ConfD proto"]
> >> +
> >> +``LUXI daemon (LuxiD)``
> >> +  It will be written in Haskell. It will run on the master node and it will be
> >> +  the only LUXI server, replying to all the LUXI queries. These includes both
> >> +  the queries about the live configuration of the cluster, previously served by
> >> +  QueryD, and the commands actually changing the status of the cluster by
> >
> >
> > QueryD did never exist, the history was rewritten ;-).

Commit 670e954ab53c9ea1ed5dbf94822e7d345aca2c8d, "Add queryd daemon
(split from confd)", doesn't agree with this sentence. ;-)
It was just renamed to LuxiD shortly afterwards as a consequence of
this not-yet-written design doc. Yes, that's a serious case of temporal
paradox, but not of history rewriting, so I guess the QueryD name can
(and actually should) remain.

> >>
> >> +  submitting jobs.
> >> +  Therefore, this daemon will also be the one responsible with
> >> +  managing the job queue. When a job needs to be executed, the LuxiD will spawn
> >> +  a separate process tasked with the execution of that specific job, thus making
> >> +  it easier to terminate the job itself, if needeed. When a job requires locks,
> >> +  LuxiD will request them to ConfDW
> >
> >
> > s/to/from/

Ok.

> >
> >> +
> >> +``Configuration management daemon (ConfDW)``
> >> +  It will run on the master node and it will be responsible for the management
> >> +  of the authoritative copy of the cluster configuration (that is, it will be
> >> +  the daemon actually modifying the ``config.data`` file). All the requests of
> >> +  configuration changes will have to pass through this daemon. Having a single
> >> +  point of configuration management will also allow Ganeti to get rid of
> >> +  possible race conditions due to concurrent modifications of the configuration.
> >> +  When the configuration is updated, it will have to push the received changes
> >> +  to the ConfDR daemons, to keep them up to date.
> >> +  This daemon will also be the one responsible for managing the locks, granting
> >> +  them to the jobs requesting them, and taking care of freeing them up if the
> >> +  jobs holding them crash or are terminated before releasing them.
> >
> >
> > How?
> >
>
> To be detailed. (in this or a separate design, to keep just the split
> simpler).
> (I believe it should be detailed, but as long as we don't think it's
> impossible we can defer the detailing and point from here to a second
> design: of course we should have that design too, before
> implementing).
>

I guess checking for the existence of a process with the PID of the
lock holder should be enough. I know PIDs are not guaranteed to be
unique, but I think they are unique enough for this not to be a problem.
And if we really think this is going to be a problem, we can also check
the actual program command line via /proc. I'll add a bit more detail
here, but not too much, as I think this should be expanded with the
chosen solution when performing the actual split, as it seems to me it
is slightly more than an implementation detail.

>
> >>
> >> +  Also, it should hold a serialized list of the locks and their owners in a file
> >> +  (``locks.data``), so that it can keep track of their status in case it crashes
> >> +  and needs to be restarted.
> >> +  Interaction with this daemon will be performed using Unix sockets.
> >> +
> >> +``Configuration query daemon (ConfDR)``
> >> +  It is written in Haskell, and it corresponds to the old ConfD. It will run on
> >> +  all the master candidates and it will serve information about the the static
> >> +  configuration of the cluster (the one contained in ``config.data``). The
> >> +  provided information will be highly available (as in: a response will be
> >> +  available as long as a stable-enough connection between the client and at
> >> +  least one working master candidate is available) and its freshness will be
> >> +  best effort (the most recent reply from any of the master candidates will be
> >> +  returned, but it might still be older than the one available through ConfDW).
> >> +  The information will be served through the ConfD protocol.
> >> +
> >> +``Rapi daemon (RapiD)``
> >> +  It remains basically unchanged, with the only difference that all of its LUXI
> >> +  query are directed towards LuxiD instead of being split between MasterD and
> >> +  QueryD.
> >> +
> >> +``Monitoring daemon (MonD)``
> >> +  It remains unaffected by the changes in this design document. It will just get
> >> +  some of the data it needs from ConfDR instead of the old ConfD, but the
> >> +  interfaces of the two are identical.
> >> +
> >> +``Node daemon (NodeD)``
> >> +  It remains unaffected by the changes proposed in the design document. The only
> >> +  difference being that it will receive its RPCs from LuxiD instead of MasterD.
> >
> >
> > The jobs remain in Python, and the processes will probably still be called
> > 'master-job' or something, right? I'm not sure if LuxiD actually ever has
> > the need to issue RPC's, I thought not.
> >
>
> Probably not for jobs. It might still execute some for things like
> propagating the config, and definitely as it does today for executing
> queries.
>

Given that LuxiD is replying to queries including live data, sometimes
it will need to get information from the noded, thus performing RPCs.
I'll add this reason to the design document.

> >>
> >> +
> >> +This restructuring will allow us to reorganize and improve the codebase,
> >> +introducing cleaner interfaces and giving well defined and more restricted tasks
> >> +to each daemon.
> >> +
> >> +Furthermore, having more well-defined interfaces will allow us to have easier
> >> +upgrade procedures, and to work towards the possibility of upgrading single
> >> +components of a cluster one at a time, without the need for immediately
> >> +upgrading the entire cluster in a single step.
> >> +
> >> +
> >> +Implementation
> >> +==============
> >> +
> >> +While performing this refactoring, we aim to increase the amount of
> >> +Haskell code, thus benefiting from the additional type safety provided by its
> >> +wide compile-time checks. In particular, all the job queue management and the
> >> +configuration management daemon will be written in Haskell, taking over the role
> >> +currently fulfilled by Python code executed as part of MasterD.
> >> +
> >> +The changes describe by this design document are quite extensive, therefore they
> >> +awill not be implemented all at the same time, but through a sequence of steps,
> >
> >
> > s/awill/will/

Ok.

> >
> >>
> >> +leaving the codebase in a consistent and usable state.
> >> +
> >> +#. Rename QueryD to LuxiD.
> >
> >
> > Already done. QueryD existed only for a day or so and is probably not worth
> > mentioning.

If I recall correctly, the review of the patch introducing the renaming
was LGTMed (I think by Iustin) after the promise of a design doc
explaining the reason for that. This is such a design doc, so I think
it should stay here.

> >
> >>
> >> +   A part of LuxiD, the one replying to configuration
> >> +   queries including live information about the system, already exists in the
> >> +   form of QueryD. This is being renamed to LuxiD, and will form the first part
> >> +   of the new daemon. NB: this is happening in Ganeti 2.8.
> >> +
> >> +#. Let LuxiD be the interface for the queries and MasterD be their executor.
> >> +   Currently, MasterD is the only responsible for receiving and executing LUXI
> >> +   queries, and for managing the jobs they create.
> >> +   Receiving the queries and managing the job queue will be extracted from
> >> +   MasterD into LuxiD.
> >> +   Actually executing jobs will still be done by MasterD, that contains all the
> >> +   logic for doing that and for properly managing locks and the configuration.
> >> +   MasterD still has to ask back for cancellations.
> >
> >
> > What does "ask back for cancellation" mean?

It is taken verbatim from the picture of the whiteboard where we
sketched this design :-)
I think it was meant as follows: given that masterd will be the one
executing the jobs, but luxid will be, at this point, managing the job
queue, masterd will have to check whether a job is being cancelled
after it has been dispatched to it (for example, before executing each
opcode that is part of the job itself, or something similar).
Unless we decide that jobs are still not cancellable until they are
actually run as independent processes, which will happen as the last
step of the transformation described by this document. In that case, I
can just remove this sentence and postpone everything (which is
probably going to be much simpler).

> >
> > So, the job queue component in MasterD will be deleted, right? Instead,
> > would MasterD execute every job received over LUXI directly, and LuxiD makes
> > sure that it's not too much? How do we deal with jobs which can't be
> > executed due to locks? Should LuxiD be able to hand a job to MasterD with
> > some kind of "no wait" option, which fails the job immediately if locks are
> > missing? How to avoid starvation in such a case?
> >
>
> The long term proposal I believe is to have no masterd, but just to
> fork LUs off luxid.
> This might not be the case in the first implementation though, so we
> may keep a stripped down masterd for a version or two.
>
> >>
> >> +
> >> +#. Extract ConfDW from MasterD.
> >> +   The logic for managing the configuration file is factored out to the
> >> +   dedicated ConfDW daemon.
> >> +
> >> +#. Extract locking management from MasterD.
> >> +   The logic for managing and granting locks is extracted to ConfDW as well.
> >> +   This step can be executed on its own or at the same time as the previous one.
> >>
> >> +
> >> +#. Jobs are executed as processes.
> >> +   The logic for running jobs and for sending RPCs to NodeD is rewritten in
> >> +   Haskell, so that each job can be managed by an independent process.
> >
> >
> > I don't think that we want to rewrite jobs in Haskell. Instead, MasterD
> > would be changed in such a way that all threading code is removed, and
> > basically only LU's remain in there. Locking, config handling, etc. is at
> > this point already performed in other daemons.

Yes, actually, this was my mistake while writing the document. LUs will
remain in Python for the foreseeable future. I'll fix this in the doc.

> >
> > Things like cluster init, bootstrapping, etc. might be interesting.
> >
>
> They already are today. :)
>
> Thanks,
>
> Guido
>

Thanks,
Michele

-- 
Google Germany GmbH
Dienerstr. 12
80331 München

Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Graham Law, Christine Elizabeth Flores
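
P.S. To make the PID-based check for locks held by dead jobs a bit more
concrete, here is a rough Python sketch. It is only an illustration and
rests on several assumptions: the function names are hypothetical, the
in-memory mapping of lock name to owner PID merely stands in for whatever
``locks.data`` will actually contain, and the optional command line
comparison assumes a Linux-style /proc. It is POSIX only (signal 0 is used
purely as an existence probe) and it is not existing Ganeti code.

```python
import os


def pid_exists(pid):
    """Return True if some process currently has this PID (POSIX only)."""
    if pid <= 0:
        # 0 and negative values address process groups, not one process.
        return False
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is sent
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def read_cmdline(pid):
    """Return the argv of a process as a list, or None if unavailable.

    Assumes a Linux-style /proc/<pid>/cmdline (NUL-separated arguments).
    """
    try:
        with open("/proc/%d/cmdline" % pid, "rb") as cmdline_file:
            raw = cmdline_file.read()
    except OSError:
        return None
    return [arg.decode(errors="replace") for arg in raw.split(b"\0") if arg]


def owner_alive(pid, marker=None):
    """Decide whether the lock owner is still running.

    'marker' is a hypothetical substring expected in the job's command
    line; when given, it guards against the PID having been reused by an
    unrelated process.
    """
    if not pid_exists(pid):
        return False
    if marker is None:
        return True
    argv = read_cmdline(pid)
    return argv is not None and any(marker in arg for arg in argv)


def stale_locks(lock_owners, marker=None):
    """Return the sorted names of locks whose owner process is gone.

    'lock_owners' maps lock name -> owner PID (illustrative layout only).
    """
    return sorted(name for name, pid in lock_owners.items()
                  if not owner_alive(pid, marker))
```

The lock manager could run something like ``stale_locks`` periodically, or
whenever a lock request has to wait, and release whatever it returns. The
PID-only variant has the reuse weakness mentioned above; passing a marker
narrows it at the cost of depending on /proc.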
