Re: [PATCH stable-2.8] Add daemon split design doc

Guido Trotter Wed, 07 Aug 2013 04:51:02 -0700

On Tue, Aug 6, 2013 at 6:29 PM, Iustin Pop <[email protected]> wrote:
> On Tue, Aug 06, 2013 at 03:56:24PM +0000, Michele Tartara wrote:
>> This describes the future planned structure of Ganeti daemons.
>>
>> Signed-off-by: Michele Tartara <[email protected]>
>> ---
>>  Makefile.am            |    1 +
>>  doc/design-daemons.rst |  236 ++++++++++++++++++++++++++++++++++++++++++++++++
>>  doc/design-draft.rst   |    1 +
>>  3 files changed, 238 insertions(+)
>>  create mode 100644 doc/design-daemons.rst
>>
>> diff --git a/Makefile.am b/Makefile.am
>> index 531197c..7714052 100644
>> --- a/Makefile.am
>> +++ b/Makefile.am
>> @@ -422,6 +422,7 @@ docinput = \
>>       doc/design-cpu-pinning.rst \
>>       doc/design-device-uuid-name.rst \
>>       doc/design-draft.rst \
>> +     doc/design-daemons.rst \
>>       doc/design-htools-2.3.rst \
>>       doc/design-http-server.rst \
>>       doc/design-impexp2.rst \
>> diff --git a/doc/design-daemons.rst b/doc/design-daemons.rst
>> new file mode 100644
>> index 0000000..e10c942
>> --- /dev/null
>> +++ b/doc/design-daemons.rst
>> @@ -0,0 +1,236 @@
>> +==========================
>> +Ganeti daemons refactoring
>> +==========================
>> +
>> +.. contents:: :depth: 2
>> +
>> +This is a design document detailing the plan for refactoring the internal
>> +structure of Ganeti, and particularly the set of daemons it is divided into.
>> +
>> +
>> +Current state and shortcomings
>> +==============================
>> +
>> +Ganeti is comprised of a growing number of daemons, each dealing with part of
>> +the tasks the cluster has to face, and communicating with the other daemons
>> +using a variety of protocols.
>> +
>> +Specifically, as of Ganeti 2.8, the situation is as follows:
>> +
>> +``Master daemon (MonD)``
>
> MonD→typo?
>
>> +  It is responsible for managing the entire cluster, and it's written in 
>> Python.
>> +  It is executed on a single node (the master node). It receives the 
>> commands
>> +  given by the cluster administrator (through the remote API daemon or the
>> +  command line tools) over the LUXI protocol.  The master daemon is 
>> responsible
>> +  for creating and managing the jobs that will execute such commands, and 
>> for
>> +  managing the locks that ensure the cluster will not incur in race 
>> conditions.
>> +
>> +  Each job is managed by a separate Python thread, that interacts with the 
>> node
>> +  daemons via RPC calls.
>> +
>> +  The master daemon is also responsible for managing the configuration of 
>> the
>> +  cluster, changing it when required by some job. It is also responsible for
>> +  copying the configuration to the other master candidates after updating 
>> it.
>> +
>> +``RAPI daemon (RapiD)``
>> +  It is written in Python and runs on the master node only. It waits for
>> +  requests issued remotely through the remote API protocol. Then, it 
>> forwards
>> +  them, using the LUXI protocol, to the master daemon (if they are 
>> commands) or
>> +  to the query daemon if they are queries about the configuration (including
>> +  live status) of the cluster.
>> +
>> +``Node daemon (NodeD)``
>> +  It is written in Python. It runs on the VM-capable nodes. It is 
>> responsible
>> +  for receiving the master requests over RPC and execute them, using the
>> +  appropriate backend (hypervisors, DRBD, LVM, etc.). It also receives 
>> requests
>> +  over RPC for the execution of queries gathering live data on behalf of the
>> +  query daemon.
>> +
>> +``Configuration daemon (ConfD)``
>> +  It is written in Haskell. It runs on all the master candidates. Since the
>> +  configuration is replicated only on the master node, this daemon exists in
>> +  order to provide information about the configuration to nodes needing 
>> them.
>> +  The requests are done through ConfD's own protocol, HMAC signed,
>> +  implemented over UDP, and meant to be used by parallely querying all the
>> +  master candidates (or a subset thereof) and getting the more up to date
>> +  answer. This is meant as a way to provide a robust service even in case 
>> master
>> +  is temporarily unavailable.
>> +
>> +``Query daemon (QueryD)``
>> +  It is written in Haskell. It runs on all the master candidates. It replies
>> +  to Luxi queries about the current status of the system, including live 
>> data it
>> +  obtains by querying the node daemons through RPCs.
>> +
>> +``Monitoring daemon (MonD)``
>> +  It is written in Haskell. It runs on all nodes, including the ones that 
>> are
>> +  not vm-capable. It is meant to provide information on the status of the
>> +  system. Such information is related only to the specific node the daemon 
>> is
>> +  running on, and it is provided as JSON encoded data over HTTP, to be 
>> easily
>> +  readable by external tools.
>> +  The monitoring daemon communicates with ConfD to get information about the
>> +  configuration of the cluster. The choice of communicating with ConfD 
>> instead
>> +  of MasterD allows it to obtain configuration information even when the 
>> cluster
>> +  is heavily degraded (e.g.: when master and some, but not all, of the 
>> master
>> +  candidates are unreachable).
>> +
>> +The current structure of the Ganeti daemons is inefficient because there are
>> +many different protocols involved, and each daemon needs to be able to use
>> +multiple ones, and has to deal with doing different things, thus making
>> +sometimes unclear which daemon is responsible for performing a specific 
>> task.
>> +
>> +Also, with the current configuration, jobs are managed by the master daemon
>> +using python threads. This makes terminating a job after it has started a
>> +difficult operation, and it is the main reason why this is not possible yet.
>> +
>> +The master daemon currently has too many different tasks, that could be 
>> handled
>> +better if split among different daemons.
>> +
>> +
>> +Proposed changes
>> +================
>> +
>> +In order to improve on the current situation, a new daemon subdivision is
>> +proposed, and presented hereafter.
>> +
>> +.. digraph:: "new-daemons-structure"
>> +
>> +  {rank=same; ConfDR LuxiD;}
>> +  node [shape=box]
>> +  RapiD [label="RapiD [M]"]
>> +  LuxiD [label="LuxiD [M]"]
>> +  ConfDW [label="ConfDW [M]"]
>> +  Jobs [label="Jobs [M]"]
>> +  ConfDR [label="ConfDR [MC]"]
>> +  MonD [label="MonD [All]"]
>> +  NodeD [label="NodeD [VM-capable]"]
>> +  p1 [shape=none, label=""]
>> +  p2 [shape=none, label=""]
>> +  p3 [shape=none, label=""]
>> +  p4 [shape=none, label=""]
>> +  configdata [shape=none, label="config.data"]
>> +  locksdata [shape=none, label="locks.data"]
>> +
>> +  RapiD -> LuxiD [label="LUXI"]
>> +  LuxiD -> ConfDW [label="unix\nsockets"]
>> +  LuxiD -> Jobs [label="fork/exec"]
>> +  Jobs -> ConfDW
>> +  Jobs -> NodeD [label="RPC"]
>> +  LuxiD -> NodeD [label="RPC"]
>> +  ConfDW -> ConfDR [label="push\nconfig\ndata"]
>> +  ConfDW -> configdata
>> +  ConfDW -> locksdata
>> +  MonD -> ConfDR [label="ConfD proto"]
>> +  p1 -> MonD [label="MonD proto"]
>> +  p2 -> RapiD [label="RAPI"]
>> +  p3 -> LuxiD [label="gnt-*\nclients"]
>> +  p4 -> ConfDR [label="ConfD proto"]
>> +
>> +``LUXI daemon (LuxiD)``
>> +  It will be written in Haskell. It will run on the master node and it will 
>> be
>> +  the only LUXI server, replying to all the LUXI queries. These include both
>> +  the queries about the live configuration of the cluster, previously 
>> served by
>> +  QueryD, and the commands actually changing the status of the cluster by
>> +  submitting jobs. Therefore, this daemon will also be the one responsible 
>> with
>> +  managing the job queue. When a job needs to be executed, the LuxiD will 
>> spawn
>> +  a separate process tasked with the execution of that specific job, thus 
>> making
>> +  it easier to terminate the job itself, if needed.  When a job requires locks,
>> +  LuxiD will request them from ConfDW.
>> +
>> +``Configuration management daemon (ConfDW)``
>> +  It will run on the master node and it will be responsible for the 
>> management
>> +  of the authoritative copy of the cluster configuration (that is, it will 
>> be
>> +  the daemon actually modifying the ``config.data`` file). All the requests 
>> of
>> +  configuration changes will have to pass through this daemon. Having a 
>> single
>> +  point of configuration management will also allow Ganeti to get rid of
>> +  possible race conditions due to concurrent modifications of the 
>> configuration.
>> +  When the configuration is updated, it will have to push the received 
>> changes
>> +  to the ConfDR daemons, to keep them up to date.
>> +  This daemon will also be the one responsible for managing the locks, 
>> granting
>> +  them to the jobs requesting them, and taking care of freeing them up if 
>> the
>> +  jobs holding them crash or are terminated before releasing them.
>> +  Also, it should hold a serialized list of the locks and their owners in a 
>> file
>> +  (``locks.data``), so that it can keep track of their status in case it 
>> crashes
>> +  and needs to be restarted.
>> +  Interaction with this daemon will be performed using Unix sockets.
>> +
>> +``Configuration query daemon (ConfDR)``
>> +  It is written in Haskell, and it corresponds to the old ConfD. It will 
>> run on
>> +  all the master candidates and it will serve information about the static
>> +  configuration of the cluster (the one contained in ``config.data``). The
>> +  provided information will be highly available (as in: a response will be
>> +  available as long as a stable-enough connection between the client and at
>> +  least one working master candidate is available) and its freshness will be
>> +  best effort (the most recent reply from any of the master candidates will 
>> be
>> +  returned, but it might still be older than the one available through 
>> ConfDW).
>> +  The information will be served through the ConfD protocol.
>
> This new split means that master candidates will lose the (current)
> capability of actually responding to queries (as in gnt-* list) about
> current cluster state.
>
> If this is an intended change, I would suggest documenting it as such.
>


I believe we broke this capability already in 2.8, as we split luxid
(initially queryd, now luxid in light of this design) out of confd to
avoid problems with the RPC certificate access being available on a
network-accessible daemon (which was a known issue).

So the current status is:
- 2.7: MC queries work, but only the non-RPC ones (which seems a fairly
  arbitrary set, and not really usable functionality)
- 2.8: MC queries are broken altogether

If we want this functionality we should explicitly design for it, have
luxid and ConfDW available (read only) on MCs, and use them.
Then RAPI (also read only) would be useful too, I guess. Sorry, I
hadn't noticed that queries were supposed to work on MCs by design:
we can definitely discuss that, but given the status of the current
stable releases, at least this design is not what introduces the breakage.
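
For reference, the best-effort freshness rule that the quoted ConfD/ConfDR
text describes (ask several master candidates in parallel, keep the most
recent answer) can be sketched roughly as below. This is only an
illustration, not Ganeti code: the Reply type, its fields, and the idea of
comparing a per-reply configuration serial number are assumptions made for
the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Reply:
    """Hypothetical answer from one master candidate (not a Ganeti type)."""
    candidate: str  # name of the master candidate that answered
    serial: int     # assumed: serial number of the config the answer is based on
    answer: str     # the payload of the answer


def freshest(replies: Iterable[Optional[Reply]]) -> Optional[Reply]:
    """Pick the reply based on the newest configuration.

    A None entry stands for a candidate that did not answer in time.
    Returns None if no candidate answered at all.
    """
    received = [r for r in replies if r is not None]
    if not received:
        return None
    return max(received, key=lambda r: r.serial)


if __name__ == "__main__":
    replies = [
        Reply("mc1", serial=41, answer="node1"),
        None,  # mc2 timed out
        Reply("mc3", serial=42, answer="node2"),
    ]
    print(freshest(replies))
```

Note that the result is only as fresh as the newest candidate that replied,
so it can still lag behind the authoritative copy on the master.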

>> +``Rapi daemon (RapiD)``
>> +  It remains basically unchanged, with the only difference that all of its 
>> LUXI
>> +  query are directed towards LuxiD instead of being split between MasterD 
>> and
>> +  QueryD.
>> +
>> +``Monitoring daemon (MonD)``
>> +  It remains unaffected by the changes in this design document. It will 
>> just get
>> +  some of the data it needs from ConfDR instead of the old ConfD, but the
>> +  interfaces of the two are identical.
>> +
>> +``Node daemon (NodeD)``
>> +  It remains unaffected by the changes proposed in the design document. The 
>> only
>> +  difference being that it will receive its RPCs from LuxiD instead of 
>> MasterD.
>> +
>> +This restructuring will allow us to reorganize and improve the codebase,
>> +introducing cleaner interfaces and giving well defined and more restricted 
>> tasks
>> +to each daemon.
>> +
>> +Furthermore, having more well-defined interfaces will allow us to have 
>> easier
>> +upgrade procedures, and to work towards the possibility of upgrading single
>> +components of a cluster one at a time, without the need for immediately
>> +upgrading the entire cluster in a single step.
>> +
>> +
>> +Implementation
>> +==============
>> +
>> +While performing this refactoring, we aim to increase the amount of
>> +Haskell code, thus benefiting from the additional type safety provided by 
>> its
>> +wide compile-time checks. In particular, all the job queue management and 
>> the
>> +configuration management daemon will be written in Haskell, taking over the 
>> role
>> +currently fulfilled by Python code executed as part of MasterD.
>> +
>> +The changes described by this design document are quite extensive, therefore they
>> +will not be implemented all at the same time, but through a sequence of steps,
>> +leaving the codebase in a consistent and usable state.
>> +
>> +#. Rename QueryD to LuxiD.
>> +   A part of LuxiD, the one replying to configuration
>> +   queries including live information about the system, already exists in 
>> the
>> +   form of QueryD. This is being renamed to LuxiD, and will form the first 
>> part
>> +   of the new daemon. NB: this is happening in Ganeti 2.8.
>> +
>> +#. Let LuxiD be the interface for the queries and MasterD be their executor.
>> +   Currently, MasterD is the only responsible for receiving and executing 
>> LUXI
>> +   queries, and for managing the jobs they create.
>> +   Receiving the queries and managing the job queue will be extracted from
>> +   MasterD into LuxiD.
>> +   Actually executing jobs will still be done by MasterD, that contains all 
>> the
>> +   logic for doing that and for properly managing locks and the 
>> configuration.
>> +   MasterD still has to ask back for cancellations.
>> +
>> +#. Extract ConfDW from MasterD.
>> +   The logic for managing the configuration file is factored out to the
>> +   dedicated ConfDW daemon.
>> +
>> +#. Extract locking management from MasterD.
>> +   The logic for managing and granting locks is extracted to ConfDW as well.
>> +   This step can be executed on its own or at the same time as the previous 
>> one.
>> +
>> +#. Jobs are executed as processes.
>> +   The logic for running jobs and for sending RPCs to NodeD is rewritten in
>> +   Haskell, so that each job can be managed by an independent process.
>
> From just reading this design, it's not clear what happens with the LUs.
> Will they remain written in Python? Will they be rewritten? If
> remainining in Python, how will they interact with NodeD?
>

We can and should indeed clarify that: right now the plan is to keep
them written in Python, and execute them in processes forked by jobD.
Interaction with NodeD would be via RPC, as it is today. What would change
is the interaction with locks and the config, which of course must be
detailed further before proceeding.
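
To make the process-per-job idea a bit more concrete, here is a minimal
sketch of why a separate process is easier to cancel than a Python thread.
It is not the planned implementation: start_job and cancel_job are
hypothetical names, and the job body is just a Python snippet run in a
child interpreter.

```python
import subprocess
import sys


def start_job(code: str) -> subprocess.Popen:
    """Run a (hypothetical) job body in its own Python process."""
    return subprocess.Popen([sys.executable, "-c", code])


def cancel_job(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Ask the job process to stop, and kill it if it does not exit in time.

    Returns the exit status of the process.
    """
    proc.terminate()  # SIGTERM on POSIX systems
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # forceful fallback
        return proc.wait()


if __name__ == "__main__":
    job = start_job("import time; time.sleep(60)")
    status = cancel_job(job)
    print("job exited with status", status)
```

A thread inside MasterD has no equivalent of terminate(); with one process
per job the parent can stop the job from outside. Releasing whatever locks
the job held is the part that still needs to be designed.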

> I would suggest expanding this last paragraph; to an external reader,
> it's not obvious what the planned changes are in this particular area.
>

Thanks for the feedback!!

Guido
