On Wed, Aug 7, 2013 at 1:50 PM, Guido Trotter <[email protected]> wrote:

> On Tue, Aug 6, 2013 at 6:29 PM, Iustin Pop <[email protected]> wrote:
> > On Tue, Aug 06, 2013 at 03:56:24PM +0000, Michele Tartara wrote:
> >> This describes the future planned structure of Ganeti daemons.
> >>
> >> Signed-off-by: Michele Tartara <[email protected]>
> >> ---
> >>  Makefile.am            |    1 +
> >>  doc/design-daemons.rst |  236
> ++++++++++++++++++++++++++++++++++++++++++++++++
> >>  doc/design-draft.rst   |    1 +
> >>  3 files changed, 238 insertions(+)
> >>  create mode 100644 doc/design-daemons.rst
> >>
> >> diff --git a/Makefile.am b/Makefile.am
> >> index 531197c..7714052 100644
> >> --- a/Makefile.am
> >> +++ b/Makefile.am
> >> @@ -422,6 +422,7 @@ docinput = \
> >>       doc/design-cpu-pinning.rst \
> >>       doc/design-device-uuid-name.rst \
> >>       doc/design-draft.rst \
> >> +     doc/design-daemons.rst \
> >>       doc/design-htools-2.3.rst \
> >>       doc/design-http-server.rst \
> >>       doc/design-impexp2.rst \
> >> diff --git a/doc/design-daemons.rst b/doc/design-daemons.rst
> >> new file mode 100644
> >> index 0000000..e10c942
> >> --- /dev/null
> >> +++ b/doc/design-daemons.rst
> >> @@ -0,0 +1,236 @@
> >> +==========================
> >> +Ganeti daemons refactoring
> >> +==========================
> >> +
> >> +.. contents:: :depth: 2
> >> +
> >> +This is a design document detailing the plan for refactoring the
> internal
> >> +structure of Ganeti, and particularly the set of daemons it is divided
> into.
> >> +
> >> +
> >> +Current state and shortcomings
> >> +==============================
> >> +
> >> +Ganeti is comprised of a growing number of daemons, each dealing with
> part of
> >> +the tasks the cluster has to face, and communicating with the other
> daemons
> >> +using a variety of protocols.
> >> +
> >> +Specifically, as of Ganeti 2.8, the situation is as follows:
> >> +
> >> +``Master daemon (MonD)``
> >
> > MonD→typo?
> >
>

Yes, of course.


> >> +  It is responsible for managing the entire cluster, and it's written
> in Python.
> >> +  It is executed on a single node (the master node). It receives the
> commands
> >> +  given by the cluster administrator (through the remote API daemon or
> the
> >> +  command line tools) over the LUXI protocol.  The master daemon is
> responsible
> >> +  for creating and managing the jobs that will execute such commands,
> and for
> >> +  managing the locks that ensure the cluster will not run into race
> conditions.
> >> +
> >> +  Each job is managed by a separate Python thread, that interacts with
> the node
> >> +  daemons via RPC calls.
> >> +
> >> +  The master daemon is also responsible for managing the configuration
> of the
> >> +  cluster, changing it when required by some job. It is also
> responsible for
> >> +  copying the configuration to the other master candidates after
> updating it.
> >> +
> >> +``RAPI daemon (RapiD)``
> >> +  It is written in Python and runs on the master node only. It waits
> for
> >> +  requests issued remotely through the remote API protocol. Then, it
> forwards
> >> +  them, using the LUXI protocol, to the master daemon (if they are
> commands) or
> >> +  to the query daemon if they are queries about the configuration
> (including
> >> +  live status) of the cluster.
> >> +
> >> +``Node daemon (NodeD)``
> >> +  It is written in Python. It runs on the VM-capable nodes. It is
> responsible
> >> +  for receiving the master requests over RPC and executing them, using
> the
> >> +  appropriate backend (hypervisors, DRBD, LVM, etc.). It also receives
> requests
> >> +  over RPC for the execution of queries gathering live data on behalf
> of the
> >> +  query daemon.
> >> +
> >> +``Configuration daemon (ConfD)``
> >> +  It is written in Haskell. It runs on all the master candidates.
> Since the
> >> +  configuration is replicated only on the master node, this daemon
> exists in
> >> +  order to provide information about the configuration to nodes
> needing them.
> >> +  The requests are done through ConfD's own protocol, HMAC signed,
> >> +  implemented over UDP, and meant to be used by querying in parallel all
> the
> >> +  master candidates (or a subset thereof) and getting the most up to
> date
> >> +  answer. This is meant as a way to provide a robust service even in
> case master
> >> +  is temporarily unavailable.
> >> +
> >> +``Query daemon (QueryD)``
> >> +  It is written in Haskell. It runs on all the master candidates. It
> replies
> >> +  to Luxi queries about the current status of the system, including
> live data it
> >> +  obtains by querying the node daemons through RPCs.
> >> +
> >> +``Monitoring daemon (MonD)``
> >> +  It is written in Haskell. It runs on all nodes, including the ones
> that are
> >> +  not vm-capable. It is meant to provide information on the status of
> the
> >> +  system. Such information is related only to the specific node the
> daemon is
> >> +  running on, and it is provided as JSON encoded data over HTTP, to be
> easily
> >> +  readable by external tools.
> >> +  The monitoring daemon communicates with ConfD to get information
> about the
> >> +  configuration of the cluster. The choice of communicating with ConfD
> instead
> >> +  of MasterD allows it to obtain configuration information even when
> the cluster
> >> +  is heavily degraded (e.g.: when master and some, but not all, of the
> master
> >> +  candidates are unreachable).
> >> +
> >> +The current structure of the Ganeti daemons is inefficient because
> there are
> >> +many different protocols involved, and each daemon needs to be able to
> use
> >> +multiple ones, and has to deal with doing different things, thus making
> >> +it sometimes unclear which daemon is responsible for performing a
> specific task.
> >> +
> >> +Also, with the current configuration, jobs are managed by the master
> daemon
> >> +using python threads. This makes terminating a job after it has
> started a
> >> +difficult operation, and it is the main reason why this is not
> possible yet.
> >> +
> >> +The master daemon currently has too many different tasks, that could
> be handled
> >> +better if split among different daemons.
> >> +
> >> +
> >> +Proposed changes
> >> +================
> >> +
> >> +In order to improve on the current situation, a new daemon subdivision
> is
> >> +proposed, and presented hereafter.
> >> +
> >> +.. digraph:: "new-daemons-structure"
> >> +
> >> +  {rank=same; ConfDR LuxiD;}
> >> +  node [shape=box]
> >> +  RapiD [label="RapiD [M]"]
> >> +  LuxiD [label="LuxiD [M]"]
> >> +  ConfDW [label="ConfDW [M]"]
> >> +  Jobs [label="Jobs [M]"]
> >> +  ConfDR [label="ConfDR [MC]"]
> >> +  MonD [label="MonD [All]"]
> >> +  NodeD [label="NodeD [VM-capable]"]
> >> +  p1 [shape=none, label=""]
> >> +  p2 [shape=none, label=""]
> >> +  p3 [shape=none, label=""]
> >> +  p4 [shape=none, label=""]
> >> +  configdata [shape=none, label="config.data"]
> >> +  locksdata [shape=none, label="locks.data"]
> >> +
> >> +  RapiD -> LuxiD [label="LUXI"]
> >> +  LuxiD -> ConfDW [label="unix\nsockets"]
> >> +  LuxiD -> Jobs [label="fork/exec"]
> >> +  Jobs -> ConfDW
> >> +  Jobs -> NodeD [label="RPC"]
> >> +  LuxiD -> NodeD [label="RPC"]
> >> +  ConfDW -> ConfDR [label="push\nconfig\ndata"]
> >> +  ConfDW -> configdata
> >> +  ConfDW -> locksdata
> >> +  MonD -> ConfDR [label="ConfD proto"]
> >> +  p1 -> MonD [label="MonD proto"]
> >> +  p2 -> RapiD [label="RAPI"]
> >> +  p3 -> LuxiD [label="gnt-*\nclients"]
> >> +  p4 -> ConfDR [label="ConfD proto"]
> >> +
> >> +``LUXI daemon (LuxiD)``
> >> +  It will be written in Haskell. It will run on the master node and it
> will be
> >> +  the only LUXI server, replying to all the LUXI queries. These
> include both
> >> +  the queries about the live configuration of the cluster, previously
> served by
> >> +  QueryD, and the commands actually changing the status of the cluster
> by
> >> +  submitting jobs. Therefore, this daemon will also be the one
> responsible for
> >> +  managing the job queue. When a job needs to be executed, the LuxiD
> will spawn
> >> +  a separate process tasked with the execution of that specific job,
> thus making
> >> +  it easier to terminate the job itself, if needed.  When a job
> requires locks,
> >> +  LuxiD will request them from ConfDW.
> >> +
> >> +``Configuration management daemon (ConfDW)``
> >> +  It will run on the master node and it will be responsible for the
> management
> >> +  of the authoritative copy of the cluster configuration (that is, it
> will be
> >> +  the daemon actually modifying the ``config.data`` file). All the
> requests of
> >> +  configuration changes will have to pass through this daemon. Having
> a single
> >> +  point of configuration management will also allow Ganeti to get rid
> of
> >> +  possible race conditions due to concurrent modifications of the
> configuration.
> >> +  When the configuration is updated, it will have to push the received
> changes
> >> +  to the ConfDR daemons, to keep them up to date.
> >> +  This daemon will also be the one responsible for managing the locks,
> granting
> >> +  them to the jobs requesting them, and taking care of freeing them up
> if the
> >> +  jobs holding them crash or are terminated before releasing them.
> >> +  Also, it should hold a serialized list of the locks and their owners
> in a file
> >> +  (``locks.data``), so that it can keep track of their status in case
> it crashes
> >> +  and needs to be restarted.
> >> +  Interaction with this daemon will be performed using Unix sockets.
> >> +
> >> +``Configuration query daemon (ConfDR)``
> >> +  It is written in Haskell, and it corresponds to the old ConfD. It
> will run on
> >> +  all the master candidates and it will serve information about the
> static
> >> +  configuration of the cluster (the one contained in ``config.data``).
> The
> >> +  provided information will be highly available (as in: a response
> will be
> >> +  available as long as a stable-enough connection between the client
> and at
> >> +  least one working master candidate is available) and its freshness
> will be
> >> +  best effort (the most recent reply from any of the master candidates
> will be
> >> +  returned, but it might still be older than the one available through
> ConfDW).
> >> +  The information will be served through the ConfD protocol.
> >
> > This new split means that master candidates will lose the (current)
> > capability of actually responding to queries (as in gnt-* list) about
> > current cluster state.
> >
> > If this is an intended change, I would suggest documenting it as such.
> >
>
> I believe we broke this capability already in 2.8, as we split luxid
> (initially queryd, now luxid in light of this design) out of confd to
> avoid problems with the RPC certificate access being available on a
> network-accessible daemon (which was a known issue).
>
> So the status currently is:
> - 2.7 MC queries work, but only the non-rpc ones (which seems quite a
> random set, and not a good useable functionality)
> - 2.8 MC queries are broken altogether
>
> If we want this functionality we should explicitly design for it, have
> luxid&confdW available (read only) on MCs, and use them.
> Then RAPI (also read only) would be useful too, I guess. Sorry I
> hadn't noticed that queries were supposed to work on MCs by design:
> we can definitely discuss that, but given the current stable releases
> status the breakage at least is not there.


> >> +``Rapi daemon (RapiD)``
> >> +  It remains basically unchanged, with the only difference that all of
> its LUXI
> >> +  queries are directed towards LuxiD instead of being split between
> MasterD and
> >> +  QueryD.
> >> +
> >> +``Monitoring daemon (MonD)``
> >> +  It remains unaffected by the changes in this design document. It
> will just get
> >> +  some of the data it needs from ConfDR instead of the old ConfD, but
> the
> >> +  interfaces of the two are identical.
> >> +
> >> +``Node daemon (NodeD)``
> >> +  It remains unaffected by the changes proposed in the design
> document. The only
> >> +  difference being that it will receive its RPCs from LuxiD instead of
> MasterD.
> >> +
> >> +This restructuring will allow us to reorganize and improve the
> codebase,
> >> +introducing cleaner interfaces and giving well defined and more
> restricted tasks
> >> +to each daemon.
> >> +
> >> +Furthermore, having more well-defined interfaces will allow us to have
> easier
> >> +upgrade procedures, and to work towards the possibility of upgrading
> single
> >> +components of a cluster one at a time, without the need for immediately
> >> +upgrading the entire cluster in a single step.
> >> +
> >> +
> >> +Implementation
> >> +==============
> >> +
> >> +While performing this refactoring, we aim to increase the amount of
> >> +Haskell code, thus benefiting from the additional type safety provided
> by its
> >> +wide compile-time checks. In particular, all the job queue management
> and the
> >> +configuration management daemon will be written in Haskell, taking
> over the role
> >> +currently fulfilled by Python code executed as part of MasterD.
> >> +
> >> +The changes described by this design document are quite extensive,
> therefore they
> >> +will not be implemented all at the same time, but through a sequence
> of steps,
> >> +leaving the codebase in a consistent and usable state.
> >> +
> >> +#. Rename QueryD to LuxiD.
> >> +   A part of LuxiD, the one replying to configuration
> >> +   queries including live information about the system, already exists
> in the
> >> +   form of QueryD. This is being renamed to LuxiD, and will form the
> first part
> >> +   of the new daemon. NB: this is happening in Ganeti 2.8.
> >> +
> >> +#. Let LuxiD be the interface for the queries and MasterD be their
> executor.
> >> +   Currently, MasterD is solely responsible for receiving and
> executing LUXI
> >> +   queries, and for managing the jobs they create.
> >> +   Receiving the queries and managing the job queue will be extracted
> from
> >> +   MasterD into LuxiD.
> >> +   Actually executing jobs will still be done by MasterD, that
> contains all the
> >> +   logic for doing that and for properly managing locks and the
> configuration.
> >> +   MasterD still has to ask back for cancellations.
> >> +
> >> +#. Extract ConfDW from MasterD.
> >> +   The logic for managing the configuration file is factored out to the
> >> +   dedicated ConfDW daemon.
> >> +
> >> +#. Extract locking management from MasterD.
> >> +   The logic for managing and granting locks is extracted to ConfDW as
> well.
> >> +   This step can be executed on its own or at the same time as the
> previous one.
> >> +
> >> +#. Jobs are executed as processes.
> >> +   The logic for running jobs and for sending RPCs to NodeD is
> rewritten in
> >> +   Haskell, so that each job can be managed by an independent process.
> >
> > From just reading this design, it's not clear what happens with the LUs.
> > Will they remain written in Python? Will they be rewritten? If
> > remaining in Python, how will they interact with NodeD?
> >
>
> We can and should indeed clarify that: right now the plan is to keep
> them written in python, and execute them in processes forked by jobD.
> Interaction with NodeD would be via RPC as of today. What would change
> is the interaction with locks and the config, which of course must be
> detailed further, before proceeding.
>

I'll rewrite this part making it more clear.


>
> > I would suggest expanding this last paragraph; to an external reader,
> > it's not obvious what the planned changes are in this particular area.
> >
>
> Thanks for the feedback!!
>
> Guido
>

Thanks to both of you.

Michele

-- 
Google Germany GmbH
Dienerstr. 12
80331 München

Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Graham Law, Christine Elizabeth Flores
