Re: [PATCH stable-2.8] Add daemon split design doc

Michele Tartara Wed, 21 Aug 2013 07:55:03 -0700

On Wed, Aug 21, 2013 at 3:46 PM, Guido Trotter <[email protected]> wrote:


> On Tue, Aug 20, 2013 at 7:08 PM, Michele Tartara <[email protected]>
> wrote:
> > This describes the future planned structure of Ganeti daemons.
> >
> > Signed-off-by: Michele Tartara <[email protected]>
> > ---
> >  Makefile.am            |    1 +
> >  doc/design-daemons.rst |  261
> ++++++++++++++++++++++++++++++++++++++++++++++++
> >  doc/design-draft.rst   |    1 +
> >  3 files changed, 263 insertions(+)
> >  create mode 100644 doc/design-daemons.rst
> >
> > diff --git a/Makefile.am b/Makefile.am
> > index 531197c..7714052 100644
> > --- a/Makefile.am
> > +++ b/Makefile.am
> > @@ -422,6 +422,7 @@ docinput = \
> >         doc/design-cpu-pinning.rst \
> >         doc/design-device-uuid-name.rst \
> >         doc/design-draft.rst \
> > +       doc/design-daemons.rst \
> >         doc/design-htools-2.3.rst \
> >         doc/design-http-server.rst \
> >         doc/design-impexp2.rst \
> > diff --git a/doc/design-daemons.rst b/doc/design-daemons.rst
> > new file mode 100644
> > index 0000000..9a9a489
> > --- /dev/null
> > +++ b/doc/design-daemons.rst
> > @@ -0,0 +1,261 @@
> > +==========================
> > +Ganeti daemons refactoring
> > +==========================
> > +
> > +.. contents:: :depth: 2
> > +
> > +This is a design document detailing the plan for refactoring the
> internal
> > +structure of Ganeti, and particularly the set of daemons it is divided
> into.
> > +
> > +
> > +Current state and shortcomings
> > +==============================
> > +
> > +Ganeti is comprised of a growing number of daemons, each dealing with
> part of
> > +the tasks the cluster has to face, and communicating with the other
> daemons
> > +using a variety of protocols.
> > +
> > +Specifically, as of Ganeti 2.8, the situation is as follows:
> > +
> > +``Master daemon (MasterD)``
> > +  It is responsible for managing the entire cluster, and it's written
> in Python.
> > +  It is executed on a single node (the master node). It receives the
> commands
> > +  given by the cluster administrator (through the remote API daemon or
> the
> > +  command line tools) over the LUXI protocol.  The master daemon is
> responsible
> > +  for creating and managing the jobs that will execute such commands,
> and for
> > +  managing the locks that ensure the cluster will not incur race
> conditions.
> > +
> > +  Each job is managed by a separate Python thread, that interacts with
> the node
> > +  daemons via RPC calls.
> > +
> > +  The master daemon is also responsible for managing the configuration
> of the
> > +  cluster, changing it when required by some job. It is also
> responsible for
> > +  copying the configuration to the other master candidates after
> updating it.
> > +
> > +``RAPI daemon (RapiD)``
> > +  It is written in Python and runs on the master node only. It waits for
> > +  requests issued remotely through the remote API protocol. Then, it
> forwards
> > +  them, using the LUXI protocol, to the master daemon (if they are
> commands) or
> > +  to the query daemon if they are queries about the configuration
> (including
> > +  live status) of the cluster.
> > +
> > +``Node daemon (NodeD)``
> > +  It is written in Python. It runs on all the nodes. It is responsible
> for
> > +  receiving the master requests over RPC and executing them, using the
> appropriate
> > +  backend (hypervisors, DRBD, LVM, etc.). It also receives requests
> over RPC for
> > +  the execution of queries gathering live data on behalf of the query
> daemon.
> > +
> > +``Configuration daemon (ConfD)``
> > +  It is written in Haskell. It runs on all the master candidates. Since
> the
> > +  configuration is replicated only on the master node, this daemon
> exists in
> > +  order to provide information about the configuration to nodes needing
> them.
> > +  The requests are done through ConfD's own protocol, HMAC signed,
> > +  implemented over UDP, and meant to be used by querying, in parallel, all
> the
> > +  master candidates (or a subset thereof) and getting the most up to
> date
> > +  answer. This is meant as a way to provide a robust service even in
> case master
> > +  is temporarily unavailable.
> > +
> > +``Query daemon (QueryD)``
> > +  It is written in Haskell. It runs on all the master candidates. It
> replies
> > +  to Luxi queries about the current status of the system, including
> live data it
> > +  obtains by querying the node daemons through RPCs.
> > +
> > +``Monitoring daemon (MonD)``
> > +  It is written in Haskell. It runs on all nodes, including the ones
> that are
> > +  not vm-capable. It is meant to provide information on the status of
> the
> > +  system. Such information is related only to the specific node the
> daemon is
> > +  running on, and it is provided as JSON encoded data over HTTP, to be
> easily
> > +  readable by external tools.
> > +  The monitoring daemon communicates with ConfD to get information
> about the
> > +  configuration of the cluster. The choice of communicating with ConfD
> instead
> > +  of MasterD allows it to obtain configuration information even when
> the cluster
> > +  is heavily degraded (e.g.: when master and some, but not all, of the
> master
> > +  candidates are unreachable).
> > +
> > +The current structure of the Ganeti daemons is inefficient because
> there are
> > +many different protocols involved, and each daemon needs to be able to
> use
> > +multiple ones, and has to deal with doing different things, thus sometimes
> > +making it unclear which daemon is responsible for performing a specific
> task.
> > +
> > +Also, with the current configuration, jobs are managed by the master
> daemon
> > +using python threads. This makes terminating a job after it has started
> a
> > +difficult operation, and it is the main reason why this is not possible
> yet.
> > +
> > +The master daemon currently has too many different tasks, that could be
> handled
> > +better if split among different daemons.
> > +
> > +
> > +Proposed changes
> > +================
> > +
> > +In order to improve on the current situation, a new daemon subdivision
> is
> > +proposed, and presented hereafter.
> > +
> > +.. digraph:: "new-daemons-structure"
> > +
> > +  {rank=same; RConfD LuxiD;}
> > +  {rank=same; Jobs rconfigdata;}
> > +  node [shape=box]
> > +  RapiD [label="RapiD [M]"]
> > +  LuxiD [label="LuxiD [M]"]
> > +  WConfD [label="WConfD [M]"]
> > +  Jobs [label="Jobs [M]"]
> > +  RConfD [label="RConfD [MC]"]
> > +  MonD [label="MonD [All]"]
> > +  NodeD [label="NodeD [All]"]
> > +  Clients [label="gnt-*\nclients [M]"]
> > +  p1 [shape=none, label=""]
> > +  p2 [shape=none, label=""]
> > +  p3 [shape=none, label=""]
> > +  p4 [shape=none, label=""]
> > +  configdata [shape=none, label="config.data"]
> > +  rconfigdata [shape=none, label="config.data\n[MC copy]"]
> > +  locksdata [shape=none, label="locks.data"]
> > +
> > +  RapiD -> LuxiD [label="LUXI"]
> > +  LuxiD -> WConfD [label="WConfD\nproto"]
> > +  LuxiD -> Jobs [label="fork/exec"]
> > +  Jobs -> WConfD [label="WConfD\nproto"]
> > +  Jobs -> NodeD [label="RPC"]
> > +  LuxiD -> NodeD [label="RPC"]
> > +  rconfigdata -> RConfD
> > +  configdata -> rconfigdata [label="sync via\nNodeD RPC"]
> > +  WConfD -> NodeD [label="RPC"]
> > +  WConfD -> configdata
> > +  WConfD -> locksdata
> > +  MonD -> RConfD [label="RConfD\nproto"]
> > +  Clients -> LuxiD [label="LUXI"]
> > +  p1 -> MonD [label="MonD proto"]
> > +  p2 -> RapiD [label="RAPI"]
> > +  p3 -> RConfD [label="RConfD\nproto"]
> > +  p4 -> Clients [label="CLI"]
> > +
>
> Note: while we'll keep job queue and config updates via rpc for now,
> as a default, we'll be able to better evaluate how to handle them
> after this design is implemented.
>
> > +``LUXI daemon (LuxiD)``
> > +  It will be written in Haskell. It will run on the master node and it
> will be
> > +  the only LUXI server, replying to all the LUXI queries. These
> include both
> > +  the queries about the live configuration of the cluster, previously
> served by
> > +  QueryD, and the commands actually changing the status of the cluster
> by
> > +  submitting jobs. Therefore, this daemon will also be the one
> responsible for
> > +  managing the job queue. When a job needs to be executed, the LuxiD
> will spawn
> > +  a separate process tasked with the execution of that specific job,
> thus making
> > +  it easier to terminate the job itself, if needed.  When a job
> requires locks,
> > +  LuxiD will request them from WConfD.
> > +  In order to keep availability of the cluster in case of failure of
> the master
> > +  node, LuxiD will replicate the job queue to the other master
> candidates, by
> > +  RPCs to the NodeD running there.
> > +
> > +``Configuration management daemon (WConfD)``
> > +  It will run on the master node and it will be responsible for the
> management
> > +  of the authoritative copy of the cluster configuration (that is, it
> will be
> > +  the daemon actually modifying the ``config.data`` file). All the
> requests of
> > +  configuration changes will have to pass through this daemon, and will
> be
> > +  performed using a LUXI-like protocol ("WConfD proto" in the graph.
> The exact
> > +  protocol will be defined in the separate design document that will
> detail the
> > +  WConfD separation).  Having a single point of configuration
> management will
> > +  also allow Ganeti to get rid of possible race conditions due to
> concurrent
> > +  modifications of the configuration.  When the configuration is
> updated, it
> > +  will have to push the received changes to the other master
> candidates, via
> > +  RPC, so that RConfD daemons and (in case of a failure on the master
> node) the
> > +  WConfD daemon on the new master can access an up-to-date version of
> it.  This
> > +  daemon will also be the one responsible for managing the locks,
> granting them
> > +  to the jobs requesting them, and taking care of freeing them up if
> the jobs
> > +  holding them crash or are terminated before releasing them.  In order
> to do
> > +  this, each job, after being spawned by LuxiD, will open a local unix
> socket
> > +  that will be used to communicate with it, and will be destroyed when
> the job
> > +  terminates.  LuxiD will be able to check, after a timeout, whether
> the job is
> > +  still running by connecting here, and to ask WConfD to forcefully
> remove the
> > +  locks if the socket is closed.
> > +  Also, WConfD should hold a serialized list of the locks and their
> owners in a
> > +  file (``locks.data``), so that it can keep track of their status in
> case it
> > +  crashes and needs to be restarted (by asking LuxiD which of them are
> still
> > +  running).
> > +  Interaction with this daemon will be performed using Unix sockets.
> > +
>
> Please mention what jobs will do in case they don't find an active
> luxid or wconfd, and how we plan for them to be more resilient.
>
> > +``Configuration query daemon (RConfD)``
> > +  It is written in Haskell, and it corresponds to the old ConfD. It
> will run on
> > +  all the master candidates and it will serve information about the
> static
> > +  configuration of the cluster (the one contained in ``config.data``).
> The
> > +  provided information will be highly available (as in: a response will
> be
> > +  available as long as a stable-enough connection between the client
> and at
> > +  least one working master candidate is available) and its freshness
> will be
> > +  best effort (the most recent reply from any of the master candidates
> will be
> > +  returned, but it might still be older than the one available through
> WConfD).
> > +  The information will be served through the ConfD protocol.
> > +
> > +``Rapi daemon (RapiD)``
> > +  It remains basically unchanged, with the only difference that all of
> its LUXI
> > +  queries are directed towards LuxiD instead of being split between
> MasterD and
> > +  QueryD.
> > +
> > +``Monitoring daemon (MonD)``
> > +  It remains unaffected by the changes in this design document. It will
> just get
> > +  some of the data it needs from RConfD instead of the old ConfD, but
> the
> > +  interfaces of the two are identical.
> > +
> > +``Node daemon (NodeD)``
> > +  It remains unaffected by the changes proposed in the design document.
> The only
> > +  difference being that it will receive its RPCs from LuxiD (for job
> queue
> > +  replication), from WConfD (for configuration replication) and from the
> > +  processes executing single jobs (for all the operations to be
> performed by
> > +  nodes) instead of receiving them just from MasterD.
> > +
> > +This restructuring will allow us to reorganize and improve the codebase,
> > +introducing cleaner interfaces and giving well defined and more
> restricted tasks
> > +to each daemon.
> > +
> > +Furthermore, having more well-defined interfaces will allow us to have
> easier
> > +upgrade procedures, and to work towards the possibility of upgrading
> single
> > +components of a cluster one at a time, without the need for immediately
> > +upgrading the entire cluster in a single step.
> > +
> > +
> > +Implementation
> > +==============
> > +
> > +While performing this refactoring, we aim to increase the amount of
> > +Haskell code, thus benefiting from the additional type safety provided
> by its
> > +wide compile-time checks. In particular, all the job queue management
> and the
> > +configuration management daemon will be written in Haskell, taking over
> the role
> > +currently fulfilled by Python code executed as part of MasterD.
> > +
> > +The changes described by this design document are quite extensive,
> therefore they
> > +will not be implemented all at the same time, but through a sequence of
> steps,
> > +leaving the codebase in a consistent and usable state.
> > +
> > +#. Rename QueryD to LuxiD.
> > +   A part of LuxiD, the one replying to configuration
> > +   queries including live information about the system, already exists
> in the
> > +   form of QueryD. This is being renamed to LuxiD, and will form the
> first part
> > +   of the new daemon. NB: this is happening in Ganeti 2.8.
> > +
>
> s/in Ganeti 2.8/starting from Ganeti 2.8/
> Note that not all queries are implemented in 2.8's luxid, and indeed
> we won't have them fully finished before 2.10 or 2.11.
>
> > +#. Let LuxiD be the interface for the queries and MasterD be their
> executor.
> > +   Currently, MasterD is the only responsible for receiving and
> executing LUXI
> > +   queries, and for managing the jobs they create.
> > +   Receiving the queries and managing the job queue will be extracted
> from
> > +   MasterD into LuxiD.
> > +   Actually executing jobs will still be done by MasterD, that contains
> all the
> > +   logic for doing that and for properly managing locks and the
> configuration.
> > +
>
> A design decision will need to be made here, about how to properly
> decide which jobs to send over and how to rate-limit them.
>
> > +#. Extract WConfD from MasterD.
> > +   The logic for managing the configuration file is factored out to the
> > +   dedicated WConfD daemon. All configuration changes, currently
> executed
> > +   directly by MasterD, will be changed to be IPC requests sent to the
> new
> > +   daemon.
> > +
> > +#. Extract locking management from MasterD.
> > +   The logic for managing and granting locks is extracted to WConfD as
> well.
> > +   Locks will not be taken directly anymore, but asked via IPC to
> WConfD.
> > +   This step can be executed on its own or at the same time as the
> previous one.
> > +
> > +#. Jobs are executed as processes.
> > +   The logic for running jobs is rewritten so that each job can be
> managed by an
> > +   independent process. LuxiD will spawn a new (Python) process for
> every single
> > +   job. The RPCs will remain unchanged, and the LU code will stay as is
> as much
> > +   as possible.
> > +
>
> Please mention that masterd in the current sense will disappear at the
> implementation of this point, but not before.
>
> In general, looks good.
> Thanks,
>
> Guido
>


Interdiff:
diff --git a/doc/design-daemons.rst b/doc/design-daemons.rst
index 9a9a489..9e87194 100644
--- a/doc/design-daemons.rst
+++ b/doc/design-daemons.rst
@@ -142,7 +142,8 @@ proposed, and presented hereafter.
   LuxiD will request them from WConfD.
   In order to keep availability of the cluster in case of failure of the master
   node, LuxiD will replicate the job queue to the other master candidates, by
-  RPCs to the NodeD running there.
+  RPCs to the NodeD running there (the choice of RPCs for this task might be
+  reviewed at a second time, after implementing this design).

 ``Configuration management daemon (WConfD)``
   It will run on the master node and it will be responsible for the management
@@ -155,8 +156,9 @@ proposed, and presented hereafter.
   also allow Ganeti to get rid of possible race conditions due to concurrent
   modifications of the configuration.  When the configuration is updated, it
   will have to push the received changes to the other master candidates, via
-  RPC, so that RConfD daemons and (in case of a failure on the master node) the
-  WConfD daemon on the new master can access an up-to-date version of it.  This
+  RPCs, so that RConfD daemons and (in case of a failure on the master node)
+  the WConfD daemon on the new master can access an up-to-date version of it
+  (the choice of RPCs for this task might be reviewed at a second time). This
   daemon will also be the one responsible for managing the locks, granting them
   to the jobs requesting them, and taking care of freeing them up if the jobs
   holding them crash or are terminated before releasing them.  In order to do
@@ -226,7 +228,9 @@ leaving the codebase in a consistent and usable state.
    A part of LuxiD, the one replying to configuration
    queries including live information about the system, already exists in the
    form of QueryD. This is being renamed to LuxiD, and will form the first part
-   of the new daemon. NB: this is happening in Ganeti 2.8.
+   of the new daemon. NB: this is happening starting from Ganeti 2.8. At the
+   beginning, only the already existing queries will be replied to by LuxiD.
+   More queries will be implemented in the next versions.

 #. Let LuxiD be the interface for the queries and MasterD be their executor.
    Currently, MasterD is the only responsible for receiving and executing LUXI
@@ -235,6 +239,8 @@ leaving the codebase in a consistent and usable state.
    MasterD into LuxiD.
    Actually executing jobs will still be done by MasterD, that contains all the
    logic for doing that and for properly managing locks and the configuration.
+   A separate design document will detail how the system will decide which jobs
+   to send over for execution, and how to rate-limit them.

 #. Extract WConfD from MasterD.
    The logic for managing the configuration file is factored out to the
@@ -252,6 +258,27 @@ leaving the codebase in a consistent and usable state.
    independent process. LuxiD will spawn a new (Python) process for every single
    job. The RPCs will remain unchanged, and the LU code will stay as is as much
    as possible.
+   MasterD will cease to exist as a daemon on its own at this point, but not
+   before.
+
+Further considerations
+======================
+
+There is a possibility that a job will finish performing its task while LuxiD
+and/or WConfD will not be available.
+In order to deal with this situation, each job will write the results of its
+execution on a file. The name of this file will be known to LuxiD before
+starting the job, and will be stored together with the job ID, and the
+name of the job-unique socket.
+
+The job, upon ending its execution, will signal LuxiD (through the socket), so
+that it can read the result of the execution and release the locks as needed.
+
+In case LuxiD is not available at that time, the job will just terminate without
+signalling it, and writing the results on file as usual. When a new LuxiD
+becomes available, it will have the most up-to-date list of running jobs
+(received via replication from the former LuxiD), and go through it, cleaning up
+all the terminated jobs.


 .. vim: set textwidth=72 :
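
To make the "Further considerations" flow a bit more concrete, here is a
rough Python sketch. It is not part of the patch and only illustrates the
idea under some assumptions: the function names, the JSON result format and
the ``notify_luxid`` callable (standing in for the job-unique socket) are all
hypothetical. A job with no result file is simply treated as still running
here; telling a crashed job apart from a running one is the task of the
socket check described in the design and is not modelled.

```python
import json
import os


def finish_job(result_path, result, notify_luxid):
    """Write the job result to its file, then try to signal LuxiD.

    notify_luxid is a hypothetical callable standing in for the
    job-unique socket; it is expected to raise OSError when LuxiD
    cannot be reached. Returns True if LuxiD was signalled.
    """
    # Write to a temporary file and rename it, so that a restarted
    # LuxiD never reads a partially written result.
    tmp_path = result_path + ".tmp"
    with open(tmp_path, "w") as handle:
        json.dump(result, handle)
    os.replace(tmp_path, result_path)
    try:
        notify_luxid()
        return True
    except OSError:
        # LuxiD is not available: terminate without signalling it,
        # the result file is the only record left behind.
        return False


def cleanup_terminated(running_jobs):
    """What a new LuxiD might do with the replicated list of jobs.

    running_jobs maps job ID to result file path (assumed layout).
    Returns (results of terminated jobs, jobs still considered running).
    """
    finished = {}
    still_running = {}
    for job_id, path in running_jobs.items():
        if os.path.exists(path):
            with open(path) as handle:
                finished[job_id] = json.load(handle)
        else:
            still_running[job_id] = path
    return finished, still_running
```

The result file is written before any attempt to signal, so the two cases
(LuxiD reachable or not) differ only in whether the notification succeeds.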


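Similarly, the lock bookkeeping the design assigns to WConfD (a serialized
list of locks and their owners in ``locks.data``, cleaned up after a restart
by asking LuxiD which jobs are still running) could look roughly like the
sketch below. Again, this is only an illustration: the ``LockTable`` class,
the JSON on-disk format and the exclusive-only locks are simplifying
assumptions, not what WConfD will actually implement.

```python
import json
import os


class LockTable:
    """Toy lock table persisted to a file, loosely modelled on locks.data."""

    def __init__(self, path):
        self.path = path
        self.owners = {}  # lock name -> ID of the job holding it
        # After a crash and restart, reload the last saved state.
        if os.path.exists(path):
            with open(path) as handle:
                self.owners = json.load(handle)

    def _save(self):
        # Rename over the old file so the saved state is never partial.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as handle:
            json.dump(self.owners, handle)
        os.replace(tmp_path, self.path)

    def acquire(self, lock, job_id):
        """Grant the lock if it is free; return whether job_id holds it."""
        holder = self.owners.setdefault(lock, job_id)
        if holder == job_id:
            self._save()
        return holder == job_id

    def release_dead(self, running_job_ids):
        """Drop the locks whose owner is not among the running jobs.

        running_job_ids is the answer we assume LuxiD would give when
        asked which jobs are still running. Returns the released locks.
        """
        dead = sorted(lock for lock, job in self.owners.items()
                      if job not in running_job_ids)
        for lock in dead:
            del self.owners[lock]
        self._save()
        return dead
```

For example, a restarted table would be built from the same path and then
given the set of running job IDs, releasing whatever the vanished jobs held.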

Thanks,
Michele
-- 
Google Germany GmbH
Dienerstr. 12
80331 München

Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Graham Law, Christine Elizabeth Flores
