Rename, in the design, the repair daemon to maintenance daemon.
This better reflects the daemon's role: besides repairs, it will
also take care of other maintenance operations, like
balancing.
Signed-off-by: Klaus Aehlig <[email protected]>
---
doc/design-repaird.rst | 44 +++++++++++++++++++++++---------------------
1 file changed, 23 insertions(+), 21 deletions(-)
diff --git a/doc/design-repaird.rst b/doc/design-repaird.rst
index 3cb6bd8..bd91cc0 100644
--- a/doc/design-repaird.rst
+++ b/doc/design-repaird.rst
@@ -1,11 +1,13 @@
-====================
-Ganeti Repair Daemon
-====================
+=========================
+Ganeti Maintenance Daemon
+=========================
.. contents:: :depth: 4
This design document outlines the implementation of a new Ganeti
-daemon coordinating repairs on a cluster.
+daemon coordinating all maintenance operations on a cluster
+(rebalancing, disk activation, ERROR_down handling, node-repair
+actions).
Current state and shortcomings
@@ -32,7 +34,7 @@ swap.
Proposed changes
================
-We propose the addition of an additional daemon, called ``repaird`` that will
+We propose the addition of a new daemon, called ``maintd``, that will
coordinate the work for repair needs of individual nodes. The information
about the work to be done will be obtained from a dedicated data collector
via the :doc:`design-monitoring-agent`.
@@ -70,7 +72,7 @@ attempting live migrations, respectively.
details
.......
-An opaque JSON value that the repair daemon will just pass through and
+An opaque JSON value that the maintenance daemon will just pass through and
export. It is intended to contain information about the type of repair
that needs to be done after the respective Ganeti action is finished.
E.g., it might contain information about which piece of hardware is to be
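As a purely illustrative example (the field names are invented here, not
part of the design), such an opaque ``details`` value could look like:

```json
{
  "component": "disk",
  "bay": 3,
  "action": "swap"
}
```

The maintenance daemon only passes this value through unmodified; its
interpretation is left to the external repair tooling.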
@@ -99,7 +101,7 @@ directory will be ``/etc/ganeti/node-diagnose-commands``.
Result forging
..............
-As the repair daemon will take real Ganeti actions based on the diagnose
+As the maintenance daemon will take real Ganeti actions based on the diagnose
reported by the self-diagnose script through the monitoring daemon, we
need to verify integrity of such reports to avoid denial-of-service by
fraudulent error reports. Therefore, the monitoring daemon will sign
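To illustrate the kind of integrity protection described here, a minimal
sketch of signing and verifying a diagnose report with an HMAC follows.
The function names, the shared-key scheme, and the report shape are
assumptions for illustration, not Ganeti's actual implementation.

```python
import hashlib
import hmac
import json


def sign_report(report, key):
    """Serialize a diagnose report and attach an HMAC-SHA256 signature.

    `report` is the JSON value produced by the self-diagnose script;
    `key` is a secret assumed to be shared between the monitoring and
    maintenance daemons.
    """
    payload = json.dumps(report, sort_keys=True).encode("utf-8")
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"payload": report, "hmac": sig}


def verify_report(signed, key):
    """Recompute the HMAC over the payload and compare in constant time."""
    payload = json.dumps(signed["payload"], sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["hmac"])
```

A consumer would reject any report whose signature does not verify,
which prevents a node from forging diagnoses on behalf of another.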
@@ -117,27 +119,27 @@
being requested) for this event and forget about it, as soon as it is
no longer observed.
Corresponding Ganeti actions will be initiated and success or failure of
-these Ganeti jobs monitored. All jobs submitted by the repair daemon
-will have the string ``gnt:daemon:repaird`` and the event identifier
+these Ganeti jobs monitored. All jobs submitted by the maintenance daemon
+will have the string ``gnt:daemon:maintd`` and the event identifier
in the reason trail, so that :doc:`design-optables` is possible.
Once a job fails, no further jobs will be submitted for this event
to avoid further damage; the repair action is considered failed in this case.
Once all requested actions succeeded, or one failed, the node where the
-event as observed will be tagged by a tag starting with ``repaird:repairready:``
-or ``repaird:repairfailed:``, respectively, where the event identifier is
+event as observed will be tagged by a tag starting with ``maintd:repairready:``
+or ``maintd:repairfailed:``, respectively, where the event identifier is
encoded in the rest of the tag. On the one hand, it can be used as an
additional verification whether a node is ready for a specific repair.
However, the main purpose is to provide a simple and uniform interface
-to acknowledge an event; once that tag is removed, the repair daemon
+to acknowledge an event; once that tag is removed, the maintenance daemon
will forget about this event, as soon as it is no longer observed by
any monitoring daemon.
-Repair daemon
--------------
+Maintenance daemon
+------------------
-The new daemon ``repaird`` will be running on the master node only. It will
+The new daemon ``maintd`` will be running on the master node only. It will
verify the master status of its node by popular vote in the same way as all the
other master-only daemons. If started on a non-master node, it will exit
immediately with exit code ``exitNotmaster``, i.e., 11.
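To make the tag-based acknowledgement interface concrete, a repair-status
tag of the shape described in this hunk could be parsed with a helper
along these lines (a sketch; the function name and return shape are
illustrative, not part of the Ganeti API):

```python
def parse_repair_tag(tag):
    """Split a maintd repair-status tag into (status, event_id).

    Tags have the shape ``maintd:repairready:<event-id>`` or
    ``maintd:repairfailed:<event-id>``; any other tag yields None.
    """
    for status in ("repairready", "repairfailed"):
        prefix = "maintd:%s:" % status
        if tag.startswith(prefix):
            return status, tag[len(prefix):]
    return None
```

Removing such a tag from the node acknowledges the event, after which
the daemon forgets it once it is no longer reported.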
@@ -180,7 +182,7 @@ as a JSON object with at least the following information.
is still observed.
+ ``failed`` At least one of the submitted jobs has failed. To avoid further
- damage, the repair daemon will not take any further action for this event.
+ damage, the maintenance daemon will not take any further action for this event.
+ ``completed`` All Ganeti actions associated with this event have been
completed successfully, including tagging the node.
@@ -195,7 +197,7 @@ State
~~~~~
As repairs, especially those involving physically swapping hardware, can take
-a long time, the repair daemon needs to store its state persistently. As we
+a long time, the maintenance daemon needs to store its state persistently. As we
cannot exclude master-failovers during a repair cycle, it does so by storing
it as part of the Ganeti configuration.
@@ -205,9 +207,9 @@ The SSConf will not be changed.
Superseding ``harep`` and implicit balancing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-To have a single point coordinating all repair actions, the new repair daemon
+To have a single point coordinating all repair actions, the new daemon
will also have the ability to take over the work currently done by ``harep``.
-To allow a smooth transition, ``repaird`` when carrying out ``harep``'s duties
+To allow a smooth transition, ``maintd``, when carrying out ``harep``'s duties,
will add tags in precisely the same way as ``harep`` does.
As the new daemon will have to move instances, it will also have the ability
to balance the cluster in a way coordinated with the necessary evacuation
@@ -222,7 +224,7 @@ continue to exist unchanged as part of the ``htools``.
Mode of operation
~~~~~~~~~~~~~~~~~
-The repair daemon will at fixed interval poll the monitoring daemons for
+The maintenance daemon will poll the monitoring daemons at fixed intervals for
the value of the self-diagnose data collector; if load-based balancing is
enabled, it will also collect the load data needed.
@@ -232,7 +234,7 @@
A new round will be started if all jobs of the old round have finished, and
there is an unhandled repair event or the cluster is unbalanced enough (provided
that autobalancing is enabled).
-In each round, ``repaird`` will first determine the most invasive action for
+In each round, ``maintd`` will first determine the most invasive action for
each node; despite the self-diagnose collector summing observations in a single
action recommendation, a new, more invasive recommendation can be issued before
the handling of the first recommendation is finished. For all nodes to be
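The per-node aggregation described in this hunk amounts to picking the
maximum recommendation under an invasiveness ordering. A sketch follows;
the action names and their ordering are placeholders inferred from the
surrounding text, not the actual (Haskell) implementation:

```python
# Assumed ordering of repair actions, from least to most invasive.
INVASIVENESS = ["Ok", "live-repair", "evacuate", "evacuate-failover"]


def most_invasive(recommendations):
    """Return the most invasive of a node's pending recommendations.

    `recommendations` is a non-empty list of action names; a more
    invasive recommendation supersedes a milder one that is still
    being handled.
    """
    return max(recommendations, key=INVASIVENESS.index)
```

This mirrors the rule that a new, more invasive recommendation can
arrive before an earlier, milder one has been fully handled.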
--
2.4.3.573.g4eafbef