Hi!

On Tue, 29 Sep 2015 at 11:21 'Klaus Aehlig' via ganeti-devel <
[email protected]> wrote:

> Ganeti provides high availability by ensuring N+1 redundancy is
> maintained. In some situations, however, like planning larger
> maintenance events, it is desirable to have an estimate for how
> many nodes can be removed with the cluster remaining operational.
> Add a design for this concept.
>
> Signed-off-by: Klaus Aehlig <[email protected]>
> ---
>  Makefile.am                   |  1 +
>  doc/design-draft.rst          |  1 +
>  doc/design-n-m-redundancy.rst | 71 +++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 73 insertions(+)
>  create mode 100644 doc/design-n-m-redundancy.rst
>
> diff --git a/Makefile.am b/Makefile.am
> index a506296..a56135a 100644
> --- a/Makefile.am
> +++ b/Makefile.am
> @@ -703,6 +703,7 @@ docinput = \
>         doc/design-multi-reloc.rst \
>         doc/design-multi-storage-htools.rst \
>         doc/design-multi-version-tests.rst \
> +       doc/design-n-m-redundancy.rst \
>         doc/design-network.rst \
>         doc/design-network2.rst \
>         doc/design-node-add.rst \
> diff --git a/doc/design-draft.rst b/doc/design-draft.rst
> index 353f0cd..447b4e5 100644
> --- a/doc/design-draft.rst
> +++ b/doc/design-draft.rst
> @@ -27,6 +27,7 @@ Design document drafts
>     design-repaird.rst
>     design-migration-speed-hbal.rst
>     design-memory-over-commitment.rst
> +   design-n-m-redundancy.rst
>
>  .. vim: set textwidth=72 :
>  .. Local Variables:
> diff --git a/doc/design-n-m-redundancy.rst b/doc/design-n-m-redundancy.rst
> new file mode 100644
> index 0000000..696bd5e
> --- /dev/null
> +++ b/doc/design-n-m-redundancy.rst
> @@ -0,0 +1,71 @@
> +===========================
> +Checking for N+M redundancy
> +===========================
> +
> +.. contents:: :depth: 4
> +
> +This document describes how the level of redundancy is estimated
> +in Ganeti.
> +
> +
> +Current state and shortcomings
> +==============================
> +
> +Ganeti keeps the cluster N+1 redundant, also taking into account
> +:doc:`design-shared-storage-redundancy`. However, e.g., for planning
> +maintenance, it is sometimes desirable to know how many node
> +losses the cluster can recover from. This is also useful information
> +when operating big clusters and expecting long times for hardware repair.
>

To give a bit of context to readers not familiar with 'N+1' redundancy, it
might be good to add one sentence describing the term in contrast to N+M
redundancy.


> +
> +
> +Proposed changes
> +================
> +
> +Higher redundancy as a sequential concept
> +-----------------------------------------
> +
> +The intuitive meaning of an N+M redundant cluster is that M nodes can
> +fail without instances being lost. However, when DRBD is used, the
> +failure of just 2 nodes can already cause the complete loss of an
> +instance. Therefore, the best we can hope for is to be able to recover
> +from M sequential failures.
> +
> +Definition of M+M redundancy
>

*N*+M


> +----------------------------
> +
> +We keep the definition of :doc:`design-shared-storage-redundancy`. Moreover,
> +for M a non-negative integer, we define a cluster to be N+(M+2) redundant,
> +if after draining any node the standard rebalancing procedure (as, e.g.,
> +provided by `hbal`) will fully evacuate that node and result in an N+(M+1)
> +redundant cluster.
>

While this is correct, it is a rather mathematical definition. I think it
would be nice to explain this in an admin-understandable way as well (such
as 'N+M redundancy means that M nodes can fail and still no instances are
lost').
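
To make the recursive definition more tangible, here is a minimal sketch of
how I read it. Everything in it is an assumption made for illustration: the
toy model (one memory capacity per node, all instance memory treated as a
single fungible total) and the names ``fits``, ``is_n_plus_one`` and
``is_n_plus`` are hypothetical and not taken from Ganeti or htools. In
particular, ``fits`` only stands in for "rebalancing fully evacuates the
node".

```python
from typing import Dict, FrozenSet


def fits(capacity: Dict[str, int], nodes: FrozenSet[str], load: int) -> bool:
    """Toy stand-in for 'rebalancing succeeds': total capacity covers the load."""
    return sum(capacity[n] for n in nodes) >= load


def is_n_plus_one(capacity: Dict[str, int], nodes: FrozenSet[str], load: int) -> bool:
    """Toy N+1 check: losing any single node still leaves enough capacity."""
    return bool(nodes) and all(fits(capacity, nodes - {lost}, load) for lost in nodes)


def is_n_plus(m: int, capacity: Dict[str, int], nodes: FrozenSet[str], load: int) -> bool:
    """N+m redundancy, following the recursive shape of the definition.

    For m >= 2: draining ANY node must be possible, and the cluster that
    remains must itself be N+(m-1) redundant.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    if m == 1:
        return is_n_plus_one(capacity, nodes, load)
    return bool(nodes) and all(
        fits(capacity, nodes - {drained}, load)
        and is_n_plus(m - 1, capacity, nodes - {drained}, load)
        for drained in nodes
    )
```

With four nodes of capacity 10 each and a total load of 20, this toy model
reports the cluster as N+2 but not N+3 redundant. Note that the recursion
visits every order of node losses, which is the blow-up the estimation
section of the design is meant to avoid.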


> +
> +Independence of Groups
> +----------------------
> +
> +Immediately from the definition, we see that the redundancy level, i.e.,
> +the maximal M such that the cluster is N+M redundant, can be computed
> +in a group-by-group manner: the standard balancing algorithm will never
> +move instances between node groups. The redundancy level of the cluster
> +is then the minimum of the redundancy level of the independent groups.
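
The group-wise computation described here boils down to taking a minimum.
A minimal sketch, assuming the per-group levels have already been computed
somehow (the function name and the dictionary layout are hypothetical, not
an existing Ganeti interface):

```python
from typing import Dict


def cluster_redundancy_level(group_levels: Dict[str, int]) -> int:
    """Return the cluster-wide level: the weakest node group decides.

    group_levels maps a node group name to its redundancy level M.
    An empty mapping raises ValueError, since min() has nothing to compare.
    """
    return min(group_levels.values())
```

For example, groups at levels 3, 1 and 2 give a cluster level of 1.
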
> +
> +Estimation of the redundancy level
> +----------------------------------
> +
> +The definition of N+M redundancy requires considering M failures in
> +arbitrary order, and thus super-exponentially many cases for
> +large M. As, however, balancing moves instances anyway, the redundancy
> +level mainly depends on the amount of node resources available to the
> +instances in a node group. So we can get a good approximation of the
> +redundancy level of a node group by only considering draining one largest
> +node in that group. This is how Ganeti will estimate the redundancy level.
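
As I understand the estimate, it replaces "any node" by "one largest node"
at each step. A rough sketch under strong simplifying assumptions: each node
is reduced to a single capacity number, instance memory is one fungible
total, and "still N+1 redundant" is approximated by "the load fits after
losing the largest node". The helper names are hypothetical, and this is not
how htools models DRBD secondaries or actual placement.

```python
from typing import List


def fits_after_losing_largest(capacities: List[int], load: int) -> bool:
    """Toy N+1 check: the load still fits if the largest node disappears."""
    if not capacities:
        return False
    return sum(capacities) - max(capacities) >= load


def estimate_redundancy_level(capacities: List[int], load: int) -> int:
    """Estimate M for one node group by repeatedly draining a largest node.

    Each round that still passes the toy N+1 check raises the level by one;
    then one largest node is drained and the check is repeated.
    """
    remaining = sorted(capacities)
    level = 0
    while remaining and fits_after_losing_largest(remaining, load):
        level += 1
        remaining.pop()  # drain one largest node (last element after sorting)
    return level
```

For four nodes of capacity 10 and a load of 20 this yields a level of 2, and
the loop runs at most once per node instead of once per ordering of failures.
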
> +
> +Modifications to existing tools
> +-------------------------------
> +
> +As redundancy levels higher than N+1 are mainly about planning capacity,
> +the level of redundancy only needs to be computed on demand. Hence, we
> +keep the tool changes minimal.
> +
> +- ``hcheck`` will report the level of redundancy for each node group as
> +  a new output parameter
> +
> +The rest of Ganeti will not be changed.
> --
> 2.6.0.rc2.230.g3dd15c0
>
>
Rest LGTM, thanks

Cheers,
Helga
-- 

Helga Velroyen
Software Engineer
[email protected]

Google Germany GmbH
Dienerstraße 12
80331 München

Geschäftsführer: Matthew Scott Sucherman, Paul Terence Manicle
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
