On Wed, Jun 23, 2010 at 04:47:32PM +0300, [email protected] wrote:
> From: Apollon Oikonomopoulos <[email protected]>
> 
> Add doc/design-shared-storage.rst to document the proposed changes and update
> Makefile.am respectively.
> 
> Signed-off-by: Apollon Oikonomopoulos <[email protected]>
> ---
>  Makefile.am                   |    1 +
>  doc/design-shared-storage.rst |  146 +++++++++++++++++++++++++++++++++++++++++
>  doc/index.rst                 |    1 +
>  3 files changed, 148 insertions(+), 0 deletions(-)
>  create mode 100644 doc/design-shared-storage.rst
> 
> diff --git a/Makefile.am b/Makefile.am
> index b8ad0b2..907a36e 100644
> --- a/Makefile.am
> +++ b/Makefile.am
> @@ -171,6 +171,7 @@ docrst = \
>       doc/design-2.1.rst \
>       doc/design-2.2.rst \
>       doc/design-cluster-merger.rst \
> +     doc/design-shared-storage.rst \
>       doc/devnotes.rst \
>       doc/glossary.rst \
>       doc/hooks.rst \
> diff --git a/doc/design-shared-storage.rst b/doc/design-shared-storage.rst
> new file mode 100644
> index 0000000..7ed5864
> --- /dev/null
> +++ b/doc/design-shared-storage.rst
> @@ -0,0 +1,146 @@
> +======================================
> +Ganeti shared storage support for 2.2+
> +======================================
> +
> +This document describes the changes in Ganeti 2.2+ compared to Ganeti
> +2.2 storage model.
> +
> +.. contents:: :depth: 4
> +
> +Objective
> +=========
> +
> +The aim is to introduce support for externally mirrored, shared storage.
> +This includes two distinct disk templates:
> +
> +- Shared filesystem support using regular files typically residing on a
> +  networked or cluster filesystem (e.g. NFS, AFS, Ceph, OCFS2, etc.).
> +- Shared block device support, with instance images being shared block
> +  devices, typically LUNs residing on a SAN appliance or remote iSCSI
> +  targets.
> +
> +Background
> +==========
> +DRBD is the only shared storage supported by Ganeti. DRBD offers the
> +advantages of high availability with commodity hardware at the cost of
> +high network I/O for block-level synchronization between hosts. DRBD's
> +master-slave model has greatly influenced Ganeti's design, primarily by
> +introducing the concept of primary and secondary nodes and thus defining
> +an instance's mobility domain.
> +
> +Although DRBD has many advantages, many sites choose to use networked
> +storage appliances for Virtual Machine hosting, such as SAN and/or NAS,
> +which provide shared storage without the administrative overhead of DRBD
> +nor the limitation of a 1:1 master-slave setup. Furthermore, new
> +distributed filesystems such as Ceph are becoming viable alternatives to
> +expensive storage appliances. Support for both modes of operation, i.e.
> +shared block storage and shared file storage backend would make Ganeti a
> +robust choice for high-availability virtualization clusters.
> +
> +Throughout this document, the term "externally mirrored storage" will
> +refer to both modes of shared storage, suggesting that Ganeti does not
> +need to take care about the mirroring process from one host to another.
> +
> +Use cases
> +=========
> +We consider the following use cases:
> +
> +- A virtualization cluster with FibreChannel shared storage, mapping 1+
> +  LUN per instance to the whole cluster
> +- A virtualization cluster with instance images stored as files on an
> +  NFS server
> +- A virtualization cluster storing instance images on a Ceph cluster
> +
> +Design Overview
> +===============
> +
> +The design entails the following procedures:
> +
> +- Refactoring of all code referring to constants.DTS_NET_MIRROR
> +- Obsolescence of the primary-secondary concept for externally mirrored
> +  storage.
> +- Introduction of a disk template and a storage class for shared block
> +  device backends, providing methods for the various stages of a block
> +  device's and instance's life-cycle. In order to provide storage
> +  provisioning capabilities for various SAN appliances, external helpers
> +  in the form of a "storage driver" will be possibly introduced as well.
> +- Introduction of a shared file storage disk template for use with networked
> +  filesystems.
> +
> +Refactoring of all code referring to constants.DTS_NET_MIRROR
> +=============================================================
> +
> +Currently, all storage-related decision-making depends on a number of
> +frozensets in lib/constants.py, typically constants.DTS_NET_MIRROR.
> +However, constants.DTS_NET_MIRROR is used to signify two different
> +attributes:
> +
> +- A storage device that is shared
> +- A storage device whose mirroring is supervised by Ganeti
> +
> +We propose the introduction of two new frozensets to ease
> +decision-making:
> +
> +- constants.DTS_EXT_MIRROR, holding externally mirrored disk templates
> +- constants.DTS_MIRRORED, being a union of constants.DTS_EXT_MIRROR and
> +  DTS_NET_MIRROR.

At this point, I wonder whether it would make sense to rename
DTS_NET_MIRROR to DTS_INT_MIRROR (as in, internally managed), so that
it pairs naturally with DTS_EXT_MIRROR.
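
To make the naming concrete, here is a rough sketch of how the sets could
look after such a rename. Only the DTS_* set names come from the patch and
from my suggestion; DT_SHARED_FILE, DT_BLOCK, their string values and the
two helper functions are made up for illustration.

```python
# Sketch only: DT_SHARED_FILE, DT_BLOCK and the helpers are hypothetical;
# the string values are illustrative.
DT_DRBD8 = "drbd"
DT_SHARED_FILE = "sharedfile"   # hypothetical template name
DT_BLOCK = "blockdev"           # hypothetical template name

# Templates whose mirroring is supervised by Ganeti itself
DTS_INT_MIRROR = frozenset([DT_DRBD8])
# Templates whose mirroring happens outside Ganeti
DTS_EXT_MIRROR = frozenset([DT_SHARED_FILE, DT_BLOCK])
# Union of the two: anything that can be failed over or migrated
DTS_MIRRORED = DTS_INT_MIRROR | DTS_EXT_MIRROR


def can_move(disk_template):
  """Mobility check: is failover or migration possible at all?"""
  return disk_template in DTS_MIRRORED


def needs_sync(disk_template):
  """Sync check: does Ganeti itself have to drive the mirroring?"""
  return disk_template in DTS_INT_MIRROR
```

With this split, mobility code only ever looks at DTS_MIRRORED and sync
code only at DTS_INT_MIRROR, so neither needs to know which concrete
templates exist.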

> +Thus, checks could be grouped into the following categories:
> +
> +- Mobility checks, like whether an instance failover or migration is
> +  possible should check against constants.DTS_MIRRORED
> +- Syncing actions should be performed only for templates in
> +  constants.DTS_NET_MIRROR
> +
> +Obsolescence of the primary-secondary node model
> +================================================
> +
> +The primary-secondary node concept seems to have evolved through the use
> +of DRBD. In a globally shared storage framework without need for
> +external sync (e.g. SAN, NAS, etc.), such a notion does not apply for the
> +following reasons:
> +
> +1. Access to the storage does not necessarily imply different roles for
> +   the nodes (e.g. primary vs secondary).
> +2. The same storage is available to potentially more than 2 nodes. Thus,
> +   an instance backed by a SAN LUN for example may actually migrate to
> +   any of the other nodes and not just a pre-designated failover node.
> +
> +The proposed solution is using the iallocator framework for run-time
> +decision making during migration and failover, for nodes with disk
> +templates in constants.DTS_EXT_MIRROR. Modifications to gnt-instance and
> +gnt-node will be required to accept target node and/or iallocator
> +specification for these operations. Minor modifications of the
> +iallocator protocol may be required, such as the specification of a
> +"multiple relocation" mode.

A little more detail here would be good. For example, how are the N+1
algorithms modified? And how do we determine exactly which nodes an
instance can be migrated to (it might not always be all of them)?
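
As an example of the level of detail I have in mind, the node selection
could be spelled out as a filter along the following lines. This is only
a sketch: the dictionary layout and the idea of a per-node set of
reachable storage pools are my assumptions, not something from the patch
or from the current iallocator protocol, and it does not address N+1 at
all.

```python
# Sketch only: the data layout below is hypothetical and is not the
# iallocator protocol.

def migration_candidates(instance, nodes):
  """Return the sorted names of nodes the instance could move to.

  A node qualifies if it is not the instance's current node, is online,
  can reach every storage pool the instance uses and has enough free
  memory to run the instance.
  """
  result = []
  for node in nodes:
    if node["name"] == instance["node"]:
      continue  # already running here
    if node["offline"]:
      continue
    if not instance["pools"] <= node["pools"]:
      continue  # some storage is not visible from this node
    if node["free_mem"] < instance["mem"]:
      continue
    result.append(node["name"])
  return sorted(result)
```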

The obsolescence is good indeed, and might be applied even to DRBD. We
have long discussed a failover-to-any model for DRBD, but so far there
has not been much need for it. We might be able to do it with this
change…

> +Introduction of a shared block device storage class
> +===================================================
> +
> +In order to facilitate shared block device support, a new storage class
> +will be introduced to directly handle block devices and a shared block
> +device disk template will be built based on the storage model.
> +
> +In order to provide storage provisioning and manipulation (e.g. growing,
> +renaming) capabilities, each instance's disk template can possibly be
> +associated with an external "storage driver" which, based on the
> +instance's configuration and tags, will perform all supported storage
> +operations using auxiliary means (e.g. XML-RPC, ssh, etc.).
> +
> +A "storage driver" will have to provide the following methods:
> +
> +- Create a disk
> +- Remove a disk
> +- Rename a disk
> +- Resize a disk
> +- Attach a disk to a given node
> +- Detach a disk from a given node

Funny, I was just thinking that we should be able to migrate DRBD to
being externally managed with this framework. I'm not sure whether it
makes sense though :)
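
If we went that way, DRBD would become just one more driver behind the
same interface. A rough sketch of what the six methods listed above could
look like as an abstract class follows. All class names, method
signatures and the in-memory example driver are hypothetical; the patch
only names the operations.

```python
import abc


class StorageDriver(abc.ABC):
  """Hypothetical interface for the six operations listed in the design."""

  @abc.abstractmethod
  def Create(self, name, size):
    """Create a disk of the given size."""

  @abc.abstractmethod
  def Remove(self, name):
    """Remove a disk."""

  @abc.abstractmethod
  def Rename(self, old_name, new_name):
    """Rename a disk."""

  @abc.abstractmethod
  def Resize(self, name, new_size):
    """Resize a disk."""

  @abc.abstractmethod
  def Attach(self, name, node):
    """Make a disk available on the given node."""

  @abc.abstractmethod
  def Detach(self, name, node):
    """Remove a disk from the given node."""


class InMemoryDriver(StorageDriver):
  """Toy driver that only tracks state in a dict, for illustration."""

  def __init__(self):
    self.disks = {}

  def Create(self, name, size):
    if name in self.disks:
      raise ValueError("disk %s already exists" % name)
    self.disks[name] = {"size": size, "nodes": set()}

  def Remove(self, name):
    if self.disks[name]["nodes"]:
      raise ValueError("disk %s is still attached" % name)
    del self.disks[name]

  def Rename(self, old_name, new_name):
    if new_name in self.disks:
      raise ValueError("disk %s already exists" % new_name)
    self.disks[new_name] = self.disks.pop(old_name)

  def Resize(self, name, new_size):
    self.disks[name]["size"] = new_size

  def Attach(self, name, node):
    self.disks[name]["nodes"].add(node)

  def Detach(self, name, node):
    self.disks[name]["nodes"].discard(node)
```

A real driver would talk to the appliance (or to DRBD) in each method;
the point here is only that the caller sees the same six calls either
way.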

> +Introduction of shared file disk template
> +=========================================
> +
> +Basic shared file storage support can be implemented by creating a new
> +disk template based on the existing FileStorage class, with only minor
> +modifications in lib/bdev.py.
> +
> +.. vim: set textwidth=72 :
> diff --git a/doc/index.rst b/doc/index.rst
> index be64523..6425f31 100644
> --- a/doc/index.rst
> +++ b/doc/index.rst
> @@ -18,6 +18,7 @@ Contents:
>     design-2.1.rst
>     design-2.2.rst
>     design-cluster-merger.rst
> +   design-shared-storage.rst
>     locking.rst
>     hooks.rst
>     iallocator.rst
> -- 

LGTM overall.

thanks,
iustin
