Looks good! Couple small things:
On Fri, 15 Jun 2012, Josh Durgin wrote:
> Here's a draft of a patch to the docs outlining the rbd layering
> design. Is anything unclear? Any suggestions for improvement?
>
> Josh
>
> ============
> RBD Layering
> ============
>
> RBD layering refers to the creation of copy-on-write clones of block
> devices. This allows for fast image creation, for example to clone a
> golden master image of a virtual machine into a new instance. To
> simplify the semantics, you can only create a clone of a snapshot -
> snapshots are always read-only, so the rest of the image is
> unaffected, and there's no possibility of writing to them
> accidentally.
>
> Note: the terms `child` and `parent` below mean an rbd image created
> by cloning, and the rbd image snapshot a child was cloned from.
>
> Command line interface
> ----------------------
>
> Before cloning a snapshot, you must mark it as preserved, to prevent
> it from being deleted while child images refer to it:
> ::
>
> $ rbd preserve pool/image@snap
>
> Then you can perform the clone:
> ::
>
> $ rbd clone --parent pool/parent@snap pool2/child1
>
> You can create a clone with different object sizes from the parent:
> ::
>
> $ rbd clone --parent pool/parent@snap --order 25 pool2/child2
>
> To delete the parent, you must first mark it unpreserved, which checks
> that there are no children left:
> ::
>
> $ rbd unpreserve pool/image@snap
> Error unpreserving: child images rely on this image
> $ rbd list_children pool/image@snap
> pool2/child1
> pool2/child2
> $ rbd copyup pool2/child1
> $ rbd rm pool2/child2
> $ rbd unpreserve pool/image@snap
Are 'preserve' and 'unpreserve' the verbiage we want to use here? Not sure
I have a better suggestion, but 'preserve' is unusual.
> Then the snapshot can be deleted like normal:
> ::
>
> $ rbd snap rm pool/image@snap
>
> Implementation
> --------------
>
> Data Flow
> ^^^^^^^^^
>
> In the initial implementation, called 'trivial layering', there will
> be no tracking of which objects exist in a clone. A read that hits a
> non-existent object will attempt to read from the parent object, and
> this will continue recursively until an object exists or an image with
> no parent is found.
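
To check my reading of the read path, here is a rough sketch. The `Image` class is a hypothetical in-memory stand-in, not the librbd API; it assumes parent and child use the same object size, and that a read which finds no object in any layer returns zeros.

```python
from typing import Dict, Optional

OBJECT_SIZE = 4  # tiny object size so the example stays readable


class Image:
    """Hypothetical in-memory stand-in for an rbd image (not the librbd API)."""

    def __init__(self, parent: Optional["Image"] = None) -> None:
        self.parent = parent
        self.objects: Dict[int, bytes] = {}  # object number -> contents

    def read_object(self, objno: int) -> bytes:
        image: Optional[Image] = self
        # Walk up the parent chain until some layer has the object.
        while image is not None:
            if objno in image.objects:
                return image.objects[objno]
            image = image.parent
        # Assumption: no layer has the object, so the range reads as zeros.
        return bytes(OBJECT_SIZE)


base = Image()
base.objects[0] = b"base"
child = Image(parent=base)
grandchild = Image(parent=child)
grandchild.objects[1] = b"mine"
```

Here `grandchild.read_object(0)` falls through two layers to the data in `base`, while `grandchild.read_object(1)` is served locally.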
>
> Before a write is performed, the object is checked for existence. If
> it doesn't exist, a copy-up operation is performed, which means
> reading the relevant range of data from the parent image and writing
> it (plus the original write) to the child image. To prevent races with
> multiple writes trying to copy-up the same object, this copy-up
> operation will include an atomic create. If the atomic create fails,
> the original write is done instead. This copy-up operation is
> implemented as a class method so that extra metadata can be stored by
> it in the future.
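
A minimal sketch of the copy-up semantics as I understand them. `ObjectStore` and its `copy_up` signature are made up for illustration, and the model is single-threaded; it relies on the assumption that the real class method would run atomically on the object.

```python
from typing import Dict


class ObjectStore:
    """Hypothetical stand-in for the objects of one child image."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytearray] = {}

    def copy_up(self, name: str, parent_data: bytes,
                child_data: bytes, offset: int) -> bool:
        """Apply a write, seeding the object from the parent if it is new.

        Returns True if this call created the object.
        """
        created = name not in self.objects
        if created:
            # Exclusive create succeeded: start from the parent's data.
            self.objects[name] = bytearray(parent_data)
        # Either way, the original write is applied on top.
        obj = self.objects[name]
        end = offset + len(child_data)
        if len(obj) < end:
            obj.extend(bytes(end - len(obj)))
        obj[offset:end] = child_data
        return created


store = ObjectStore()
# Two writers both saw the object as missing and both read the parent.
first = store.copy_up("obj", b"PARENT__", b"aa", 0)
second = store.copy_up("obj", b"PARENT__", b"bb", 6)
```

The second writer loses the create race, so its stale copy of the parent data is dropped and only its own write lands; the first writer's bytes survive.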
>
> A future optimization could be storing a bitmap of which objects
> actually exist in a child. This would obviate the check for existence
> before each write, and let reads go directly to the parent if needed.
>
> Parent/Child relationships
> ^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Children store a reference to their parent in their header, as a tuple
> of (pool id, image id, snapshot id). This is enough information to
> open the parent and read from it.
>
> In addition to knowing which parent a given image has, we want to be
> able to tell if a preserved image still has children. This is
> accomplished with a new per-pool object, `rbd_children`, which maps
> (parent pool, parent id, parent snapshot id) to a list of child
> image ids. This is stored in the same pool as the child image
> because the client creating a clone already has read/write access to
> everything in this pool. This lets a client with read-only access to
> one pool clone a snapshot from that pool into a pool they have full
> access to. It increases the cost of unpreserving an image, since this
> needs to check for children in every pool, but this is a rare
> operation. It would likely only be done before removing old images,
> which is already much more expensive because it involves deleting
> every data object in the image.
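
Roughly how I picture the per-pool `rbd_children` lookup. The `Pool` class and `find_children` helper are hypothetical names for this sketch; only the mapping from (parent pool, parent id, parent snapshot id) to child image ids comes from the draft.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (parent pool id, parent image id, parent snapshot id)
ParentSpec = Tuple[int, str, int]


class Pool:
    """Hypothetical pool holding one rbd_children mapping."""

    def __init__(self, pool_id: int) -> None:
        self.pool_id = pool_id
        self.rbd_children: Dict[ParentSpec, Set[str]] = defaultdict(set)

    def add_child(self, parent: ParentSpec, image_id: str) -> None:
        self.rbd_children[parent].add(image_id)

    def remove_child(self, parent: ParentSpec, image_id: str) -> None:
        self.rbd_children[parent].discard(image_id)
        if not self.rbd_children[parent]:
            del self.rbd_children[parent]

    def get_children(self, parent: ParentSpec) -> List[str]:
        return sorted(self.rbd_children.get(parent, ()))


def find_children(pools: List[Pool], parent: ParentSpec) -> List[Tuple[int, str]]:
    """Unpreserve has to ask every pool, since children live anywhere."""
    found = []
    for pool in pools:
        for image_id in pool.get_children(parent):
            found.append((pool.pool_id, image_id))
    return found


pool1, pool2 = Pool(1), Pool(2)
snap = (1, "parent-id", 4)
pool2.add_child(snap, "child1")
pool2.add_child(snap, "child2")
```

The clone only ever writes to `pool2`, which is the point of keeping the mapping on the child side; the cost shows up as the loop over all pools in `find_children`.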
>
> Preservation
> ^^^^^^^^^^^^
>
> Internally, preservation_state is a field in the header object that
> can be in one of three states: "preserved", "unpreserved", and
> "unpreserving". The first two are set as the result of "rbd
> preserve/unpreserve". The "unpreserving" state is set while the "rbd
> unpreserve" command checks for any child images. Only snapshots in the
> "preserved" state may be cloned, so the "unpreserving" state prevents
> a race like:
>
> 1. A: walk through all pools, look for clones, find none
> 2. B: create a clone
> 3. A: unpreserve parent
> 4. A: rbd snap rm pool/parent@snap
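
A small sketch of the state machine, to make sure the race is really closed. The `Snapshot` class and the `during_scan` hook are invented for the example, and I am assuming a failed unpreserve puts the snapshot back into "preserved"; the draft does not say.

```python
from enum import Enum
from typing import Callable, Optional, Set


class PreservationState(Enum):
    UNPRESERVED = "unpreserved"
    PRESERVED = "preserved"
    UNPRESERVING = "unpreserving"


class Snapshot:
    """Hypothetical model of a snapshot's preservation_state."""

    def __init__(self) -> None:
        self.state = PreservationState.UNPRESERVED
        self.children: Set[str] = set()

    def preserve(self) -> None:
        self.state = PreservationState.PRESERVED

    def clone(self, child_id: str) -> bool:
        # Only snapshots in the "preserved" state may be cloned.
        if self.state is not PreservationState.PRESERVED:
            return False
        self.children.add(child_id)
        return True

    def unpreserve(self, during_scan: Optional[Callable[[], None]] = None) -> bool:
        # Block new clones *before* looking for existing ones.
        self.state = PreservationState.UNPRESERVING
        if during_scan is not None:
            during_scan()  # stands in for client B racing with the scan
        if self.children:
            # Assumption: fall back to "preserved" when children remain.
            self.state = PreservationState.PRESERVED
            return False
        self.state = PreservationState.UNPRESERVED
        return True


snap = Snapshot()
snap.preserve()
late_clone_results = []
ok = snap.unpreserve(during_scan=lambda: late_clone_results.append(snap.clone("late")))
```

Client B's clone attempt in the middle of the scan is refused, so step 2 of the race above cannot happen once A has entered "unpreserving".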
>
> Resizing
> ^^^^^^^^
>
> To support resizing of layered images, we need to keep track of the
> minimum size the image ever was, so that if a child image is shrunk
> and then expanded, the re-expanded space is treated as unused instead
> of being read from the parent image. Since this can change over time,
> we need to store this for each snapshot as well.
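
As a sanity check on the resize rule, a sketch with hypothetical names (`ChildImage`, `reads_from_parent`). It only answers where a read goes when the child has no data of its own at that offset.

```python
from typing import Dict, Optional


class ChildImage:
    """Hypothetical model tracking the smallest size a child has had."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.min_size = size
        self.snap_min_size: Dict[int, int] = {}  # snap id -> min_size at snap time

    def resize(self, new_size: int) -> None:
        self.size = new_size
        # min_size only ever shrinks; growing the image never restores it.
        self.min_size = min(self.min_size, new_size)

    def snapshot(self, snap_id: int) -> None:
        # Each snapshot remembers the min_size in effect when it was taken.
        self.snap_min_size[snap_id] = self.min_size

    def reads_from_parent(self, offset: int, snap_id: Optional[int] = None) -> bool:
        limit = self.min_size if snap_id is None else self.snap_min_size[snap_id]
        return offset < limit


img = ChildImage(size=100)
img.snapshot(1)   # taken before any resize
img.resize(40)    # shrink
img.resize(100)   # grow back
```

After the shrink and re-expand, offset 50 of the head is treated as unused, but the same offset read through snapshot 1 still comes from the parent, which is why the value has to be stored per snapshot.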
>
> Renaming
> ^^^^^^^^
>
> Currently the rbd header object (that stores all the metadata about an
> image) is named after the name of the image. This makes renaming
> disrupt clients who have the image open (such as children reading from
> a parent image). To avoid this, we can name the header object by the
> id of the image, which does not change. That is, the name of the
> header object could be `rbd_header.$id`, where $id is a unique id for
> the image in the pool.
>
> When a client opens an image, all it knows is the name. There is
> already a per-pool `rbd_directory` object that maps image names to
> ids, but if we relied on it to get the id, we could not open any
> images in that pool if that single object was unavailable. To avoid
> this dependency, we can store the id of an image in an object called
> `rbd_id.$image_name`, where $image_name is the name of the image. The
> per-pool `rbd_directory` object is still useful for listing all images
> in a pool, however.
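
A sketch of the name-to-id indirection, assuming a flat dict of object names per pool. The `Pool` class and its methods are hypothetical, and updating `rbd_directory` is left out to keep it short.

```python
from typing import Dict


class Pool:
    """Hypothetical pool: a flat namespace of object name -> value."""

    def __init__(self) -> None:
        self.objects: Dict[str, object] = {}

    def create_image(self, name: str, image_id: str) -> None:
        id_obj = "rbd_id." + name
        if id_obj in self.objects:
            raise FileExistsError(name)  # mirrors create_id's -EEXIST
        self.objects[id_obj] = image_id
        self.objects["rbd_header." + image_id] = {"size": 0}

    def header_name(self, name: str) -> str:
        # Opening by name needs only the one small rbd_id object.
        image_id = self.objects["rbd_id." + name]
        return "rbd_header." + str(image_id)

    def rename(self, old: str, new: str) -> None:
        if "rbd_id." + new in self.objects:
            raise FileExistsError(new)
        # Only the id object moves; the header keeps its name.
        self.objects["rbd_id." + new] = self.objects.pop("rbd_id." + old)


pool = Pool()
pool.create_image("vm", "abc123")
before = pool.header_name("vm")
pool.rename("vm", "vm-old")
after = pool.header_name("vm-old")
```

A client that already holds `rbd_header.abc123` open is not disturbed by the rename, since that object is never touched.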
>
> Header changes
> --------------
>
> The header needs a few new fields:
>
> * uint64_t parent_pool_id
> * string parent_image_id
> * uint64_t parent_snap_id
> * uint64_t min_size (smallest size the image ever was in bytes)
> * bool has_parent
>
> Note that all the image ids are strings instead of uint64_t to let us
> easily switch to uuids in the future.
>
> cls_rbd
> ^^^^^^^
>
> Some new methods are needed:
> ::
>
> /***************** methods on the rbd header *********************/
> /**
> * Sets the parent, min_size, and has_parent keys.
> * Fails if any of these keys exist, since that means the image
> * already has a parent.
> */
> set_parent(uint64_t pool_id, string image_id, uint64_t snap_id)
set_parent(uint64_t pool_id, string image_id, uint64_t snap_id,
uint64_t parent_size)
The actual overlap the image stores will be the min of parent_size and its
own size.
>
> /**
> * Returns the parent pool id, image id, and snap id, or -ENOENT
and overlap
> * if has_parent is false
> */
> get_parent(uint64_t snapid)
>
> /**
> * Set has_parent to false.
> */
> remove_parent() // after all parent data is copied to the child
>
> /*************** methods on the rbd_children object *****************/
>
> add_child(uint64_t parent_pool_id, string parent_image_id,
> uint64_t parent_snap_id, string image_id);
> remove_child(uint64_t parent_pool_id, string parent_image_id,
> uint64_t parent_snap_id, string image_id);
> /**
> * List image ids of a given parent
> */
> get_children(uint64_t parent_pool_id, string parent_image_id,
> uint64_t parent_snap_id, uint64_t max_return,
> string start);
> /**
> * List parent images
> */
> get_parents(uint64_t max_return, uint64_t start_pool_id,
> string start_image_id, uint64_t start_snap_id);
>
>
> /************ methods on the rbd_id.$image_name object **************/
> /**
> * Create the object and set the id. Fail and return -EEXIST if
> * the object exists.
> */
> create_id(string id)
> get_id()
>
> /***************** methods on the rbd_data objects ******************/
> /**
> * Create an object with parent_data as its contents,
> * then write child_data to it. If the exclusive create fails,
> * just write the child_data.
> */
> copy_up(char *parent_data, uint64_t parent_data_len,
> char *child_data, uint64_t child_data_offset,
> uint64_t child_data_length)
>
> One existing method will change if the image supports
> layering:
> ::
>
> snapshot_add - stores current min_size and has_parent with
> other snapshot metadata (images that don't have
> layering enabled aren't affected)
Also
set_size - will adjust the parent overlap down as needed.
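
Roughly what I have in mind for the overlap bookkeeping. The `Header` class is a hypothetical model of the header object, not the cls_rbd code, and the `overlap` field name is my suggestion rather than something in the draft.

```python
from typing import Optional, Tuple


class Header:
    """Hypothetical model of an image header with a parent overlap."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.parent: Optional[Tuple[int, str, int]] = None
        self.overlap = 0

    def set_parent(self, pool_id: int, image_id: str, snap_id: int,
                   parent_size: int) -> None:
        if self.parent is not None:
            raise FileExistsError("image already has a parent")
        self.parent = (pool_id, image_id, snap_id)
        # The child can never see more of the parent than fits in itself.
        self.overlap = min(parent_size, self.size)

    def set_size(self, new_size: int) -> None:
        self.size = new_size
        if self.parent is not None:
            # Shrinking lowers the overlap; growing never raises it again.
            self.overlap = min(self.overlap, new_size)


hdr = Header(size=50)
hdr.set_parent(1, "parent-id", 4, parent_size=100)
overlap_after_clone = hdr.overlap
hdr.set_size(20)
overlap_after_shrink = hdr.overlap
hdr.set_size(200)
overlap_after_grow = hdr.overlap
```

Reads past `overlap` would never go to the parent, which should cover the shrink-then-expand case without a separate size field.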
>
> librbd
> ^^^^^^
>
> Opening a child image opens its parent (and this will continue
> recursively as needed). This means that an ImageCtx will contain a
> pointer to the parent image context. Differing object sizes won't
> matter, since reading from the parent will go through the parent
> image context.
>
> Discard will need to change for layered images so that it only
> truncates objects, and does not remove them. If we removed objects, we
> could not tell if we needed to read them from the parent.
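
To illustrate why discard has to truncate, a sketch with a hypothetical `Image` class. It assumes a truncated (zero-length) object reads back as zeros, and that truncating an object the child does not yet have creates it empty.

```python
from typing import Dict, Optional

OBJECT_SIZE = 4  # tiny object size so the example stays readable


class Image:
    """Hypothetical in-memory image with an optional parent."""

    def __init__(self, parent: Optional["Image"] = None) -> None:
        self.parent = parent
        self.objects: Dict[int, bytes] = {}

    def discard_object(self, objno: int) -> None:
        if self.parent is not None:
            # Layered: keep a zero-length object so reads stop here.
            self.objects[objno] = b""
        else:
            # Not layered: removing the object is safe.
            self.objects.pop(objno, None)

    def read_object(self, objno: int) -> bytes:
        if objno in self.objects:
            # A short or empty object is padded with zeros.
            return self.objects[objno].ljust(OBJECT_SIZE, b"\0")
        if self.parent is not None:
            return self.parent.read_object(objno)
        return bytes(OBJECT_SIZE)


parent = Image()
parent.objects[0] = b"OLD!"
child = Image(parent=parent)
child.objects[0] = b"new!"
child.discard_object(0)
after_discard = child.read_object(0)

# What would happen if discard removed the object instead:
del child.objects[0]
after_removal = child.read_object(0)
```

With truncation the discarded range reads as zeros; with removal the parent's old data would show through again.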
>
> A new clone method will be added, which takes the same arguments as
> create except size (size of the parent image is used).
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>