rbd layering

Sage Weil Fri, 25 Feb 2011 14:25:34 -0800

I wanted to follow up on the thread a couple weeks back and summarize 
where we're currently at.  The goal is to be flexible, so that we don't 
impose any performance limits for features we don't use.


The use cases are:

 - (fast) image creation from gold master (probably followed by growing 
the image/fs)
 - image migration (create child in new location; copyup old data 
asynchronously)


Here are the pieces we currently have:

(image == rbd image
 object == one object in the image, normally 4MB)

- Parent image pointer

Each image has an optional parent pointer that names a parent image.  The 
parent must be part of the same cluster, but can be in a different pool.  
It can be larger or smaller than the current image. 

It is assumed the parent is read-only.  I don't think anything sane can 
come out of doing a COW overlay over something that is changing.

- Object Bitmap

Each object in an image may have an OPTIONAL bitmap that represents 
transparency.  If the bit is set, then it is defined by this image layer 
(it can be either object data or, if the object has a hole, zeros).  If 
the bit is not set, then the content is defined by the parent image.  The 
resolution can be sector, 4KB block, or anything else.  If it is larger 
than the smallest write unit, a write may require copy-up from the lower 
layer, so using the block size is recommended.

If the object bitmap does not exist, we assume the object is NOT 
transparent (i.e. bitmap is fully colored).  That gives us compatibility 
with old images, and lets us drop the bitmap once it gets fully colored.  
Only new images that support layering will create/use it.  
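
A rough sketch of the object bitmap semantics described above.  The class 
and function names are made up for illustration and the 4KB resolution is 
just an assumed setting; none of this is existing rbd code.

```python
from typing import Optional

BLOCK_SIZE = 4096  # assumed bitmap resolution, in bytes


class ObjectBitmap:
    """Hypothetical per-object transparency bitmap: one bit per block."""

    def __init__(self, object_size: int, block_size: int = BLOCK_SIZE):
        self.block_size = block_size
        nblocks = (object_size + block_size - 1) // block_size
        # False = transparent (content is defined by the parent image)
        self.bits = [False] * nblocks

    def color(self, offset: int, length: int) -> None:
        """Mark every block touched by [offset, offset + length) as defined
        by this layer.  Partially covered blocks are assumed to have been
        completed by a copy-up before they are colored."""
        if length <= 0:
            return
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        for i in range(first, last + 1):
            self.bits[i] = True

    def is_fully_colored(self) -> bool:
        # Once this is true the bitmap can be dropped.
        return all(self.bits)


def defined_here(bitmap: Optional[ObjectBitmap], block: int) -> bool:
    # A missing bitmap means the object is NOT transparent.
    return True if bitmap is None else bitmap.bits[block]


def needs_copyup(offset: int, length: int, block_size: int) -> bool:
    """True when a write only partially covers some bitmap block, so the
    rest of that block may have to be copied up from the lower layer."""
    return offset % block_size != 0 or (offset + length) % block_size != 0
```

With the resolution equal to the smallest write unit, needs_copyup() is 
never true, which is why the block size is the recommended setting.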

- Image bitmap

Each image may have an OPTIONAL bitmap that indicates which image objects 
(may) exist.  On write, a bit is set prior to creating each object.  
On read, if a bitmap exists but the bit for an object is not set, we can 
go directly to the parent image.  If the bitmap does not exist, reads must 
always check for the child object before falling through to the parent 
image.  Writes in the no-bitmap case write to the child object.  The 
bitmap size need not match the image size; it may, e.g., match the size of 
a smaller parent image.

Having two bitmaps is a design tradeoff.  We could use a single sector/block 
resolution bitmap for the whole image, but it would increase memory use, 
and would require more "update image bitmap, wait, then write to object" 
cycles.  Having a per-object bitmap means we can atomically update the 
object bitmap for free when we do the write, and minimize the image bitmap 
updates to the first time each object is touched.
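
To put rough numbers on the memory side of that tradeoff, here is a small 
calculation.  The 1 TiB image size is an assumption picked for 
illustration; the 4MB object size and 4KB resolution are the values 
mentioned above.

```python
IMAGE_SIZE = 1 << 40    # 1 TiB image (assumed, for illustration only)
OBJECT_SIZE = 4 << 20   # 4 MB objects
BLOCK_SIZE = 4 << 10    # 4 KB bitmap resolution


def bitmap_bytes(covered_bytes: int, unit_bytes: int) -> int:
    """Bytes needed for a bitmap with one bit per unit (rounded up)."""
    bits = -(-covered_bytes // unit_bytes)
    return -(-bits // 8)


flat = bitmap_bytes(IMAGE_SIZE, BLOCK_SIZE)        # one bitmap, block resolution
image_bm = bitmap_bytes(IMAGE_SIZE, OBJECT_SIZE)   # image bitmap, one bit per object
object_bm = bitmap_bytes(OBJECT_SIZE, BLOCK_SIZE)  # one object's bitmap

print(f"flat block bitmap: {flat >> 20} MiB")
print(f"image bitmap:      {image_bm >> 10} KiB")
print(f"object bitmap:     {object_bm} bytes per object")
```

The per-object bitmaps add up to the same total as the flat bitmap if 
every object has one, but only the small image bitmap has to be held for 
the whole image; each object bitmap is only needed when that object is 
touched.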

On read:
        if there is an image bitmap
                if bit is set
                        read child object
                        if there's an object bitmap that indicates transparency
                                read holes from parent object
                else
                        read parent object (*)
        else
                read child object
                if there is no child object, or bitmap indicates transparency
                        read holes from parent object (*)
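
The read path can be sketched in runnable form as follows.  This is a toy 
in-memory model, not rbd code: the Image class, the 4-block objects with 
one byte per block, and the choice to fall through to the parent when the 
image bitmap bit is set but the child object is missing are all 
assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

OBJ_BLOCKS = 4             # toy object: 4 blocks of one byte each
ZEROS = bytes(OBJ_BLOCKS)  # what a hole reads as


@dataclass
class Image:
    objects: Dict[int, bytes] = field(default_factory=dict)
    # objno -> per-block bits; a missing entry means fully colored
    object_bitmaps: Dict[int, List[bool]] = field(default_factory=dict)
    image_bitmap: Optional[Set[int]] = None  # None = no image bitmap
    parent: Optional["Image"] = None


def read_object(img: Optional[Image], objno: int) -> bytes:
    if img is None:
        return ZEROS  # ran out of layers: hole
    if img.image_bitmap is not None and objno not in img.image_bitmap:
        # Bit not set: go directly to the parent, no child lookup.
        return read_object(img.parent, objno)
    child = img.objects.get(objno)
    if child is None:
        # No child object (assumed to mean "fall through" in both cases).
        return read_object(img.parent, objno)
    bitmap = img.object_bitmaps.get(objno)
    if bitmap is None or all(bitmap):
        return child  # nothing transparent in this object
    # Fill the transparent blocks from the parent object.
    parent_data = read_object(img.parent, objno)
    return bytes(c if bit else p
                 for c, p, bit in zip(child, parent_data, bitmap))
```

For example, a child that has written only the first two blocks of object 
0 reads back its own two blocks followed by the parent's last two.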

On write:
        if there is an image bitmap and bit is not set
                color image bitmap bit for this object
        if object bitmaps are enabled
                write to object
                color object bits too
        else
                if we are not writing the entire object    (*)
                        read unwritten parts from parent   (*)
                write our data (+ copyup data from parent)

(*) These steps can be skipped if the parent image has holes here.  We 
would know that if the parent image bitmap bits are not set, or if we are 
past the end of the parent image size.
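
A matching sketch of the write path, again as a toy in-memory model with 
made-up names.  It assumes a single flat, read-only parent and 4-block 
objects with one byte per block.  It also assumes that, without object 
bitmaps, copy-up is only needed the first time the child object is 
created; the pseudocode above does not spell that out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

OBJ_BLOCKS = 4  # toy object: 4 blocks of one byte each


@dataclass
class Image:
    objects: Dict[int, bytearray] = field(default_factory=dict)
    # None = object bitmaps disabled; missing entry = fully colored
    object_bitmaps: Optional[Dict[int, List[bool]]] = None
    image_bitmap: Optional[Set[int]] = None  # None = no image bitmap
    parent: Optional["Image"] = None


def parent_has_hole(img: Image, objno: int) -> bool:
    """The (*) shortcut: nothing to copy up if the parent has a hole."""
    p = img.parent
    if p is None:
        return True
    if p.image_bitmap is not None:
        return objno not in p.image_bitmap
    return objno not in p.objects


def write_blocks(img: Image, objno: int, start: int, data: bytes) -> None:
    end = start + len(data)
    assert 0 <= start and end <= OBJ_BLOCKS
    if img.image_bitmap is not None:
        img.image_bitmap.add(objno)  # color before creating the object
    exists = objno in img.objects
    obj = img.objects.setdefault(objno, bytearray(OBJ_BLOCKS))
    if img.object_bitmaps is not None:
        if not exists:
            img.object_bitmaps[objno] = [False] * OBJ_BLOCKS
        obj[start:end] = data
        bits = img.object_bitmaps.get(objno)
        if bits is not None:
            for i in range(start, end):
                bits[i] = True  # color the blocks we just wrote
            if all(bits):
                del img.object_bitmaps[objno]  # fully colored: drop it
    else:
        partial = len(data) < OBJ_BLOCKS
        if not exists and partial and not parent_has_hole(img, objno):
            # Copy up the parent object, then overlay our data.
            obj[:] = img.parent.objects.get(objno, bytes(OBJ_BLOCKS))
        obj[start:end] = data
```

With object bitmaps the write never touches the parent; without them a 
partial first write pays for a parent read unless the parent has a hole.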

On trim/discard:
        if there is an image bitmap
                if bit is not set
                        set image bitmap bit            
        truncate or zero object
        if object bitmap
                color appropriate bits
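
A small sketch of the discard steps for the case where object bitmaps are 
enabled, using plain dicts as a stand-in for real storage (the function 
name and the toy 4-block objects are assumptions).  The point is that 
discarded blocks get colored, so they read back as zeros instead of 
falling through to the parent.

```python
from typing import Dict, List, Optional, Set

OBJ_BLOCKS = 4  # toy object: 4 blocks of one byte each


def discard_blocks(objects: Dict[int, bytearray],
                   object_bitmaps: Dict[int, List[bool]],
                   image_bitmap: Optional[Set[int]],
                   objno: int, start: int, end: int) -> None:
    """Zero blocks [start, end) of one object and color them."""
    assert 0 <= start <= end <= OBJ_BLOCKS
    if image_bitmap is not None:
        image_bitmap.add(objno)  # set image bitmap bit if not set
    exists = objno in objects
    obj = objects.setdefault(objno, bytearray(OBJ_BLOCKS))
    obj[start:end] = bytes(end - start)  # zero the range
    if not exists:
        # New object: everything starts out transparent.
        object_bitmaps[objno] = [False] * OBJ_BLOCKS
    bits = object_bitmaps.get(objno)
    if bits is not None:  # a missing bitmap means fully colored already
        for i in range(start, end):
            bits[i] = True
```

Without object bitmaps, a partial discard of an object that does not yet 
exist in the child would presumably need a copy-up first, the same as a 
partial write; that case is not shown here.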


Also: the image bitmap could be created after the fact.  I.e. once we 
decide to use something as a gold image/parent, we would generate the 
image bitmap (just check which objects exist) so that overlays would 
operate more efficiently.  We'll probably want a read-only flag in the 
image header too to help keep admins from shooting themselves in the foot.
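
Generating the image bitmap after the fact could look roughly like this.  
How the list of existing object numbers is obtained is left out; the 
function names and the bit-packed layout are assumptions for the example.

```python
from typing import Iterable


def build_image_bitmap(existing_objnos: Iterable[int],
                       num_objects: int) -> bytearray:
    """One bit per object in the image; set if the object exists."""
    bitmap = bytearray((num_objects + 7) // 8)
    for objno in existing_objnos:
        if 0 <= objno < num_objects:
            bitmap[objno // 8] |= 1 << (objno % 8)
    return bitmap


def may_exist(bitmap: bytearray, objno: int) -> bool:
    """For a read-only parent, anything past the end of the bitmap is
    treated as a hole."""
    byte = objno // 8
    return byte < len(bitmap) and bool(bitmap[byte] & (1 << (objno % 8)))
```

An overlay can then call may_exist() on the parent's bitmap to skip parent 
reads and copy-ups for objects the gold image never wrote.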


- OSD copyup/merge operation

The last piece would be an OSD method to atomically copy a parent object 
up to the overlay image.  The goal is for the copyup to be a background, 
maybe low-priority process.  We would read the parent object and submit 
it to the child object; the OSD would then write only the parts that 
correspond to unset bits in the object bitmap, and color in all bits.
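
The merge step itself is simple; a sketch with one byte per block is 
below.  The function name is made up, and in the real thing this would 
run on the OSD as a single atomic operation on the child object, which a 
plain Python function does not capture.

```python
from typing import List


def copyup_merge(child: bytearray, bitmap: List[bool],
                 parent_data: bytes) -> None:
    """Fill the blocks the child has not defined with parent data, then
    color every bit.  Blocks that are already colored are left alone, so
    a concurrent child write is never overwritten by older parent data."""
    assert len(child) == len(bitmap) == len(parent_data)
    for i, colored in enumerate(bitmap):
        if not colored:
            child[i] = parent_data[i]
            bitmap[i] = True
```

After the merge the bitmap is fully colored, so it can be dropped and the 
object no longer depends on the parent.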


That's the current design.  Thoughts on or errors with the above?

sage
