Re: [RFC]About ImageIndex

Haomai Wang Wed, 13 Aug 2014 05:39:29 -0700

Thanks! When this bp arrived finished, it's a good time to implement
the functional checks.


On Wed, Aug 13, 2014 at 7:32 PM, Dhiraj Kamble
<[email protected]> wrote:
> Hi Josh, Haomai,
>
> I am currently working on a CLI to detect inconsistencies in rbd and have 
> some functional checks working. We can leverage it to detect inconsistency 
> with ImageIndex/header.
> Let me know if I can help you with this.
>
>
> Regards,
> Dhiraj
>
>
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Josh Durgin
> Sent: Wednesday, August 13, 2014 7:45 AM
> To: Haomai Wang; Sage Weil
> Cc: [email protected]
> Subject: Re: [RFC]About ImageIndex
>
> On 08/11/2014 07:50 PM, Haomai Wang wrote:
>> Hi Sage, Josh:
>>
>> ImageIndex is aimed to hold each object's location info which avoid
>> extra checking for none-existing object. It's only used when image
>> flags exists LIBRBD_CREATE_NONSHARED. Otherwise, ImageIndex will
>> become gawp and has no effect.
>
> I like the general structure of the code, and it's great to hear you've been 
> testing it with test_librbd_fsx. Some thoughts on the design:
>
>> Each object has three state:
>> 1. UNKNOWN: default value, it will follow origin path 2. LOCAL: imply
>> this object is local, don't need to lookup parent image 3. PARENT:
>> imply this object is in the parent image, don't need to read from
>> local image
>
> What about calling these unknown, exists, and nonexistent instead?
> This information can be used for non-clone use cases too. Internally there's 
> another ImageCtx for each parent, which can have its own index, so I don't 
> think there's a need to keep track of any objects from different images in 
> one ImageCtx.
>
>> Note: ImageIndex isn't full sync to real data all the time. Because
>> the transformation {"unknown" -> "local", "unknown" -> "parent"} are safe.
>> So We only need to handle with the exception when ImageIndex implies
>> this object is "parent" but the real data is "local". There exists
>> three methods to solve it:
>> 1. flush `state_map` every time when "parent" -> "local" happened 2.
>> mark all objects from "parent" state to "unknown“ state when loading
>> image index(not including snapshot which has frozen index).
>
> At the end of the CDS session I was thinking that implementing it strictly 
> would be better, so that you could always trust the on-disk copy of the 
> index. In particular we were worried about the complexity of handling a 
> non-strict index during things like live migration, and being able to use it 
> to speed up management operations more reliably.
>
> Since creating new objects (or discarding entire existing objects) is pretty 
> rare, it seems worth the extra I/Os to update the index synchronously in 
> these cases. To be fully safe I think any write to a new object or discard of 
> a full object would need to change index state to unknown, do the 
> write/discard, and finally change the index state to exists/nonexistent. 
> There might be some ways to optimize that though.
>
>> Here choose to implement method 2. This method only allow 2 read ops
>> in one read request at max and without overhead. Usually, librbd will
>> open image for many days(months) for normal use case such as VM usage.
>> So the image index will be warmed up and became smart when processing
>> ops.
>>
>> Except image state changed problem, another concern is size. Image
>> index only permit single client write, but resize/flatten/rollback ops
>> are allowed to happen concurrently. For simply, now these ops don't
>> change and save states into rados. Resize op will affect "size" and
>> current write client will be notified.
>>
>> Below listing object state change scenes:
>> 1. When clone from image, it will mark all objects as "parent"
>
> We can do this for any new image (mark all objects non-existent initially), 
> not just clones.
>
>> 2. When creating snapshot, image index will be freeze and save it as
>> the index of the snapshot. All "parent" state objects will be marked
>> as "unknown" for safe
>
> One implementation detail here - I think it'd be good to store snapshot 
> indexes in separate objects so that having many snapshots does not cause the 
> header object to become too large.
>
> To recover from a partial snapshot, where the header was updated but the 
> index wasn't saved, it would be useful to have a rbd cli command to repair 
> (or optionally just check) the index for a snapshot.
>
>> 3. When write(including modified op) a object, it will mark this
>> object as "local"
>> 4. When reading object, the current image object will be always read
>> in spite of the state. And the parent image's object will be checked
>> and trust the state. If exists parent image but read local object
>> successfully, the local object will be marked "local"
>>
>> The principle is that only exists one client can operate and save the
>> ImageIndex changes into rados. Now the challenge is that how to decide
>> who is the owner of ImageIndex. Like VM usage, qemu will open image to
>> do IO ops and externally user also can do create snapshot for the
>> opened image(rbd snap create ...). So it will exists two client modify
>> the same image, one can be regarded as "owner client" another is
>> "management client". "management client" is expected to not change the
>> state of object.
>>
>> I only come up with a idea that user need to call "set_owner_image"
>> when client want to become "owner client" and this client can operate
>> ImageIndex.
>
> This is a great idea for all header updates. I've been thinking about the 
> same thing for rbd mirroring, and Sage started working on some watch/notify 
> changes that will help make this robust:
>
> http://pad.ceph.com/p/watch-notify
>
>> PR(https://github.com/ceph/ceph/pull/2212)
>> I have passed test_librbd and test_librbd_fsx tests. The IO logic
>> seemed worked as expected.
>
> I haven't had time to look at the code in a lot of detail yet, but it'd be 
> nice to use just 2 bits per object on disk. I'm on vacation the next couple 
> weeks, but Sage may have more feedback.
>
> Josh
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the 
> body of a message to [email protected] More majordomo info at  
> http://vger.kernel.org/majordomo-info.html
>
> ________________________________
>
> PLEASE NOTE: The information contained in this electronic mail message is 
> intended only for the use of the designated recipient(s) named above. If the 
> reader of this message is not the intended recipient, you are hereby notified 
> that you have received this message in error and that any review, 
> dissemination, distribution, or copying of this message is strictly 
> prohibited. If you have received this communication in error, please notify 
> the sender by telephone or e-mail (as shown above) immediately and destroy 
> any and all copies of this message in your possession (whether hard copies or 
> electronically stored copies).
>



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC]About ImageIndex

Reply via email to