On 07/30/2012 06:05 PM, Josh Durgin wrote:
> On 07/26/2012 12:12 PM, Alex Elder wrote:
>> There is a bug in the way __rbd_init_snaps_header() handles newly-
>> discovered snapshots, in the unusual case where the new snapshot id
>> is less than an existing rbd snapshot id.
>>
>>
>> When a new snapshot for an rbd image is created it is assigned a new
>> snapshot id. The client requests the id from a monitor, and later
>> updates the osds with information about the snapshot. The sequence
>> of snapshot ids is monotonically increasing, which guarantees each
>> is unique. And typically all new snapshots will have higher ids
>> than any others a client has encountered.
>>
>> Because of the delay between getting assigned a snapshot id and
>> recording the snapshot, there is a possible race between two clients
>> requesting new snapshots on the same rbd image. It's possible in
>> this case for the client granted a higher id to be the first to
>> update the set of snapshots for the rbd image. This leaves a sort
>> of gap in the set of snapshots until the other client's update gets
>> recorded.
>>
>> This is an unlikely scenario, but it appears to be possible. I'm
>> not sure whether we even support multiple concurrent clients for the
>> same rbd image. However there is code in place to handle this case
>> and it's wrong, so it needs to be fixed.
>
> To fully handle this race, we'd need to sort the snapshot context and
> set its seq to the maximum id, instead of the lower id set by the race.
I had assumed both of these were true. In fact, I was under the
impression that the value of "seq" was precisely that--the maximum
snapshot id issued for this rbd image by the server.
> If we don't do this, the race makes the image unwritable since the
> non-sorted snapshot context would be rejected by the osds.
Looking at the kernel client only (which was how I found this bug,
via inspection), this code is still buggy and could cause a kernel
fault if it were to be hit.
The risk is low right now, though. So I'll just set this patch
aside now until we can get a chance to discuss the best course
of action.
-Alex
> This would complicate things a bunch, so I'm not sure if we should
> try to handle this bug on the client side. I've got a fix to prevent
> it from happening in the first place, and as you said, it's rare,
> and requires you to be taking snapshots of the same image from multiple
> clients at once, which is generally not a good idea (that is, writing
> to the image from multiple clients).
>
>> The job of __rbd_init_snaps_header() is to compare an rbd's existing
>> list of snapshots to a new set, and remove any existing snapshots
>> that are not present in the new set and create any new ones not
>> present in the existing set.
>>
>> If any new snapshots exist after the entire list of existing ones
>> has been examined, all the new ones that remain need to be created.
>> The logic for handling these last snapshots correct.
>>
>> However, if a new snapshot is found whose id is less than an
>> existing one, the logic for handling this is wrong. The basic
>> problem is that the name and current id used here are taken from
>> a place one position beyond where they should be. In the event
>> this block of code is executed the first pass through the outer
>> loop, for example, these values will refer to invalid memory.
>>
>> The fix is to make this block of code more closely match the
>> block that handles adding new snapshots at the end. There are
>> three differences:
>> - we use a do..while loop because we know we initially have an
>> entry to process.
>> - we insert the new entry at a position within the list instead
>> of at the beginning
>> - because we use it in the loop condition, we need to update
>> our notion of "current id" each pass through.
>>
>> Signed-off-by: Alex Elder <[email protected]>
>> ---
>> drivers/block/rbd.c | 55
>> +++++++++++++++++++++++++++++----------------------
>> 1 file changed, 31 insertions(+), 24 deletions(-)
>>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index 71e3f3b..da09c0d 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -2090,13 +2090,14 @@ static int __rbd_init_snaps_header(struct
>> rbd_device *rbd_dev)
>> {
>> const char *name, *first_name;
>> int i = rbd_dev->header.total_snaps;
>> - struct rbd_snap *snap, *old_snap = NULL;
>> + struct rbd_snap *snap;
>> struct list_head *p, *n;
>>
>> first_name = rbd_dev->header.snap_names;
>> name = first_name + rbd_dev->header.snap_names_len;
>>
>> list_for_each_prev_safe(p, n, &rbd_dev->snaps) {
>> + struct rbd_snap *old_snap;
>> u64 cur_id;
>>
>> old_snap = list_entry(p, struct rbd_snap, node);
>> @@ -2104,12 +2105,16 @@ static int __rbd_init_snaps_header(struct
>> rbd_device *rbd_dev)
>> if (i)
>> cur_id = rbd_dev->header.snapc->snaps[i - 1];
>>
>> + /*
>> + * If old id is not present in the new context, or
>> + * if there are no more snapshots in the new
>> + * context, this snapshot should be removed.
>> + */
>> if (!i || old_snap->id < cur_id) {
>> /*
>> - * old_snap->id was skipped, thus was
>> - * removed. If this rbd_dev is mapped to
>> - * the removed snapshot, record that it no
>> - * longer exists, to prevent further I/O.
>> + * If this rbd_dev is mapped to the removed
>> + * snapshot, record that it no longer exists,
>> + * to prevent further I/O.
>> */
>> if (rbd_dev->snap_id == old_snap->id)
>> rbd_dev->snap_exists = false;
>> @@ -2122,34 +2127,36 @@ static int __rbd_init_snaps_header(struct
>> rbd_device *rbd_dev)
>> name = rbd_prev_snap_name(name, first_name);
>> continue;
>> }
>> - for (; i > 0;
>> - i--, name = rbd_prev_snap_name(name, first_name)) {
>> - if (!name) {
>> - WARN_ON(1);
>> +
>> + /*
>> + * A new snapshot (or possibly more than one)
>> + * appeared in the middle of our set of snapshots.
>> + * This could happen of two hosts raced while
>
> of -> if, and hosts -> clients (could be the same host, i.e. via the
> rbd snap create).
>
>> + * creating snapshots, and the one that was granted
>> + * a higher snapshot id managed to get its snapshot
>> + * saved before the other.
>> + */
>> + do {
>> + i--;
>> + name = rbd_prev_snap_name(name, first_name);
>> + if (WARN_ON(!name))
>> return -EINVAL;
>> - }
>> - cur_id = rbd_dev->header.snapc->snaps[i];
>> - /* snapshot removal? handle it above */
>> - if (cur_id >= old_snap->id)
>> - break;
>> - /* a new snapshot */
>> - snap = __rbd_add_snap_dev(rbd_dev, i - 1, name);
>> + snap = __rbd_add_snap_dev(rbd_dev, i, name);
>> if (IS_ERR(snap))
>> return PTR_ERR(snap);
>> -
>> /* note that we add it backward so using n and not p */
>> list_add(&snap->node, n);
>> - p = &snap->node;
>> - }
>> +
>> + cur_id = rbd_dev->header.snapc->snaps[i];
>> + } while (i && cur_id < old_snap->id);
>> }
>> /* we're done going over the old snap list, just add what's left */
>> - for (; i > 0; i--) {
>> + while (i) {
>> + i--;
>> name = rbd_prev_snap_name(name, first_name);
>> - if (!name) {
>> - WARN_ON(1);
>> + if (WARN_ON(!name))
>> return -EINVAL;
>> - }
>> - snap = __rbd_add_snap_dev(rbd_dev, i - 1, name);
>> + snap = __rbd_add_snap_dev(rbd_dev, i, name);
>> if (IS_ERR(snap))
>> return PTR_ERR(snap);
>> list_add(&snap->node, &rbd_dev->snaps);
>>
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html