On 09/09/2013 11:15 AM, Josh Durgin wrote:
> On 09/09/2013 06:37 AM, Alex Elder wrote:
>> On 09/09/2013 02:17 AM, Josh Durgin wrote:
>>> Removing a device deallocates the disk, unschedules the watch, and
>>> finally cleans up the rbd_dev structure. rbd_dev_refresh(), called
>>> from the watch callback, updates the disk size and rbd_dev
>>> structure. With no locking between them, rbd_dev_refresh() may use the
>>> device or rbd_dev after they've been freed.
. . .
>>> + if (!removing) {
>>
>> Here's the problem. It is the same one faced by the open path.
>>
>> You determined above whether or not the device was getting removed.
>> But you haven't left any state that indicates that the image is
>> getting refreshed (or more specifically, getting its size updated).
>>
>> So, if the device is mapped but isn't actually opened by anybody
>> there's nothing stopping the code in rbd_remove() from going
>> ahead anyway. So the possibility of the structures getting freed
>> remains (though the window is a little smaller now).
>
> I think this is safe with the use of ceph_osdc_flush_notifies()
> before rbd_bus_del_dev(). Since the watch is already unregistered,
> flushing the notifies waits for all calls to rbd_watch_cb() to
> complete. This means that rbd_remove() can't actually free
> any of the rbd_dev structures until after the last revalidate_disk()
> etc. is done. Does this make sense, or am I missing something?
Yes, now I remember. You're right. That was the point of
doing the flush_notifies call before release anyway...
So my "nothing stopping the code in rbd_remove" was wrong.
In fact, that ceph_osdc_flush_notifies() call is stopping
it from freeing things that may still be in use.
Thanks for explaining it to me (again).
> This should probably go in a comment in rbd_remove(), since
> it's not obvious from the code there why the ordering of
> the last few calls makes sense.
Yes. I think so.
Reviewed-by: Alex Elder <[email protected]>
> Josh
>
>> I think you can resolve this problem by using the open count.
>> That is, use the same interlock between the open count and the
>> removing flag that's used for rbd_open() and rbd_remove().
>> The open count in some sense represents someone still needing
>> the data structures for the image. So treat the refresh activity
>> as one of those "somebodies" by claiming one of those opens
>> while the size update is going on.
>>
>> If you encapsulated some code now in place for this purpose,
>> something like the following might be helpful. (I've renamed
>> "open_count" to be "inuse_count" here.)
>>
>> static inline int rbd_request_access(struct rbd_device *rbd_dev)
>> {
>> int ret = 0;
>>
>> spin_lock_irq(&rbd_dev->lock);
>> if (!test_bit(RBD_DEV_FLAG_REMOVING, &rbd_dev->flags))
>> rbd_dev->inuse_count++;
>> else
>> ret = -ENOENT;
>> spin_unlock_irq(&rbd_dev->lock);
>>
>> return ret;
>> }
>>
>> static inline int rbd_release_access(struct rbd_device *rbd_dev)
>> {
>> int ret = 0;
>>
>> spin_lock_irq(&rbd_dev->lock);
>> if (rbd_dev->inuse_count)
>> rbd_dev->inuse_count--;
>> else
>> ret = -EINVAL;
>> spin_unlock_irq(&rbd_dev->lock);
>>
>> return ret;
>> }
>>
>> static inline int rbd_prevent_access(struct rbd_device *rbd_dev)
>> {
>> int ret = 0;
>>
>> spin_lock_irq(&rbd_dev->lock);
>> if (rbd_dev->inuse_count)
>> ret = -EBUSY;
>> else if (test_and_set_bit(RBD_DEV_FLAG_REMOVING, &rbd_dev->flags))
>> ret = -EINPROGRESS;
>> spin_unlock_irq(&rbd_dev->lock);
>>
>> return ret;
>> }
>>
>> I'll assume you know what I'm talking about at this point and
>> won't go into exactly where and how you'd use these.
>>
>>> + size = (sector_t)rbd_dev->mapping.size / SECTOR_SIZE;
>>> + dout("setting size to %llu sectors", (unsigned long long)size);
>>> + set_capacity(rbd_dev->disk, size);
>>> + revalidate_disk(rbd_dev->disk);
>>> + }
>>> +}+-
>>> +
>>> static int rbd_dev_refresh(struct rbd_device *rbd_dev)
>>> {
>>> u64 mapping_size;
>>
>> Everything below is good.
>>
>>> @@ -3344,12 +3369,7 @@ static int rbd_dev_refresh(struct rbd_device
>>> *rbd_dev)
>>> up_write(&rbd_dev->header_rwsem);
>>>
>>> if (mapping_size != rbd_dev->mapping.size) {
>>> - sector_t size;
>>> -
>>> - size = (sector_t)rbd_dev->mapping.size / SECTOR_SIZE;
>>> - dout("setting size to %llu sectors", (unsigned long long)size);
>>> - set_capacity(rbd_dev->disk, size);
>>> - revalidate_disk(rbd_dev->disk);
>>> + rbd_dev_update_size(rbd_dev);
>>> }
>>>
>>> return ret;
>>> @@ -5160,7 +5180,6 @@ static ssize_t rbd_remove(struct bus_type *bus,
>>> if (ret < 0 || already)
>>> return ret;
>>>
>>> - rbd_bus_del_dev(rbd_dev);
>>> ret = rbd_dev_header_watch_sync(rbd_dev, false);
>>> if (ret)
>>> rbd_warn(rbd_dev, "failed to cancel watch event (%d)\n", ret);
>>> @@ -5171,6 +5190,7 @@ static ssize_t rbd_remove(struct bus_type *bus,
>>> */
>>> dout("%s: flushing notifies", __func__);
>>> ceph_osdc_flush_notifies(&rbd_dev->rbd_client->client->osdc);
>>> + rbd_bus_del_dev(rbd_dev);
>>> rbd_dev_image_release(rbd_dev);
>>> module_put(THIS_MODULE);
>>>
>>>
>>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html