On 03/25/2014 08:18 AM, Olivier Bonvalet wrote:
> 
> 
> Le mardi 25 mars 2014 à 14:57 +0200, Ilya Dryomov a écrit :
>> On Tue, Mar 25, 2014 at 2:51 PM, Alex Elder <[email protected]> wrote:
>>> On 03/25/2014 07:34 AM, Ilya Dryomov wrote:
>>>>> On 03/25/2014 04:04 AM, Ilya Dryomov wrote:
>>>>>> On Tue, Mar 25, 2014 at 10:39 AM, Olivier Bonvalet <[email protected]> 
>>>>>> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> what can/should I do to help fix that problem ?
>>>>>>>
>>>>>>> for now, RBD kernel client hang on :
>>>>>>>         Assertion failure in rbd_img_obj_callback() at line 2131:
>>>>>>>            rbd_assert(which >= img_request->next_completion);
>>>>>
>>>>> If you can build your own kernel as Ilya says I'd like to
>>>>> see the values of which and img_request->next_completion
>>>>> here.
>>>>
>>>> Looks like which was 1, which means that next_completion had to be 2 or
>>>> greater.  I miss solaris crash dumps ...
>>>>
>>>> On a different note, why are we asserting next_completion outside of
>>>> a spinlock which is supposed to protect next_completion?
>>>
>>> That's a very good point (which could be easily remedied by moving
>>> the assertion down a couple lines).  The image object request (#1)
>>> in this case will have been marked done at this point; it's possible
>>> that request #2 (or later) was concurrently getting handled by the
>>> for_each_obj_request_from() loop below in that same function, but
>>> may not have updated next_completion yet.
>>>
>>> So that *could* explain the tripped assertion.  The assertion
>>> should be moved in any case, it's a bug.
>>>
>>> That being said, it doesn't explain the other assertion:
>>>         rbd_assert(img_request != NULL);
>>> So there's at least one other thing going on.
>>
>> Yeah, exactly my thoughts.
>>
>> Thanks,
>>
>>                 Ilya
> 
> So, a (partial) fix can be this patch ?
> 
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -2123,6 +2123,7 @@ static void rbd_img_obj_callback(struct rbd_obj_request 
> *obj_request)
>         rbd_assert(obj_request_img_data_test(obj_request));
>         img_request = obj_request->img_request;
>  
> +       spin_lock_irq(&img_request->completion_lock);
>         dout("%s: img %p obj %p\n", __func__, img_request, obj_request);
>         rbd_assert(img_request != NULL);
>         rbd_assert(img_request->obj_request_count > 0);
> @@ -2130,7 +2131,6 @@ static void rbd_img_obj_callback(struct rbd_obj_request 
> *obj_request)
>         rbd_assert(which < img_request->obj_request_count);
>         rbd_assert(which >= img_request->next_completion);
>  
> -       spin_lock_irq(&img_request->completion_lock);
>         if (which != img_request->next_completion)
>                 goto out;


Yes, roughly.  I'd do the following instead.  It would be great
to learn whether it eliminates the one form of assertion failure
you were seeing.

                                        -Alex

--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2128,11 +2128,11 @@ static void rbd_img_obj_callback(struct
        rbd_assert(img_request->obj_request_count > 0);
        rbd_assert(which != BAD_WHICH);
        rbd_assert(which < img_request->obj_request_count);
-       rbd_assert(which >= img_request->next_completion);

        spin_lock_irq(&img_request->completion_lock);
        if (which != img_request->next_completion)
                goto out;
+       rbd_assert(which > img_request->next_completion);

        for_each_obj_request_from(img_request, obj_request) {
                rbd_assert(more);


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to