On 03/25/2014 08:18 AM, Olivier Bonvalet wrote:
>
>
> Le mardi 25 mars 2014 à 14:57 +0200, Ilya Dryomov a écrit :
>> On Tue, Mar 25, 2014 at 2:51 PM, Alex Elder <[email protected]> wrote:
>>> On 03/25/2014 07:34 AM, Ilya Dryomov wrote:
>>>>> On 03/25/2014 04:04 AM, Ilya Dryomov wrote:
>>>>>> On Tue, Mar 25, 2014 at 10:39 AM, Olivier Bonvalet <[email protected]>
>>>>>> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> what can/should I do to help fix that problem ?
>>>>>>>
>>>>>>> for now, RBD kernel client hang on :
>>>>>>> Assertion failure in rbd_img_obj_callback() at line 2131:
>>>>>>> rbd_assert(which >= img_request->next_completion);
>>>>>
>>>>> If you can build your own kernel as Ilya says I'd like to
>>>>> see the values of which and img_request->next_completion
>>>>> here.
>>>>
>>>> Looks like which was 1, which means that next_completion had to be 2 or
>>>> greater. I miss solaris crash dumps ...
>>>>
>>>> On a different note, why are we asserting next_completion outside of
>>>> a spinlock which is supposed to protect next_completion?
>>>
>>> That's a very good point (which could be easily remedied by moving
>>> the assertion down a couple lines). The image object request (#1)
>>> in this case will have been marked done at this point; it's possible
>>> that request #2 (or later) was concurrently getting handled by the
>>> for_each_obj_request_from() loop below in that same function, but
>>> may not have updated next_completion yet.
>>>
>>> So that *could* explain the tripped assertion. The assertion
>>> should be moved in any case, it's a bug.
>>>
>>> That being said, it doesn't explain the other assertion:
>>> rbd_assert(img_request != NULL);
>>> So there's at least one other thing going on.
>>
>> Yeah, exactly my thoughts.
>>
>> Thanks,
>>
>> Ilya
>
> So, a (partial) fix can be this patch ?
>
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -2123,6 +2123,7 @@ static void rbd_img_obj_callback(struct rbd_obj_request
> *obj_request)
> rbd_assert(obj_request_img_data_test(obj_request));
> img_request = obj_request->img_request;
>
> + spin_lock_irq(&img_request->completion_lock);
> dout("%s: img %p obj %p\n", __func__, img_request, obj_request);
> rbd_assert(img_request != NULL);
> rbd_assert(img_request->obj_request_count > 0);
> @@ -2130,7 +2131,6 @@ static void rbd_img_obj_callback(struct rbd_obj_request
> *obj_request)
> rbd_assert(which < img_request->obj_request_count);
> rbd_assert(which >= img_request->next_completion);
>
> - spin_lock_irq(&img_request->completion_lock);
> if (which != img_request->next_completion)
> goto out;
Yes, roughly. I'd do the following instead. It would be great
to learn whether it eliminates the one form of assertion failure
you were seeing.
-Alex
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2128,11 +2128,11 @@ static void rbd_img_obj_callback(struct
rbd_assert(img_request->obj_request_count > 0);
rbd_assert(which != BAD_WHICH);
rbd_assert(which < img_request->obj_request_count);
- rbd_assert(which >= img_request->next_completion);
spin_lock_irq(&img_request->completion_lock);
if (which != img_request->next_completion)
goto out;
+ rbd_assert(which > img_request->next_completion);
for_each_obj_request_from(img_request, obj_request) {
rbd_assert(more);
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html