On Tuesday, 25 March 2014 at 20:55 -0500, Alex Elder wrote:
> On 03/25/2014 08:50 PM, Olivier Bonvalet wrote:
> > On Wednesday, 26 March 2014 at 02:33 +0100, Olivier Bonvalet wrote:
> >> Thanks for your patch.
> >>
> >> Here is the output from one crash:
> >>
> >> Mar 26 02:31:18 alg kernel: [  965.366895] rbd_img_obj_callback: bad image object request information:
> >> Mar 26 02:31:18 alg kernel: [  965.366905] obj_request ffff880224bc9528
> >> Mar 26 02:31:18 alg kernel: [  965.366909]     ->object_name <(null)>
> >> Mar 26 02:31:18 alg kernel: [  965.366913]     ->offset 0
> >> Mar 26 02:31:18 alg kernel: [  965.366917]     ->length 4096
> >> Mar 26 02:31:18 alg kernel: [  965.366921]     ->type 0x1
> >> Mar 26 02:31:18 alg kernel: [  965.366925]     ->flags 0x3
> >> Mar 26 02:31:18 alg kernel: [  965.366929]     ->img_request           (null)
> >> Mar 26 02:31:18 alg kernel: [  965.366933]     ->which 4294967295
> >> Mar 26 02:31:18 alg kernel: [  965.366936]     ->xferred 4096
> >> Mar 26 02:31:18 alg kernel: [  965.366940]     ->result 0
> >> Mar 26 02:31:18 alg kernel: [  965.366943]     ->kref 0
> >> Mar 26 02:31:18 alg kernel: [  965.366947] img_request ffff880222f4fb50
> >> Mar 26 02:31:18 alg kernel: [  965.366950]     ->snap 0xfffffffffffffffe
> >> Mar 26 02:31:18 alg kernel: [  965.366954]     ->offset 1417662464
> >> Mar 26 02:31:18 alg kernel: [  965.366957]     ->length 16384
> >> Mar 26 02:31:18 alg kernel: [  965.366960]     ->flags 0x0
> >> Mar 26 02:31:18 alg kernel: [  965.366963]     ->obj_request_count 0
> >> Mar 26 02:31:18 alg kernel: [  965.366966]     ->next_completion 2
> >> Mar 26 02:31:18 alg kernel: [  965.366969]     ->xferred 16384
> >> Mar 26 02:31:18 alg kernel: [  965.366973]     ->result 0
> >> Mar 26 02:31:18 alg kernel: [  965.366976]     ->obj_requests head ffff880222f4fbb0
> >> Mar 26 02:31:18 alg kernel: [  965.366980]     ->kref 0
> >> Mar 26 02:31:18 alg kernel: [  965.366985] 
> >> Mar 26 02:31:18 alg kernel: [  965.366985] Assertion failure in rbd_img_obj_callback() at line 2165:
> >> Mar 26 02:31:18 alg kernel: [  965.366985] 
> >> Mar 26 02:31:18 alg kernel: [  965.366985]         rbd_assert(which == img_request->next_completion);
> >> Mar 26 02:31:18 alg kernel: [  965.366985] 
> >> Mar 26 02:31:18 alg kernel: [  965.367185] ------------[ cut here ]------------
> >> Mar 26 02:31:18 alg kernel: [  965.367241] kernel BUG at drivers/block/rbd.c:2165!
> >>
> >>
> >> I hope this helps.
> >>
> >>
> 
> 
> Thanks for sending these.
> 
> > 
> > And a second one, very similar:
> > 
> > Mar 26 02:48:27 alg kernel: [  681.167833] rbd_img_obj_callback: bad image object request information:
> > Mar 26 02:48:27 alg kernel: [  681.167836] obj_request ffff88022e1e2828
> > Mar 26 02:48:27 alg kernel: [  681.167837]     ->object_name <(null)>
> > Mar 26 02:48:27 alg kernel: [  681.167838]     ->offset 0
> > Mar 26 02:48:27 alg kernel: [  681.167839]     ->length 4096
> > Mar 26 02:48:27 alg kernel: [  681.167840]     ->type 0x1
> > Mar 26 02:48:27 alg kernel: [  681.167840]     ->flags 0x3
> > Mar 26 02:48:27 alg kernel: [  681.167841]     ->img_request           (null)
> > Mar 26 02:48:27 alg kernel: [  681.167842]     ->which 4294967295
> > Mar 26 02:48:27 alg kernel: [  681.167843]     ->xferred 4096
> > Mar 26 02:48:27 alg kernel: [  681.167844]     ->result 0
> > Mar 26 02:48:27 alg kernel: [  681.167844]     ->kref 0
> 
> This confirms the reference count of the object request has gone
> to zero.  This object request has already been destroyed (yet
> we're handling a callback for it).
> 
> > Mar 26 02:48:27 alg kernel: [  681.167845] img_request ffff88021f555f10
> > Mar 26 02:48:27 alg kernel: [  681.167846]     ->snap 0xfffffffffffffffe
> > Mar 26 02:48:27 alg kernel: [  681.167847]     ->offset 28072464384
> > Mar 26 02:48:27 alg kernel: [  681.167847]     ->length 16384
> > Mar 26 02:48:27 alg kernel: [  681.167848]     ->flags 0x0
> > Mar 26 02:48:27 alg kernel: [  681.167849]     ->obj_request_count 0
> > Mar 26 02:48:27 alg kernel: [  681.167850]     ->next_completion 2
> > Mar 26 02:48:27 alg kernel: [  681.167850]     ->xferred 16384
> > Mar 26 02:48:27 alg kernel: [  681.167851]     ->result 0
> > Mar 26 02:48:27 alg kernel: [  681.167852]     ->obj_requests head ffff88021f555f70
> 
> The object request list is empty.
> 
> > Mar 26 02:48:27 alg kernel: [  681.167853]     ->kref 0
> 
> This confirms the reference count of the image request has gone
> to zero.  So not only has the object request already completed,
> the image request has as well.
> 
> I'm almost done composing a very large e-mail with some detailed
> analysis.  No answer quite yet, but I am certain that we're
> getting duplicate callbacks on the second object request of
> an image request that spans two objects.  That should help
> narrow the search for the root cause.
> 
>                                       -Alex

Thanks again for taking the time to analyze this problem.
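
The "image request that spans two objects" observation can be checked
against the logged offsets. The sketch below splits each img_request
(offset, length) into per-object extents. It assumes 4 MiB objects
(order 22); the log does not state the object size, so that value and
the helper name split_into_object_extents are assumptions made only for
illustration. Under that assumption, both crashes show a 16384-byte
request whose last 4096 bytes land at offset 0 of a second object,
which matches the dumped obj_request (offset 0, length 4096).

```python
# Sketch: split an image request into per-object extents.
# ASSUMPTION: 4 MiB objects (order 22); the log does not give the object size.
OBJECT_SIZE = 1 << 22


def split_into_object_extents(img_offset, img_length, object_size=OBJECT_SIZE):
    """Return a list of (object_number, offset_in_object, length) tuples."""
    extents = []
    offset, remaining = img_offset, img_length
    while remaining > 0:
        obj_no, obj_off = divmod(offset, object_size)
        # An extent never crosses an object boundary.
        length = min(remaining, object_size - obj_off)
        extents.append((obj_no, obj_off, length))
        offset += length
        remaining -= length
    return extents


# img_request offsets taken from the two crash reports quoted above.
for img_offset in (1417662464, 28072464384):
    print(img_offset, split_into_object_extents(img_offset, 16384))
```

With a different object size the split would change, so this only
shows that the logs are consistent with the two-object reading.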

All my RBD images have daily snapshots. Could this bug be related to
snapshots?
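
One reading of the dump that bears on this question, offered as an
observation and not as an answer: two of the printed values look like
sentinels. The sketch below decodes them, assuming the constants match
the kernel headers as I understand them (CEPH_NOSNAP defined as (u64)-2,
and rbd.c using U32_MAX as its "bad which" marker). The function names
are hypothetical. If that assumption holds, the failing requests were
addressed to the image head and not to a snapshot, though that alone
does not rule snapshots out as a factor.

```python
# Sketch: interpret sentinel-looking values from the crash dump.
# ASSUMPTION: these constants mirror the kernel headers as I understand them.
U32_MAX = (1 << 32) - 1       # assumed "bad which" marker in rbd.c
CEPH_NOSNAP = (1 << 64) - 2   # assumed value of CEPH_NOSNAP, i.e. (u64)-2


def describe_which(which):
    """Describe an obj_request->which value."""
    if which == U32_MAX:
        return "not attached to an image request"
    return f"index {which}"


def describe_snap(snap_id):
    """Describe an img_request->snap value."""
    if snap_id == CEPH_NOSNAP:
        return "image head (no snapshot)"
    return f"snapshot id {snap_id}"


print(describe_which(4294967295))
print(describe_snap(0xfffffffffffffffe))
```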

Maybe it's a stupid question, but is there a workaround I could use to
mitigate the problem in production until a proper fix is found?

