I'm testing out the tapdisk rbd driver that Sylvain wrote, under Xen, and have
been having all sorts of problems as the tapdisk process is segfaulting. To
make matters worse, any attempt to use gdb on the resulting core just tells me
it can't find the threads ('generic error'). Google tells me that I can get
around this error by linking the main executable (tapdisk) with libpthread,
but that doesn't help.
With strategically placed printfs I have confirmed that in most cases the
crash happens after a call to rbd_aio_read or rbd_aio_write and before the
callback is called. Given the async nature of tapdisk it's impossible to be
sure, but I'm confident that the crash is not happening in any of the tapdisk
code. It's possible that there is an off-by-one error in a buffer somewhere,
with the corruption only showing up later, but there really isn't a lot of
code there and I've been over it very closely and it appears quite sound.
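
For context, the submit path boils down to roughly the following (a simplified
sketch, not the real driver code; struct tdrbd_request and the tdrbd_* names
are made up for illustration):

/* Simplified sketch of the read submit path. */
#include <rbd/librbd.h>

struct tdrbd_request {
    char            *buf;
    uint64_t         off;
    size_t           len;
    rbd_completion_t comp;
};

/* librbd calls this back from one of its own threads when the I/O finishes. */
static void tdrbd_finish_aiocb(rbd_completion_t comp, void *arg)
{
    struct tdrbd_request *req = arg;
    ssize_t ret = rbd_aio_get_return_value(comp);

    /* ... hand the result back to the tapdisk event loop here ... */

    rbd_aio_release(comp);
    (void)req; (void)ret;
}

static int tdrbd_queue_read(rbd_image_t image, struct tdrbd_request *req)
{
    int ret = rbd_aio_create_completion(req, tdrbd_finish_aiocb, &req->comp);
    if (ret < 0)
        return ret;

    /* The segfault seems to land somewhere after this call returns and
     * before tdrbd_finish_aiocb ever fires. */
    ret = rbd_aio_read(image, req->off, req->len, req->buf, req->comp);
    if (ret < 0)
        rbd_aio_release(req->comp);
    return ret;
}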
I have also tested for multiple completions of the same request, and for
corrupt pointers being passed into the completion routine, and nothing shows
up there either.
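
Those checks were nothing fancy, roughly along these lines (a sketch only; the
field names are made up):

/* Sketch of the sanity checks (illustrative names, not the real code). */
#include <assert.h>
#include <stdint.h>

#define TDRBD_REQ_MAGIC 0x54524251u    /* arbitrary marker value */

struct tdrbd_request {
    uint32_t magic;      /* set to TDRBD_REQ_MAGIC when the request is issued */
    int      completed;  /* bumped each time the completion routine sees it */
    /* ... buffer, offset, length, completion handle ... */
};

static void tdrbd_check_completion(struct tdrbd_request *req)
{
    /* A corrupt or stale pointer should trip the magic check. */
    assert(req != NULL && req->magic == TDRBD_REQ_MAGIC);

    /* A request being completed twice should trip this. */
    req->completed++;
    assert(req->completed == 1);
}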
In most cases there is nothing precipitating the crash, aside from a tendency
to crash more often when the cluster is disturbed (e.g. when a mon node is
rebooted). I have one VM which will be unbootable for long periods of time,
with the crash happening during boot, typically when postgres starts. This can
be reproduced for hours and is useful for debugging, but then the problem goes
away spontaneously and I can no longer reproduce it even after hundreds of
reboots.
I'm using Debian, and the problem exists with both the latest cuttlefish and
dumpling debs.
So... does librbd have any internal self-checking options I can enable? If I'm
going to start injecting printfs around the place, can anyone suggest which
code paths are most likely to be causing the above?
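
For reference, the only knobs I'm aware of are the generic client-side debug
levels, roughly like this in the [client] section of ceph.conf (please correct
me if there is something more targeted):

[client]
    debug rbd = 20
    debug rados = 20
    debug objectcacher = 20
    debug ms = 1
    log file = /var/log/ceph/$name.$pid.log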
Thanks
James