On Fri, 16 Aug 2013, James Harper wrote:
> >
> > On Fri, 16 Aug 2013, James Harper wrote:
> > > I'm testing out the tapdisk rbd driver that Sylvain wrote under Xen,
> > > and have been having all sorts of problems as the tapdisk process is
> > > segfaulting. To make matters worse, any attempt to use gdb on the
> > > resulting core just tells me it can't find the threads ('generic
> > > error'). Google tells me I can get around this error by linking the
> > > main exe (tapdisk) with libpthread, but that doesn't help.
> > >
> > > With strategic printf's I have confirmed that in most cases the
> > > crash happens after a call to rbd_aio_read or rbd_aio_write and
> > > before the callback is called. Given the async nature of tapdisk
> > > it's impossible to be sure, but I'm confident that the crash is not
> > > happening in any of the tapdisk code. It's possible that there is an
> > > off-by-one error in a buffer somewhere with the corruption showing
> > > up later, but there really isn't a lot of code there and I've been
> > > over it very closely and it appears quite sound.
> > >
> > > I have also tested for multiple completions of the same request, and
> > > for corrupt pointers being passed into the completion routine, and
> > > nothing shows up there either.
> > >
> > > In most cases there is nothing precipitating the crash, aside from a
> > > tendency to crash more often when the cluster is disturbed (e.g. a
> > > mon node is rebooted). I have one VM which will be unbootable for
> > > long periods of time, with the crash happening during boot, typically
> > > when postgres starts. This can be reproduced for hours and is useful
> > > for debugging, but then suddenly the problem goes away and I can no
> > > longer reproduce it even after hundreds of reboots.
> > >
> > > I'm using Debian, and the problem exists with both the latest
> > > cuttlefish and dumpling debs.
> > >
> > > So... does librbd have any internal self-checking options I can
> > > enable? If I'm going to start injecting printf's around the place,
> > > can anyone suggest what code paths are most likely to be causing
> > > the above?
> >
> > This is usually about the time when we try running things under
> > valgrind. Is that an option with tapdisk?
>
> Never used it before. I guess I can find out pretty easily; I'll try that next.
>
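The invocation itself is not much more than

  valgrind --tool=memcheck --track-origins=yes tapdisk <its usual arguments>

-- the fiddly bit will be getting blktap to start tapdisk under valgrind
rather than exec'ing it directly, so treat that as a sketch rather than a
recipe. Memcheck is very good at catching exactly the sort of buffer
overrun you're suspecting, though.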
> > Of course, the old standby is to just crank up the logging detail and try
> > to narrow down where the crash happens. Have you tried that yet?
>
> I haven't touched the rbd code. Is increased logging a compile-time
> option or a config option?
That is probably the first thing to try, then. In the [client] section
of ceph.conf on the node where tapdisk is running, add something like:
[client]
debug rbd = 20
debug rados = 20
debug ms = 1
log file = /var/log/ceph/client.$name.$pid.log
and make sure the log directory is writable.
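
Beyond that logging there isn't much in the way of self-checking you can
switch on, so if you do end up adding printf's, the completion callback
the driver hands to librbd is where I would hang the first checks. Purely
as a sketch against the plain librbd C API (the struct and function names
below are made up; the real tapdisk rbd driver keeps its own per-request
state), something along these lines will catch a completion firing twice
or returning an error:

#include <stdio.h>
#include <stdint.h>
#include <rbd/librbd.h>

struct tap_rbd_req {
    char   *buf;
    size_t  len;
    int     completed;  /* guard against the callback firing twice */
};

static void rbd_read_done(rbd_completion_t c, void *arg)
{
    struct tap_rbd_req *req = arg;
    ssize_t ret = rbd_aio_get_return_value(c);

    if (req->completed++)
        fprintf(stderr, "BUG: second completion for req %p\n", (void *)req);
    if (ret < 0)
        fprintf(stderr, "aio read failed on req %p: %zd\n", (void *)req, ret);

    rbd_aio_release(c);
    /* ... hand the buffer back to tapdisk here ... */
}

static int submit_read(rbd_image_t image, uint64_t off, struct tap_rbd_req *req)
{
    rbd_completion_t c;
    int r = rbd_aio_create_completion(req, rbd_read_done, &c);
    if (r < 0)
        return r;
    r = rbd_aio_read(image, off, req->len, req->buf, c);
    if (r < 0)
        rbd_aio_release(c);  /* submit failed; the callback will not run */
    return r;
}

With debug ms = 1 you can then line the request that was in flight up
against the last few osd ops in the client log when it dies.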
> > There is a probable issue with aio_flush and caching enabled that Mike
> > Dawson is trying to reproduce. Are you running with caching on or off?
>
> I have not enabled caching, and I believe it's disabled by default.
There is a fix for an aio hang that just hit the cuttlefish branch today
and that could conceivably be your issue. It causes a hang under qemu,
but maybe tapdisk is more sensitive? I'd make sure you're running with
that fix in any case, just to rule it out.
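
And yes, the cache is off by default in cuttlefish and dumpling. If you
want to make that explicit while you're poking at this you can also drop

  rbd cache = false

into the same [client] section, though I wouldn't expect it to change
anything.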
sage