We haven't come across this issue so far. Can you post the complete backtrace from your debugger?
Avati

On Sun, Jul 3, 2011 at 1:21 PM, Emmanuel Dreyfus <[email protected]> wrote:

> Hi
>
> I get a reproducible crash of glusterfsd, running 3.2.1 code. I get it
> by running multiple tar -xzf on a client, and after a while, a
> glusterfsd on a brick crashes:
>
> Program terminated with signal 11, Segmentation fault.
> #0  0xba0d652e in server_rchecksum_cbk (frame=0xbad007d0,
>     cookie=0xbaf00300, this=0xba810000, op_ret=-1, op_errno=9,
>     weak_checksum=0, strong_checksum=0xb91ffc74 "") at
>     server3_1-fops.c:1305
>
> Here is the offending code
>
>     if (op_ret == -1)
>             gf_log (this->name, GF_LOG_INFO,
>                     "%"PRId64": RCHECKSUM %"PRId64" (%"PRId64") ==> %"PRId32" (%s)",
>                     frame->root->unique, state->resolve.fd_no,
>                     state->fd ? state->fd->inode->ino : 0, op_ret,
>                     strerror (op_errno));
>
> The problem is state->fd->inode value:
>
> (gdb) print *((server_state_t *)frame->root->state)->fd
> $7 = {pid = 2610, flags = 2, refcount = 2, inode_list =
>   {next = 0xb9801088, prev = 0xb9801088}, inode = 0xaaaaaaaa,
>   lock = {pts_magic = 3735879687, pts_spin = 0 '\0', pts_flags =
>   0}, _ctx = 0xbb96b080, xl_count = 8}
>
> inode = 0xaaaaaaaa is set in fd_destroy() to denote a stale object (It
> is less fun than using 0xdeadbeef :-)
>
> That suggests a race condition where a thread uses a fd that another
> thread destroyed. Of course, the value could be checked at the beginning
> of server_rchecksum_cbk(), but I suspect the problem is more widespread
> than this. There are many other places in server3_1-fops.c where
> state->fd->inode->ino is used.
>
> And should the value be checked at the beginning of
> server_rchecksum_cbk() and its friends, or in any gf_log() call, like
> this:
>
>     if (op_ret == -1)
>             gf_log (this->name, GF_LOG_INFO,
>                     "%"PRId64": RCHECKSUM %"PRId64" (%"PRId64") "
>                     "==> %"PRId32" (%s)",
>                     frame->root->unique, state->resolve.fd_no,
>                     state->fd && (state->fd->inode != 0xaaaaaaaa) ?
>                     state->fd->inode->ino : 0, op_ret,
>                     strerror (op_errno));
>
> FWIW this is a 2x2 replicated and distributed setup.
>
> --
> Emmanuel Dreyfus
> http://hcpnet.free.fr/pubz
> [email protected]
>
> _______________________________________________
> Gluster-devel mailing list
> [email protected]
> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>
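
For illustration only, the check proposed in the quoted message could be factored into one small helper, so that each gf_log() call does not have to repeat it. This is a minimal sketch under stated assumptions, not GlusterFS code: demo_inode_t, demo_fd_t, DEMO_STALE_INODE and demo_fd_ino_for_log() are hypothetical names, and the structures are stripped-down stand-ins for the real ones. The only detail taken from the report is the 0xaaaaaaaa marker seen in the gdb output.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the real inode and fd
 * structures; the real ones carry many more fields. */
typedef struct demo_inode {
        uint64_t ino;
} demo_inode_t;

typedef struct demo_fd {
        int           refcount;
        demo_inode_t *inode;
} demo_fd_t;

/* Marker value seen in the gdb output for a destroyed fd. The cast
 * goes through uintptr_t so the pointer comparison below is
 * well-formed and compiles without a pointer/integer warning. */
#define DEMO_STALE_INODE ((demo_inode_t *)(uintptr_t)0xaaaaaaaa)

/* Return the inode number to print in a log message, or 0 when the
 * fd is missing or its inode pointer is NULL or carries the marker. */
static uint64_t
demo_fd_ino_for_log (const demo_fd_t *fd)
{
        if (fd == NULL || fd->inode == NULL ||
            fd->inode == DEMO_STALE_INODE)
                return 0;

        return fd->inode->ino;
}
```

A log call would then pass demo_fd_ino_for_log (state->fd) where it currently dereferences state->fd->inode->ino. Note that this only avoids the crash in the logging path. If another thread can destroy the fd between the check and the dereference, or if the freed memory is reused, the check will not catch it, so the underlying use-after-destroy still needs a fix such as holding a reference on the fd for the lifetime of the call.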
