Hey Becky,

Thanks for the recommendation. I turned on the debug statements, but it could
be days or longer before another segfault occurs; the three errors I have seen
occurred over the course of about five weeks.

Is there any way I could test either of those scenarios you mentioned or
otherwise narrow it down? I would really like to get this figured out before
any other file systems start experiencing problems.

Thanks,
Bart.


On Tue, Sep 21, 2010 at 2:39 PM, Becky Ligon <[email protected]> wrote:

> Bart:
>
> Try turning on the state machine gossip debug statements so you can see
> which request is being processed. While the servers are running, run:
> ...bin/pvfs2-set-debugmask -m <mountpoint> sm
>
> My guess is that the pvfs lib has gotten corrupted somehow, or there is
> some sort of mismatch between the client request that is sent and what the
> server expects.
>
> Becky
> --
> Becky Ligon
> PVFS Developer
> Clemson University
> 864-650-4065
>
> > Hey guys,
> >
> > I am running into some server segfaults with the latest release. I have a
> > single RHEL 5.5 64-bit virtual machine with 4 GB of memory running four
> > daemons using network-attached storage. This is 2.8.2 plus a couple of
> > recent patches.
> >
> > The log output looks like this:
> >
> > Sep 21 07:42:54 node1 PVFS2: [E] SM current state or trtbl is invalid
> > (smcb = 0x11d9a390)
> > Sep 21 07:42:54 node1 PVFS2: [E]      [bt]
> > /usr/sbin//pvfs2-server(PINT_state_machine_next+0x81) [0x43c2a0]
> > Sep 21 07:42:54 node1 PVFS2: [E]      [bt]
> > /usr/sbin//pvfs2-server(PINT_state_machine_continue+0x1d) [0x43c4af]
> > Sep 21 07:42:54 node1 PVFS2: [E]      [bt]
> > /usr/sbin//pvfs2-server(main+0x54f) [0x410ef2]
> > Sep 21 07:42:54 node1 PVFS2: [E]      [bt]
> > /lib64/libc.so.6(__libc_start_main+0xf4) [0x368ae1d994]
> > Sep 21 07:42:54 node1 PVFS2: [E]      [bt] /usr/sbin//pvfs2-server
> > [0x4108e9]
> >
> > The GDB backtrace looks like this:
> >
> > #0  0x000000368ae30265 in raise () from /lib64/libc.so.6
> > #1  0x000000368ae31d10 in abort () from /lib64/libc.so.6
> > #2  0x000000368ae296e6 in __assert_fail () from /lib64/libc.so.6
> > #3  0x000000000043c2b9 in PINT_state_machine_next (smcb=0x15d49f10,
> > r=0x15c359e0) at ../pvfs2_src/src/common/misc/state-machine-fns.c:246
> > #4  0x000000000043c4af in PINT_state_machine_continue (smcb=0x15d49f10,
> > r=0x15c359e0) at ../pvfs2_src/src/common/misc/state-machine-fns.c:327
> > #5  0x0000000000410ef2 in main (argc=6, argv=0x7fffb3baa138) at
> > ../pvfs2_src/src/server/pvfs2-server.c:413
> >
> > It looks like an assert is failing in this block of state-machine-fns.c:
> >
> > if (!smcb->current_state || !smcb->current_state->trtbl)
> > {
> >     gossip_err("SM current state or trtbl is invalid "
> >            "(smcb = %p)\n", smcb);
> >     gossip_backtrace();
> >     assert(0);
> >     return -1;
> > }
> >
> > Here is some GDB output of the relevant variables in the if test:
> >
> > (gdb) print smcb->current_state
> > $1 = (struct PINT_state_s *) 0x6d1990
> > (gdb) print smcb->current_state->trtbl
> > $2 = (struct PINT_tran_tbl_s *) 0x0
> >
> > This particular server has segfaulted at least 3 times with the same logs
> > and core data, so it does not seem to be a fluke. Does anyone recognize
> > this error or know why that translation table might be empty?
> >
> > Bart.
> > _______________________________________________
> > Pvfs2-developers mailing list
> > [email protected]
> > http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
> >
>
>