And, your log file is going to get really big, too. I would try recompiling everything and restarting the servers and clients, just to be sure. Otherwise, I'm not sure; I've never seen that particular problem before, so any guess of mine is just a wild guess!
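If it helps to picture what that guard is doing, here is a stripped-down sketch of the table-driven state machine pattern behind it. It is not the real PVFS code -- machine_next, tran_tbl_s, and the other names are invented stand-ins for PINT_state_machine_next and PINT_tran_tbl_s -- but it trips the same kind of assert when a state's trtbl pointer is NULL, which is exactly what your gdb output below shows:

    /*
     * Illustrative sketch only, NOT PVFS source.  Each state carries a
     * transition table (trtbl) mapping an action's return value to the
     * next state.  If the dispatcher ever lands on a state whose trtbl
     * pointer is NULL, it has nothing to walk, so it aborts.
     */
    #include <assert.h>
    #include <stdio.h>

    struct state_s;                    /* forward declaration */

    struct tran_tbl_s {
        int return_value;              /* action return value to match */
        struct state_s *next_state;    /* where to go on that value */
    };

    struct state_s {
        const char *name;
        int (*action)(void);           /* does this state's work */
        struct tran_tbl_s *trtbl;      /* (return value -> next state) */
    };

    struct smcb_s {
        struct state_s *current_state;
    };

    /* Mirrors the shape of the guard quoted below from
     * PINT_state_machine_next(): refuse to dereference a NULL table. */
    static int machine_next(struct smcb_s *smcb, int retval)
    {
        struct tran_tbl_s *t;

        if (!smcb->current_state || !smcb->current_state->trtbl)
        {
            fprintf(stderr,
                    "SM current state or trtbl is invalid (smcb = %p)\n",
                    (void *)smcb);
            assert(0);
            return -1;
        }

        /* Walk the table; a NULL next_state terminates it in this sketch. */
        for (t = smcb->current_state->trtbl; t->next_state; t++)
        {
            if (t->return_value == retval)
            {
                smcb->current_state = t->next_state;
                return 0;
            }
        }
        return -1;                     /* no matching transition */
    }

    static int noop_action(void) { return 0; }

    int main(void)
    {
        /* A state whose trtbl was never filled in -- the same situation
         * your core dump shows, where trtbl printed as 0x0. */
        struct state_s broken = { "broken", noop_action, NULL };
        struct smcb_s smcb = { &broken };

        return machine_next(&smcb, broken.action());  /* fires the assert */
    }

Run it and you get the same style of error line followed by an abort, matching your server log and backtrace. Since those transition tables are normally fixed at compile time, a valid current_state pointing at a NULL trtbl is part of why a corrupted lib or a client/server mismatch is my first suspicion.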
Becky

--
Becky Ligon
PVFS Developer
Clemson University
864-650-4065

> Hey Becky,
>
> Thanks for the recommendation. I turned on the debugging, but it could be
> days or longer before another segfault occurs. The three errors I have
> seen occurred over the course of about 5 weeks.
>
> Is there any way I could test either of those scenarios you mentioned or
> otherwise narrow it down? I would really like to get this figured out
> before any other file systems start experiencing problems.
>
> Thanks,
> Bart.
>
>
> On Tue, Sep 21, 2010 at 2:39 PM, Becky Ligon <[email protected]> wrote:
>
>> Bart:
>>
>> Try turning on the state machine gossip debug statements while the
>> servers are running, so you can see which request is being processed:
>>
>>     ...bin/pvfs2-set-debugmask -m <mountpoint> sm
>>
>> My guess is that the pvfs lib has gotten corrupted somehow, or that
>> there is some sort of mismatch in the client request that is sent to
>> the server.
>>
>> Becky
>> --
>> Becky Ligon
>> PVFS Developer
>> Clemson University
>> 864-650-4065
>>
>> > Hey guys,
>> >
>> > I am running into some server segfaults with the latest release. I
>> > have a single RHEL 5.5 64-bit virtual machine with 4GB memory running
>> > four daemons using network-attached storage. This is 2.8.2 plus a
>> > couple of recent patches.
>> >
>> > The log output looks like this:
>> >
>> > Sep 21 07:42:54 node1 PVFS2: [E] SM current state or trtbl is invalid (smcb = 0x11d9a390)
>> > Sep 21 07:42:54 node1 PVFS2: [E] [bt] /usr/sbin//pvfs2-server(PINT_state_machine_next+0x81) [0x43c2a0]
>> > Sep 21 07:42:54 node1 PVFS2: [E] [bt] /usr/sbin//pvfs2-server(PINT_state_machine_continue+0x1d) [0x43c4af]
>> > Sep 21 07:42:54 node1 PVFS2: [E] [bt] /usr/sbin//pvfs2-server(main+0x54f) [0x410ef2]
>> > Sep 21 07:42:54 node1 PVFS2: [E] [bt] /lib64/libc.so.6(__libc_start_main+0xf4) [0x368ae1d994]
>> > Sep 21 07:42:54 node1 PVFS2: [E] [bt] /usr/sbin//pvfs2-server [0x4108e9]
>> >
>> > The GDB backtrace looks like this:
>> >
>> > #0  0x000000368ae30265 in raise () from /lib64/libc.so.6
>> > #1  0x000000368ae31d10 in abort () from /lib64/libc.so.6
>> > #2  0x000000368ae296e6 in __assert_fail () from /lib64/libc.so.6
>> > #3  0x000000000043c2b9 in PINT_state_machine_next (smcb=0x15d49f10,
>> >     r=0x15c359e0) at ../pvfs2_src/src/common/misc/state-machine-fns.c:246
>> > #4  0x000000000043c4af in PINT_state_machine_continue (smcb=0x15d49f10,
>> >     r=0x15c359e0) at ../pvfs2_src/src/common/misc/state-machine-fns.c:327
>> > #5  0x0000000000410ef2 in main (argc=6, argv=0x7fffb3baa138) at
>> >     ../pvfs2_src/src/server/pvfs2-server.c:413
>> >
>> > It looks like an assert is failing in this block of state-machine-fns.c:
>> >
>> >     if (!smcb->current_state || !smcb->current_state->trtbl)
>> >     {
>> >         gossip_err("SM current state or trtbl is invalid "
>> >                    "(smcb = %p)\n", smcb);
>> >         gossip_backtrace();
>> >         assert(0);
>> >         return -1;
>> >     }
>> >
>> > Here is some GDB output of the relevant variables in the if test:
>> >
>> > (gdb) print smcb->current_state
>> > $1 = (struct PINT_state_s *) 0x6d1990
>> > (gdb) print smcb->current_state->trtbl
>> > $2 = (struct PINT_tran_tbl_s *) 0x0
>> >
>> > This particular server has segfaulted at least 3 times with the same
>> > logs and core data, so it does not seem to be a fluke. Does anyone
>> > recognize this error or know why that transition table might be NULL?
>> >
>> > Bart.
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
