Hi Sam,

Thanks for the reply. I think we found a workaround. Our code is a true hybrid, MPI/OpenMP: MPI across nodes and OpenMP inside each node. Each node has several I/O threads writing to PVFS. We usually send a kill signal to the MPI rank that started the OpenMP threads on the node. Sometimes that does not kill all the I/O threads: some are left hanging, and pvfs2-client-core dies.
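A minimal sketch of how leftover processes might be spotted after signalling a single PID, assuming a Linux node with the procps pgrep utility. The helper name kill_and_check and the process name my_hybrid_app are hypothetical, and pgrep reports whole processes rather than individual threads, so this only approximates the situation described above.

```shell
#!/bin/sh
# kill_and_check PID PGREP_ARGS...
# Send SIGTERM to one PID (as with plain kill), wait briefly, then list any
# processes that still match the given pgrep selection.
# Returns 0 if nothing survives, 1 if survivors were found (printed as
# "PID name", one per line).
kill_and_check() {
    pid=$1
    shift
    # Ignore the error if the PID is already gone.
    kill -TERM "$pid" 2>/dev/null || true
    # Grace period; the 2 seconds is an arbitrary choice for this sketch.
    sleep 2
    if pgrep -l "$@"; then
        return 1
    fi
    return 0
}

# Hypothetical usage: signal the launching rank, then look for stragglers
# by exact process name.
#   kill_and_check "$rank_pid" -x my_hybrid_app
```

If the helper prints anything, some matching processes outlived the signal sent to the one PID.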
If we use pkill instead of plain kill, pkill seems to kill all the I/O threads every time and things work as expected.

At the moment I can't recompile with debugging on, but I will try that out later.

Thanks,
Rene

> > Hi,
> >
> > We are getting some strange behavior out of pvfs-2.8.1 clients running
> > on some sles 10 sp 1 nodes.
> >
> > The pvfs2 clients can mount the pvfs2 file system with no problems. We
> > then start an MPI job that runs on a small number of nodes. The problem
> > happens when we try to kill the MPI job. As soon as we send the kill
> > signal to the MPI job, several of our pvfs2 client nodes have their
> > pvfs2-client-core daemon die with this message:
> >
> > hpcp6671:~ # ps -ef |grep pvfs
> > root 25767 1 0 12:21 ? 00:00:00 /bphpc5/vol0/salmr0/opt/pvfs-2.8.1/x86_64/sles10sp1/sbin/pvfs2-client -p /bphpc5/vol0/salmr0/opt/pvfs-2.8.1/x86_64/sles10sp1/sbin/pvfs2-client-core
> > root 16117 25767 0 15:02 ? 00:00:00 [pvfs2-client-co]
> >
> > hpcp6671:~ # cat /tmp/pvfs2-client.log
> > [E 12:21:35.567169] PVFS Client Daemon Started. Version 2.8.1
> > [D 12:21:35.567434] [INFO]: Mapping pointer 0x2acdf7aa3000 for I/O.
> > [D 12:21:35.579256] [INFO]: Mapping pointer 0x2acdf8ea5000 for I/O.
> > [E 15:02:54.988860] PVFS2 client: signal 11, faulty address is 0x41d5, from 0x408d81
> > [E 15:02:54.989282] [bt] pvfs2-client-core [0x408d81]
> > [E 15:02:54.989294] [bt] pvfs2-client-core [0x408d81]
> > [E 15:02:54.989302] [bt] pvfs2-client-core(main+0xbc3) [0x40a173]
> > [E 15:02:54.989309] [bt] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2acdf788b154]
> > [E 15:02:54.989315] [bt] pvfs2-client-core [0x403519]
> > [E 15:02:54.991351] Child process with pid 25768 was killed by an uncaught signal 6
>
> Hi Rene,
>
> This is a segfault in the client process. The daemon is restarting
> itself, which may be what the error below is from. I'll have to
> figure out what that 0x408d81 pointer maps to. Might not be all that
> useful though. Would you be willing to recompile with debugging
> enabled (rerun configure with CFLAGS=-g, and then rebuild)? That
> would at least give us line numbers to look at.
>
> > [E 15:02:54.993980] PVFS Client Daemon Started. Version 2.8.1
> > [D 15:02:54.994242] [INFO]: Mapping pointer 0x2b94619a2000 for I/O.
> > [D 15:02:55.008318] [INFO]: Mapping pointer 0x2b9462da4000 for I/O.
> > [E 15:02:55.312456] Got an unrecognized/unimplemented vfs operation of type ff000000.
> > [E 15:02:55.312497] Post of op: PVFS_VFS_OP_INVALID failed!
>
> I would try to fix the above before worrying about this one. It could
> be just fallout from the first failure.
>
> -sam
>
> > Any ideas?
> >
> > thanks
> > Rene
> >
> > _______________________________________________
> > Pvfs2-users mailing list
> > [email protected]
> > http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
