On Mar 26, 2009, at 3:43 PM, Rene Salmon wrote:
Hi,
We are getting some strange behavior out of pvfs-2.8.1 clients running
on some SLES 10 SP1 nodes.
The pvfs2 clients can mount the pvfs2 file system with no problems. We
then start an MPI job that runs on a small number of nodes. The problem
happens when we try to kill the MPI job. As soon as we send the kill
signal to the MPI job, the pvfs2-client-core daemon dies on several of
our pvfs2 client nodes with this message:
hpcp6671:~ # ps -ef |grep pvfs
root 25767 1 0 12:21 ? 00:00:00 /bphpc5/vol0/salmr0/opt/pvfs-2.8.1/x86_64/sles10sp1/sbin/pvfs2-client -p /bphpc5/vol0/salmr0/opt/pvfs-2.8.1/x86_64/sles10sp1/sbin/pvfs2-client-core
root 16117 25767 0 15:02 ? 00:00:00 [pvfs2-client-co]
hpcp6671:~ # cat /tmp/pvfs2-client.log
[E 12:21:35.567169] PVFS Client Daemon Started. Version 2.8.1
[D 12:21:35.567434] [INFO]: Mapping pointer 0x2acdf7aa3000 for I/O.
[D 12:21:35.579256] [INFO]: Mapping pointer 0x2acdf8ea5000 for I/O.
[E 15:02:54.988860] PVFS2 client: signal 11, faulty address is 0x41d5, from 0x408d81
[E 15:02:54.989282] [bt] pvfs2-client-core [0x408d81]
[E 15:02:54.989294] [bt] pvfs2-client-core [0x408d81]
[E 15:02:54.989302] [bt] pvfs2-client-core(main+0xbc3) [0x40a173]
[E 15:02:54.989309] [bt] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2acdf788b154]
[E 15:02:54.989315] [bt] pvfs2-client-core [0x403519]
[E 15:02:54.991351] Child process with pid 25768 was killed by an uncaught signal 6
Hi Rene,
This is a segfault (signal 11) in the client process. The daemon is
restarting itself, which may be what the error below is from. I'll
have to figure out what that 0x408d81 address maps to, though that
might not be all that useful. Would you be willing to recompile with
debugging enabled (rerun configure with CFLAGS=-g, and then rebuild)?
That would at least give us line numbers to look at.
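As a rough sketch of how the frame addresses could be looked up once a
-g build is in place: the script below pulls the pvfs2-client-core
addresses out of the [bt] lines and builds an addr2line command for
them. The function names are hypothetical and not part of PVFS2, and
it assumes addr2line from GNU binutils is installed and that the
binary is the same one that produced the log.

```python
import re

# Matches backtrace lines from pvfs2-client.log such as:
#   [E 15:02:54.989302] [bt] pvfs2-client-core(main+0xbc3) [0x40a173]
# The "(symbol+offset)" part is optional, and \s+ tolerates lines that
# were wrapped before the bracketed address.
BT_LINE = re.compile(
    r"\[bt\]\s+(?P<module>[^\s(\[]+)"
    r"(?:\((?P<symbol>[^)]*)\))?"
    r"\s+\[(?P<addr>0x[0-9a-fA-F]+)\]"
)

# Excerpt of the log posted in this thread.
SAMPLE_LOG = """\
[E 15:02:54.988860] PVFS2 client: signal 11, faulty address is 0x41d5, from 0x408d81
[E 15:02:54.989282] [bt] pvfs2-client-core [0x408d81]
[E 15:02:54.989294] [bt] pvfs2-client-core [0x408d81]
[E 15:02:54.989302] [bt] pvfs2-client-core(main+0xbc3) [0x40a173]
[E 15:02:54.989309] [bt] /lib64/libc.so.6(__libc_start_main+0xf4)
[0x2acdf788b154]
[E 15:02:54.989315] [bt] pvfs2-client-core [0x403519]
"""


def backtrace_frames(log_text, module="pvfs2-client-core"):
    """Return the unique frame addresses for `module`, in log order."""
    frames = []
    for match in BT_LINE.finditer(log_text):
        addr = match.group("addr")
        if match.group("module") == module and addr not in frames:
            frames.append(addr)
    return frames


def addr2line_command(binary, addrs):
    """Build (but do not run) an addr2line invocation.

    -e selects the executable, -f also prints function names.
    """
    return ["addr2line", "-f", "-e", binary] + list(addrs)


if __name__ == "__main__":
    # Path taken from the ps output earlier in the thread.
    binary = ("/bphpc5/vol0/salmr0/opt/pvfs-2.8.1/x86_64/sles10sp1/"
              "sbin/pvfs2-client-core")
    print(" ".join(addr2line_command(binary, backtrace_frames(SAMPLE_LOG))))
```

Running the printed command against a binary built without -g will
typically give only ?? placeholders for file and line, which is why the
rebuild is needed first.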
[E 15:02:54.993980] PVFS Client Daemon Started. Version 2.8.1
[D 15:02:54.994242] [INFO]: Mapping pointer 0x2b94619a2000 for I/O.
[D 15:02:55.008318] [INFO]: Mapping pointer 0x2b9462da4000 for I/O.
[E 15:02:55.312456] Got an unrecognized/unimplemented vfs operation of type ff000000.
[E 15:02:55.312497] Post of op: PVFS_VFS_OP_INVALID failed!
I would try to fix the above before worrying about this one. It could
be just fallout from the first failure.
-sam
Any ideas?
thanks
Rene
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users