Hello,
I'm new to the list, and I have a problem that I cannot solve on my own.
I maintain a cluster which used to work fine.
After a reboot, required by the installation of a UPS, I cannot start
gmond on the server, although it runs fine on the nodes.
I have Scientific Linux SL release 5.3 (Boron) installed on all machines.
I have the following kernel on all the nodes:
[root@cn-smpi sbin]# uname -a
Linux cn-smpi.itim-cj.ro 2.6.18-194.8.1.el5 #1 SMP Thu Jul 1 16:05:53 EDT 2010
x86_64 x86_64 x86_64 GNU/Linux
Below you can see some details:
[root@cn-smpi sbin]# /sbin/service gmond start
Starting GANGLIA gmond: [ OK ]
[root@cn-smpi sbin]# /sbin/service gmond status
gmond dead but subsys locked
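For reference, on Red Hat style init scripts "dead but subsys locked" usually means that no gmond process and no pid file were found, but the lock file left by the start script is still there. Below is a simplified sketch of that check, not the real code from /etc/init.d/functions. The function name daemon_state is made up for illustration, and the default paths /var/run and /var/lock/subsys are assumptions about this setup.

```shell
# Simplified sketch of the checks behind "service <name> status".
# The directories are parameters so the logic can be tried anywhere.
daemon_state() {
    name=$1
    rundir=${2:-/var/run}            # assumed pid file directory
    lockdir=${3:-/var/lock/subsys}   # assumed lock file directory
    if pidof "$name" >/dev/null 2>&1; then
        echo "running"
    elif [ -f "$rundir/$name.pid" ]; then
        echo "dead but pid file exists"
    elif [ -f "$lockdir/$name" ]; then
        echo "dead but subsys locked"
    else
        echo "stopped"
    fi
}
```

Calling "daemon_state gmond" on the server should then report the same state as the service command above, if those default paths match the installation.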
On all the nodes gmond is running:
[root@cn-smpi sbin]# cexec /sbin/service gmond status
* cn-sge-1 *
- cn-mpi01-
gmond (pid 13623) is running...
- cn-mpi02-
gmond (pid 11971) is running...
[root@cn-smpi log]# grep segfault messages
Feb 18 12:42:37 cn-smpi kernel: gmond[14493]: segfault at 0008 rip
003c2d0b54dd rsp 7de116d8 error 4
Feb 18 12:51:18 cn-smpi kernel: gmond[14952]: segfault at 0008 rip
003c2d0b54dd rsp 7fff393f13e8 error 4
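The "error 4" field in these kernel lines can be decoded by hand. A minimal sketch, assuming the usual x86 page-fault error-code bits (bit 0: protection violation, bit 1: write access, bit 2: user mode). The function name decode_segv_error is made up for illustration.

```shell
# Decode the "error N" field of an x86 kernel segfault message.
# Bit 0: 0 = page not present, 1 = protection violation
# Bit 1: 0 = read access,      1 = write access
# Bit 2: 0 = kernel mode,      1 = user mode
decode_segv_error() {
    code=$1
    if [ $((code & 1)) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    if [ $((code & 2)) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $((code & 4)) -ne 0 ]; then mode="user"; else mode="kernel"; fi
    echo "$mode-mode $access, $cause"
}

decode_segv_error 4
```

For error 4 this gives a user-mode read of a page that is not present. Together with the fault address 0008, that would be consistent with reading through a NULL pointer at a small offset, but that is only a guess until the backtrace confirms it.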
As far as I can see, the fault always occurs at the same rip, 003c2d0b54dd.
I tried to debug it to see where the problem is:
[root@cn-smpi cfloare]# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 134143
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 134143
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
[root@cn-smpi cfloare]# ulimit -c unlimited
[root@cn-smpi cfloare]# ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 134143
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 134143
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
[root@cn-smpi cfloare]#
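Note that the core limit is a per-process setting inherited by child processes, so gmond has to be started from the same shell in which the limit was raised. A small sketch for checking the settings that decide whether and where a core file gets written. The function name show_core_settings is made up for illustration, and the two /proc files are assumed to exist on this kernel.

```shell
# Print the core-size limit of the current shell and, where readable,
# the kernel settings that control the name of the core file.
show_core_settings() {
    echo "core limit: $(ulimit -c)"
    for f in /proc/sys/kernel/core_pattern /proc/sys/kernel/core_uses_pid; do
        if [ -r "$f" ]; then
            echo "$f: $(cat "$f")"
        fi
    done
    return 0
}

show_core_settings
```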
[root@cn-smpi sbin]# ./gmond
[root@cn-smpi sbin]# ls co
convertquota core.14952 cossdump
[root@cn-smpi sbin]# ll core*
-rw------- 1 root root 2457600 Feb 18 12:51 core.14952
[root@cn-smpi sbin]# date
Fri Feb 18 12:52:02 EET 2011
[root@cn-smpi sbin]# gdb --core=./core.14952 ./gmond
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
Reading symbols from /lib64/libresolv.so.2...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /usr/lib64/libganglia-3.1.7.so.0...done.
Loaded symbols for /usr/lib64/libganglia-3.1.7.so.0
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libnsl.so.1...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /lib64/libpcre.so.0...done.
Loaded symbols for /lib64/libpcre.so.0
Reading symbols from /lib64/libexpat.so.0...done.
Loaded symbols for /lib64/libexpat.so.0
Reading symbols from /usr/lib64/libconfuse.so.0...done.
Loaded symbols for /usr/lib64/libconfuse.so.0
Reading symbols from /usr/lib64/libapr-1.so.0...done.
Loaded symbols for /usr/lib64/libapr-1.so.0
Reading symbols from /lib64/libpthread.so.0...done.
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libuuid.so.1...done.
Loaded symbols for /lib64/libuuid.so.1
Reading symbols from /lib64/libcrypt.so.1...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /usr/lib64/ganglia/modcpu.so...done.
Loaded symbols for /usr/lib64/ganglia/modcpu.so
Reading symbols from /usr/lib64/ganglia/moddisk.so...done.
Loaded symbols for /usr/lib64/ganglia/moddisk.so
Reading symbols from /usr/lib64/ganglia/modload.so...done.
Loaded symbols for /usr/lib64/ganglia/modload.so
Reading symbols from /usr/lib64/ganglia/modmem.so...done.
Loaded symbols for /usr/lib64/ganglia/modmem.so
Reading symbols from /usr/lib64/ganglia/modnet.so...done.