Hello,

I'm new on the list, but I have what I suppose is a difficult problem. I
maintain a cluster which used to work nicely. After a reboot, made necessary by
the installation of a UPS, I cannot start gmond on the server, although it
works well on the nodes. Scientific Linux SL release 5.3 (Boron) is installed
on all machines. The kernel on all the nodes is:

[root@cn-smpi sbin]# uname -a
Linux cn-smpi.itim-cj.ro 2.6.18-194.8.1.el5 #1 SMP Thu Jul 1 16:05:53 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

Below you can see some details:

[root@cn-smpi sbin]# /sbin/service gmond start
Starting GANGLIA gmond: [ OK ]
[root@cn-smpi sbin]# /sbin/service gmond status
gmond dead but subsys locked

On all the nodes it is working:

[root@cn-smpi sbin]# cexec /sbin/service gmond status
************************* cn-sge-1 *************************
--------- cn-mpi01---------
gmond (pid 13623) is running...
--------- cn-mpi02---------
gmond (pid 11971) is running...

[root@cn-smpi log]# grep "segfault" messages
Feb 18 12:42:37 cn-smpi kernel: gmond[14493]: segfault at 0000000000000008 rip 0000003c2d0b54dd rsp 00007ffffde116d8 error 4
Feb 18 12:51:18 cn-smpi kernel: gmond[14952]: segfault at 0000000000000008 rip 0000003c2d0b54dd rsp 00007fff393f13e8 error 4

As I can see, the segfault is at rip 0000003c2d0b54dd both times, so it is
stable. I debugged it to see where the problem is:

[root@cn-smpi cfloare]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 134143
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 134143
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
[root@cn-smpi cfloare]# ulimit -c unlimited
[root@cn-smpi cfloare]# ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 134143
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 134143
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
[root@cn-smpi cfloare]#

[root@cn-smpi sbin]# ./gmond
[root@cn-smpi sbin]# ls co
convertquota  core.14952  cossdump
[root@cn-smpi sbin]# ll core*
-rw------- 1 root root 2457600 Feb 18 12:51 core.14952
[root@cn-smpi sbin]# date
Fri Feb 18 12:52:02 EET 2011
[root@cn-smpi sbin]# gdb --core=./core.14952 ./gmond
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...
Reading symbols from /lib64/libresolv.so.2...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /usr/lib64/libganglia-3.1.7.so.0...done.
Loaded symbols for /usr/lib64/libganglia-3.1.7.so.0
Reading symbols from /lib64/libdl.so.2...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libnsl.so.1...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /lib64/libpcre.so.0...done.
Loaded symbols for /lib64/libpcre.so.0
Reading symbols from /lib64/libexpat.so.0...done.
Loaded symbols for /lib64/libexpat.so.0
Reading symbols from /usr/lib64/libconfuse.so.0...done.
Loaded symbols for /usr/lib64/libconfuse.so.0
Reading symbols from /usr/lib64/libapr-1.so.0...done.
Loaded symbols for /usr/lib64/libapr-1.so.0
Reading symbols from /lib64/libpthread.so.0...done.
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libuuid.so.1...done.
Loaded symbols for /lib64/libuuid.so.1
Reading symbols from /lib64/libcrypt.so.1...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /usr/lib64/ganglia/modcpu.so...done.
Loaded symbols for /usr/lib64/ganglia/modcpu.so
Reading symbols from /usr/lib64/ganglia/moddisk.so...done.
Loaded symbols for /usr/lib64/ganglia/moddisk.so
Reading symbols from /usr/lib64/ganglia/modload.so...done.
Loaded symbols for /usr/lib64/ganglia/modload.so
Reading symbols from /usr/lib64/ganglia/modmem.so...done.
Loaded symbols for /usr/lib64/ganglia/modmem.so
Reading symbols from /usr/lib64/ganglia/modnet.so...done.
Loaded symbols for /usr/lib64/ganglia/modnet.so
Reading symbols from /usr/lib64/ganglia/modproc.so...done.
Loaded symbols for /usr/lib64/ganglia/modproc.so
Reading symbols from /usr/lib64/ganglia/modsys.so...done.
Loaded symbols for /usr/lib64/ganglia/modsys.so
Reading symbols from /usr/lib64/ganglia/modpython.so...done.
Loaded symbols for /usr/lib64/ganglia/modpython.so
Reading symbols from /usr/lib64/libpython2.4.so.1.0...done.
Loaded symbols for /usr/lib64/libpython2.4.so.1.0
Reading symbols from /lib64/libutil.so.1...done.
Loaded symbols for /lib64/libutil.so.1
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib64/libm.so.6
Core was generated by `./gmond'.
Program terminated with signal 11, Segmentation fault.
[New process 14952]
#0  0x0000003c2d0b54dd in PySys_GetObject () from /usr/lib64/libpython2.4.so.1.0
(gdb)
(gdb) bt
#0  0x0000003c2d0b54dd in PySys_GetObject () from /usr/lib64/libpython2.4.so.1.0
#1  0x00002af923058d3a in pyth_metric_init (p=<value optimized out>) at mod_python.c:576
#2  0x0000000000403df3 in setup_metric_callbacks () at gmond.c:1962
#3  0x0000000000407738 in main (argc=<value optimized out>, argv=0x7fff393f1f68) at gmond.c:2842
(gdb) q

I can see that the error is in the library /usr/lib64/libpython2.4.so.1.0, at
the same rip address 0x0000003c2d0b54dd as before. I checked the version of
python on the server and on the nodes:

[root@cn-smpi sbin]# rpm -qa python
python-2.4.3-24.el5_3.6.x86_64
[root@cn-smpi sbin]# cexec rpm -qa python
************************* cn-sge-1 *************************
--------- cn-mpi01---------
python-2.4.3-43.el5.x86_64
--------- cn-mpi02---------
python-2.4.3-43.el5.x86_64

As I can see, there is a difference, so I updated python on the server:

[root@cn-smpi sbin]# yum update python
[root@cn-smpi sbin]# rpm -qa python
python-2.4.3-43.el5.x86_64

I tried to start gmond again and:

[root@cn-smpi log]# grep "segfault" messages
Feb 18 12:42:37 cn-smpi kernel: gmond[14493]: segfault at 0000000000000008 rip 0000003c2d0b54dd rsp 00007ffffde116d8 error 4
Feb 18 12:51:18 cn-smpi kernel: gmond[14952]: segfault at 0000000000000008 rip 0000003c2d0b54dd rsp 00007fff393f13e8 error 4
Feb 18 13:34:18 cn-smpi kernel: gmond[17307]: segfault at 0000000000000008 rip 00002b83b78727ad rsp 00007fff696763f8 error 4

Now the segfault is in another place, at rip 00002b83b78727ad. I debugged
again:

[root@cn-smpi sbin]# gdb --core=./core.17533 ./gmond
GNU gdb Fedora (6.8-27.el5)
....
Reading symbols from /usr/lib64/libpython2.4.so.1.0...done.
Loaded symbols for /usr/lib64/libpython2.4.so.1.0
Reading symbols from /lib64/libutil.so.1...done.
Loaded symbols for /lib64/libutil.so.1
Reading symbols from /lib64/libm.so.6...done.
Loaded symbols for /lib64/libm.so.6
Core was generated by `./gmond'.
Program terminated with signal 11, Segmentation fault.
[New process 17533]
#0  0x00002b6ac73047ad in PySys_GetObject () from /usr/lib64/libpython2.4.so.1.0
(gdb) bt
#0  0x00002b6ac73047ad in PySys_GetObject () from /usr/lib64/libpython2.4.so.1.0
#1  0x00002b6ac704ad3a in pyth_metric_init (p=<value optimized out>) at mod_python.c:576
#2  0x0000000000403df3 in setup_metric_callbacks () at gmond.c:1962
#3  0x0000000000407738 in main (argc=<value optimized out>, argv=0x7fff230a9608) at gmond.c:2842
(gdb) q

Now the error is in a different place, at rip 00002b6ac73047ad. I deleted the
gmond file in the /var/lock/subsys folder and tried to start gmond again using
/sbin/service gmond start, and it crashed again:

[root@cn-smpi sbin]# grep "segfault" /var/log/messages
Feb 18 12:42:37 cn-smpi kernel: gmond[14493]: segfault at 0000000000000008 rip 0000003c2d0b54dd rsp 00007ffffde116d8 error 4
Feb 18 12:51:18 cn-smpi kernel: gmond[14952]: segfault at 0000000000000008 rip 0000003c2d0b54dd rsp 00007fff393f13e8 error 4
Feb 18 13:34:18 cn-smpi kernel: gmond[17307]: segfault at 0000000000000008 rip 00002b83b78727ad rsp 00007fff696763f8 error 4
Feb 18 13:38:05 cn-smpi kernel: gmond[17533]: segfault at 0000000000000008 rip 00002b6ac73047ad rsp 00007fff230a8a88 error 4
Feb 18 13:46:40 cn-smpi kernel: gmond[17948]: segfault at 0000000000000008 rip 00002ac563d7b7ad rsp 00007fff11011d98 error 4

As you can see, before updating python the rip was stable, but now it is
fluctuating ... and the error seems to be in /usr/lib64/libpython2.4.so.1.0
again.

I changed the debug_level in /usr/ganglia/gmond.conf to 100 and tried to start
it again:

[root@cn-smpi ganglia]# /sbin/service gmond start
Starting GANGLIA gmond:
loaded module: core_metrics
loaded module: cpu_module
loaded module: disk_module
loaded module: load_module
loaded module: mem_module
loaded module: net_module
loaded module: proc_module
loaded module: sys_module
loaded module: python_module
loaded module: python_module
/bin/bash: line 1: 20041 Segmentation fault /usr/sbin/gmond
[FAILED]

Running gmond in debug mode produced:

[root@cn-smpi sbin]# ./gmond -d 3
loaded module: core_metrics
loaded module: cpu_module
loaded module: disk_module
loaded module: load_module
loaded module: mem_module
loaded module: net_module
loaded module: proc_module
loaded module: sys_module
loaded module: python_module
loaded module: python_module
Segmentation fault (core dumped)

The shared library dependencies are:

[root@cn-smpi sbin]# ldd /usr/sbin/gmond
libresolv.so.2 => /lib64/libresolv.so.2 (0x0000003c30000000)
libganglia-3.1.7.so.0 => /usr/lib64/libganglia-3.1.7.so.0 (0x00002b42fb403000)
libdl.so.2 => /lib64/libdl.so.2 (0x0000003c2c400000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003c2f000000)
libpcre.so.0 => /lib64/libpcre.so.0 (0x0000003c32c00000)
libexpat.so.0 => /lib64/libexpat.so.0 (0x0000003c31c00000)
libconfuse.so.0 => /usr/lib64/libconfuse.so.0 (0x0000003c2dc00000)
libapr-1.so.0 => /usr/lib64/libapr-1.so.0 (0x0000003c2d800000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003c2cc00000)
libc.so.6 => /lib64/libc.so.6 (0x0000003c2c000000)
/lib64/ld-linux-x86-64.so.2 (0x0000003c2bc00000)
libuuid.so.1 => /lib64/libuuid.so.1 (0x0000003c34800000)
libcrypt.so.1 => /lib64/libcrypt.so.1 (0x0000003c2f800000)

If somebody understands what is happening, please help me. I like ganglia a
lot, and it used to work nicely. I don't understand why this error occurs.

Thank you very much,
Calin
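
Note on the rip values in the segfault lines above: shared libraries are
mapped at page-aligned addresses, so with 4 KiB pages (the usual size on
x86_64) the low 12 bits of an address inside a library do not depend on where
the library was loaded. The two crashes before the python update end in 4dd
and the three after it end in 7ad, which would be consistent with one and the
same instruction faulting at different load addresses rather than a different
fault each time. A minimal sketch for pulling these offsets out of the log,
assuming the "rip <hex> rsp" line format shown above; the function name
rip_page_offsets is made up for illustration:

```shell
# Read kernel "segfault" lines on standard input and print, for each one,
# the faulting rip followed by its last three hex digits (the low 12 bits,
# i.e. the offset inside a 4 KiB page).
# Assumption: lines look like "... segfault at <addr> rip <hex> rsp <hex> ...".
# Hypothetical usage: grep segfault /var/log/messages | rip_page_offsets
rip_page_offsets() {
    # Keep only the hex value between " rip " and " rsp "; skip other lines.
    sed -n 's/.* rip \([0-9a-fA-F]*\) rsp .*/\1/p' |
    while read -r rip; do
        # Strip everything except the last three characters of the address.
        offset=${rip#"${rip%???}"}
        printf '%s %s\n' "$rip" "$offset"
    done
}
```

If all lines from one library version show the same offset, the crashes are
probably at the same place in the code, even though the full rip differs from
run to run.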
_______________________________________________
Ganglia-general mailing list
https://lists.sourceforge.net/lists/listinfo/ganglia-general