Hello,

 I'm new to the list, and I have what seems to be a difficult problem.
 I maintain a cluster that used to work well.

 After a reboot, which was necessary to install a UPS, I can no longer start
gmond on the server, even though it works fine on the nodes.
 Scientific Linux SL release 5.3 (Boron) is installed on all machines.

 I have the following kernel on all the nodes:
 [root@cn-smpi sbin]# uname -a
 Linux cn-smpi.itim-cj.ro 2.6.18-194.8.1.el5 #1 SMP Thu Jul 1 16:05:53 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

 Below you can see some details:

 [root@cn-smpi sbin]# /sbin/service gmond start
 Starting GANGLIA gmond: [ OK ]
 [root@cn-smpi sbin]# /sbin/service gmond status
 gmond dead but subsys locked
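
 For reference, "dead but subsys locked" is, as far as I understand the
Red Hat style init scripts, reported when no gmond process is found and there
is no pid file, but the lock file /var/lock/subsys/gmond still exists. The
sketch below only mirrors that decision order; the function name and its
parameters are hypothetical, and the real logic lives in the distribution's
init script helpers.

```python
import os


def classify_service(name, pids, run_dir="/var/run", lock_dir="/var/lock/subsys"):
    """Return a status string in the style printed by 'service <name> status'.

    Hypothetical helper: 'pids' is the list of process ids found for the
    daemon (empty if none); the directory defaults are the usual locations
    on Red Hat style systems.
    """
    if pids:
        return "%s (pid %s) is running..." % (name, " ".join(str(p) for p in pids))
    # No process: a leftover pid file is reported first.
    if os.path.exists(os.path.join(run_dir, name + ".pid")):
        return "%s dead but pid file exists" % name
    # No process and no pid file: a leftover lock file gives the message above.
    if os.path.exists(os.path.join(lock_dir, name)):
        return "%s dead but subsys locked" % name
    return "%s is stopped" % name


if __name__ == "__main__":
    print(classify_service("gmond", pids=[]))
```

 In other words, the message says that gmond died after the init script had
already created its lock file, which fits a crash right after startup.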

 It is running on all the nodes:
 [root@cn-smpi sbin]# cexec /sbin/service gmond status
 ************************* cn-sge-1 *************************
 --------- cn-mpi01---------
 gmond (pid 13623) is running...
 --------- cn-mpi02---------
 gmond (pid 11971) is running...

 [root@cn-smpi log]# grep "segfault" messages
 Feb 18 12:42:37 cn-smpi kernel: gmond[14493]: segfault at 0000000000000008 rip 0000003c2d0b54dd rsp 00007ffffde116d8 error 4
 Feb 18 12:51:18 cn-smpi kernel: gmond[14952]: segfault at 0000000000000008 rip 0000003c2d0b54dd rsp 00007fff393f13e8 error 4

 The fault is at rip 0000003c2d0b54dd both times, so the address is stable.
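
 The fields of the kernel message can be decoded a little further. The
sketch below parses one of the lines above and splits the "error" value into
the x86 page-fault bits (bit 0: page was present, bit 1: write access,
bit 2: user mode). The regular expression and the function name are my own
and assume the message layout shown in this log; the error value is treated
as hexadecimal.

```python
import re

# Layout assumed from the log lines above (2.6.18 x86_64 kernel).
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Parse a kernel segfault line; return a dict, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    return {
        "prog": m.group("prog"),
        "pid": int(m.group("pid")),
        "addr": int(m.group("addr"), 16),   # faulting memory address
        "rip": int(m.group("rip"), 16),     # instruction pointer at the fault
        "rsp": int(m.group("rsp"), 16),     # stack pointer at the fault
        "page_present": bool(err & 1),      # bit 0: 0 means page not mapped
        "write": bool(err & 2),             # bit 1: 0 means a read access
        "user_mode": bool(err & 4),         # bit 2: 1 means fault in user space
    }


line = ("Feb 18 12:51:18 cn-smpi kernel: gmond[14952]: segfault at "
        "0000000000000008 rip 0000003c2d0b54dd rsp 00007fff393f13e8 error 4")
info = decode_segfault(line)
print(info)
```

 So "error 4" here means a user-mode read from an unmapped page, and a
fault address of 0x8 would be consistent with reading a field at offset 8
through a NULL pointer, although the log alone does not prove that.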

 I enabled core dumps and debugged gmond to see where the problem is:

 [root@cn-smpi cfloare]# ulimit -a
 core file size (blocks, -c) 0
 data seg size (kbytes, -d) unlimited
 scheduling priority (-e) 0
 file size (blocks, -f) unlimited
 pending signals (-i) 134143
 max locked memory (kbytes, -l) 32
 max memory size (kbytes, -m) unlimited
 open files (-n) 1024
 pipe size (512 bytes, -p) 8
 POSIX message queues (bytes, -q) 819200
 real-time priority (-r) 0
 stack size (kbytes, -s) 10240
 cpu time (seconds, -t) unlimited
 max user processes (-u) 134143
 virtual memory (kbytes, -v) unlimited
 file locks (-x) unlimited
 [root@cn-smpi cfloare]# ulimit -c unlimited
 [root@cn-smpi cfloare]# ulimit -a
 core file size (blocks, -c) unlimited
 data seg size (kbytes, -d) unlimited
 scheduling priority (-e) 0
 file size (blocks, -f) unlimited
 pending signals (-i) 134143
 max locked memory (kbytes, -l) 32
 max memory size (kbytes, -m) unlimited
 open files (-n) 1024
 pipe size (512 bytes, -p) 8
 POSIX message queues (bytes, -q) 819200
 real-time priority (-r) 0
 stack size (kbytes, -s) 10240
 cpu time (seconds, -t) unlimited
 max user processes (-u) 134143
 virtual memory (kbytes, -v) unlimited
 file locks (-x) unlimited
 [root@cn-smpi cfloare]#


 [root@cn-smpi sbin]# ./gmond

 [root@cn-smpi sbin]# ls co
 convertquota core.14952 cossdump

 [root@cn-smpi sbin]# ll core*
 -rw------- 1 root root 2457600 Feb 18 12:51 core.14952

 [root@cn-smpi sbin]# date
 Fri Feb 18 12:52:02 EET 2011

 [root@cn-smpi sbin]# gdb --core=./core.14952 ./gmond
 GNU gdb Fedora (6.8-27.el5)
 Copyright (C) 2008 Free Software Foundation, Inc.
 License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
 This is free software: you are free to change and redistribute it.
 There is NO WARRANTY, to the extent permitted by law. Type "show copying"
 and "show warranty" for details.
 This GDB was configured as "x86_64-redhat-linux-gnu"...
 Reading symbols from /lib64/libresolv.so.2...done.
 Loaded symbols for /lib64/libresolv.so.2
 Reading symbols from /usr/lib64/libganglia-3.1.7.so.0...done.
 Loaded symbols for /usr/lib64/libganglia-3.1.7.so.0
 Reading symbols from /lib64/libdl.so.2...done.
 Loaded symbols for /lib64/libdl.so.2
 Reading symbols from /lib64/libnsl.so.1...done.
 Loaded symbols for /lib64/libnsl.so.1
 Reading symbols from /lib64/libpcre.so.0...done.
 Loaded symbols for /lib64/libpcre.so.0
 Reading symbols from /lib64/libexpat.so.0...done.
 Loaded symbols for /lib64/libexpat.so.0
 Reading symbols from /usr/lib64/libconfuse.so.0...done.
 Loaded symbols for /usr/lib64/libconfuse.so.0
 Reading symbols from /usr/lib64/libapr-1.so.0...done.
 Loaded symbols for /usr/lib64/libapr-1.so.0
 Reading symbols from /lib64/libpthread.so.0...done.
 Loaded symbols for /lib64/libpthread.so.0
 Reading symbols from /lib64/libc.so.6...done.
 Loaded symbols for /lib64/libc.so.6
 Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
 Loaded symbols for /lib64/ld-linux-x86-64.so.2
 Reading symbols from /lib64/libuuid.so.1...done.
 Loaded symbols for /lib64/libuuid.so.1
 Reading symbols from /lib64/libcrypt.so.1...done.
 Loaded symbols for /lib64/libcrypt.so.1
 Reading symbols from /usr/lib64/ganglia/modcpu.so...done.
 Loaded symbols for /usr/lib64/ganglia/modcpu.so
 Reading symbols from /usr/lib64/ganglia/moddisk.so...done.
 Loaded symbols for /usr/lib64/ganglia/moddisk.so
 Reading symbols from /usr/lib64/ganglia/modload.so...done.
 Loaded symbols for /usr/lib64/ganglia/modload.so
 Reading symbols from /usr/lib64/ganglia/modmem.so...done.
 Loaded symbols for /usr/lib64/ganglia/modmem.so
 Reading symbols from /usr/lib64/ganglia/modnet.so...done.
 Loaded symbols for /usr/lib64/ganglia/modnet.so
 Reading symbols from /usr/lib64/ganglia/modproc.so...done.
 Loaded symbols for /usr/lib64/ganglia/modproc.so
 Reading symbols from /usr/lib64/ganglia/modsys.so...done.
 Loaded symbols for /usr/lib64/ganglia/modsys.so
 Reading symbols from /usr/lib64/ganglia/modpython.so...done.
 Loaded symbols for /usr/lib64/ganglia/modpython.so
 Reading symbols from /usr/lib64/libpython2.4.so.1.0...done.
 Loaded symbols for /usr/lib64/libpython2.4.so.1.0
 Reading symbols from /lib64/libutil.so.1...done.
 Loaded symbols for /lib64/libutil.so.1
 Reading symbols from /lib64/libm.so.6...done.
 Loaded symbols for /lib64/libm.so.6
 Core was generated by `./gmond'.
 Program terminated with signal 11, Segmentation fault.
 [New process 14952]
 #0 0x0000003c2d0b54dd in PySys_GetObject () from /usr/lib64/libpython2.4.so.1.0
 (gdb)
 (gdb) bt
 #0 0x0000003c2d0b54dd in PySys_GetObject () from /usr/lib64/libpython2.4.so.1.0
 #1 0x00002af923058d3a in pyth_metric_init (p=<value optimized out>) at mod_python.c:576
 #2 0x0000000000403df3 in setup_metric_callbacks () at gmond.c:1962
 #3 0x0000000000407738 in main (argc=<value optimized out>, argv=0x7fff393f1f68) at gmond.c:2842
 (gdb) q



 The crash is in the library /usr/lib64/libpython2.4.so.1.0, at the same rip
address 0x0000003c2d0b54dd as in the kernel log.

 I checked the python version on the server and on the nodes:

 [root@cn-smpi sbin]# rpm -qa python
 python-2.4.3-24.el5_3.6.x86_64

 [root@cn-smpi sbin]# cexec rpm -qa python
 ************************* cn-sge-1 *************************
 --------- cn-mpi01---------
 python-2.4.3-43.el5.x86_64
 --------- cn-mpi02---------
 python-2.4.3-43.el5.x86_64

 The versions differ, so I updated python on the server:
 [root@cn-smpi sbin]# yum update python
 [root@cn-smpi sbin]# rpm -qa python
 python-2.4.3-43.el5.x86_64


 I tried to start gmond again, and it crashed again:
 [root@cn-smpi log]# grep "segfault" messages
 Feb 18 12:42:37 cn-smpi kernel: gmond[14493]: segfault at 0000000000000008 rip 0000003c2d0b54dd rsp 00007ffffde116d8 error 4
 Feb 18 12:51:18 cn-smpi kernel: gmond[14952]: segfault at 0000000000000008 rip 0000003c2d0b54dd rsp 00007fff393f13e8 error 4
 Feb 18 13:34:18 cn-smpi kernel: gmond[17307]: segfault at 0000000000000008 rip 00002b83b78727ad rsp 00007fff696763f8 error 4


 Now the fault is at a different address, rip 00002b83b78727ad.

 I debugged again:
 [root@cn-smpi sbin]# gdb --core=./core.17533 ./gmond
 GNU gdb Fedora (6.8-27.el5)
 ....
 Reading symbols from /usr/lib64/libpython2.4.so.1.0...done.
 Loaded symbols for /usr/lib64/libpython2.4.so.1.0
 Reading symbols from /lib64/libutil.so.1...done.
 Loaded symbols for /lib64/libutil.so.1
 Reading symbols from /lib64/libm.so.6...done.
 Loaded symbols for /lib64/libm.so.6
 Core was generated by `./gmond'.
 Program terminated with signal 11, Segmentation fault.
 [New process 17533]
 #0 0x00002b6ac73047ad in PySys_GetObject () from /usr/lib64/libpython2.4.so.1.0
 (gdb) bt
 #0 0x00002b6ac73047ad in PySys_GetObject () from /usr/lib64/libpython2.4.so.1.0
 #1 0x00002b6ac704ad3a in pyth_metric_init (p=<value optimized out>) at mod_python.c:576
 #2 0x0000000000403df3 in setup_metric_callbacks () at gmond.c:1962
 #3 0x0000000000407738 in main (argc=<value optimized out>, argv=0x7fff230a9608) at gmond.c:2842
 (gdb) q

 Now the fault is at yet another address, rip 00002b6ac73047ad.

 I deleted the gmond lock file in /var/lock/subsys and tried to start gmond
again with /sbin/service gmond start, but it crashed again:

 [root@cn-smpi sbin]# grep "segfault" /var/log/messages
 Feb 18 12:42:37 cn-smpi kernel: gmond[14493]: segfault at 0000000000000008 rip 0000003c2d0b54dd rsp 00007ffffde116d8 error 4
 Feb 18 12:51:18 cn-smpi kernel: gmond[14952]: segfault at 0000000000000008 rip 0000003c2d0b54dd rsp 00007fff393f13e8 error 4
 Feb 18 13:34:18 cn-smpi kernel: gmond[17307]: segfault at 0000000000000008 rip 00002b83b78727ad rsp 00007fff696763f8 error 4
 Feb 18 13:38:05 cn-smpi kernel: gmond[17533]: segfault at 0000000000000008 rip 00002b6ac73047ad rsp 00007fff230a8a88 error 4
 Feb 18 13:46:40 cn-smpi kernel: gmond[17948]: segfault at 0000000000000008 rip 00002ac563d7b7ad rsp 00007fff11011d98 error 4


 As you can see, the rip was stable before the python update, but now it
changes on every run, and the crash still seems to be in
/usr/lib64/libpython2.4.so.1.0.
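
 A changing rip does not necessarily mean a different crash site. If the
updated library is simply mapped at a different base address on each run
(this is an assumption on my part, for example because of address
randomization), the offset within a page should stay the same. The sketch
below compares the low 12 bits of the three rip values logged after the
update, assuming 4 KiB pages and a page-aligned library mapping.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual size on x86_64


def page_offset(addr):
    """Return the offset of an address within its page."""
    return addr & (PAGE_SIZE - 1)


# rip values from the three segfault lines logged after the python update
rips_after_update = [
    0x00002b83b78727ad,
    0x00002b6ac73047ad,
    0x00002ac563d7b7ad,
]

offsets = set(hex(page_offset(rip)) for rip in rips_after_update)
print(offsets)
```

 All three values share the same page offset (0x7ad), which fits the two gdb
backtraces both ending in PySys_GetObject: it looks like the same
instruction each time, just at a different load address.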

 I set debug_level to 100 in /usr/ganglia/gmond.conf and tried to start
gmond again:
 [root@cn-smpi ganglia]# /sbin/service gmond start
 Starting GANGLIA gmond: loaded module: core_metrics
 loaded module: cpu_module
 loaded module: disk_module
 loaded module: load_module
 loaded module: mem_module
 loaded module: net_module
 loaded module: proc_module
 loaded module: sys_module
 loaded module: python_module
 loaded module: python_module
 /bin/bash: line 1: 20041 Segmentation fault /usr/sbin/gmond
 [FAILED]

 Running gmond in debug mode produced:
 [root@cn-smpi sbin]# ./gmond -d 3
 loaded module: core_metrics
 loaded module: cpu_module
 loaded module: disk_module
 loaded module: load_module
 loaded module: mem_module
 loaded module: net_module
 loaded module: proc_module
 loaded module: sys_module
 loaded module: python_module
 loaded module: python_module
 Segmentation fault (core dumped)
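
 One detail in this output: python_module is reported as loaded twice, right
before the crash. Whether that is related to the segfault is not established
here; it may be worth checking whether the module is declared more than once
in the configuration, but that is only a guess. The sketch below just flags
repeated "loaded module:" lines in such output; the function name is
hypothetical.

```python
from collections import Counter


def repeated_modules(output):
    """Return the sorted names of modules reported as loaded more than once."""
    names = [
        line.split(":", 1)[1].strip()
        for line in output.splitlines()
        if line.strip().startswith("loaded module:")
    ]
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


debug_output = """\
loaded module: core_metrics
loaded module: cpu_module
loaded module: disk_module
loaded module: load_module
loaded module: mem_module
loaded module: net_module
loaded module: proc_module
loaded module: sys_module
loaded module: python_module
loaded module: python_module
Segmentation fault (core dumped)
"""

print(repeated_modules(debug_output))
```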

 The shared library dependencies are:
 [root@cn-smpi sbin]# ldd /usr/sbin/gmond
 libresolv.so.2 => /lib64/libresolv.so.2 (0x0000003c30000000)
 libganglia-3.1.7.so.0 => /usr/lib64/libganglia-3.1.7.so.0 (0x00002b42fb403000)
 libdl.so.2 => /lib64/libdl.so.2 (0x0000003c2c400000)
 libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003c2f000000)
 libpcre.so.0 => /lib64/libpcre.so.0 (0x0000003c32c00000)
 libexpat.so.0 => /lib64/libexpat.so.0 (0x0000003c31c00000)
 libconfuse.so.0 => /usr/lib64/libconfuse.so.0 (0x0000003c2dc00000)
 libapr-1.so.0 => /usr/lib64/libapr-1.so.0 (0x0000003c2d800000)
 libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003c2cc00000)
 libc.so.6 => /lib64/libc.so.6 (0x0000003c2c000000)
 /lib64/ld-linux-x86-64.so.2 (0x0000003c2bc00000)
 libuuid.so.1 => /lib64/libuuid.so.1 (0x0000003c34800000)
 libcrypt.so.1 => /lib64/libcrypt.so.1 (0x0000003c2f800000)

 If anybody understands what is happening, please help me.
 I like Ganglia a lot, and it used to work well. I don't understand why this
error occurs.

 Thank you very much,
 Calin
_______________________________________________
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers