Hi, I am working on a cluster with 752 nodes at the University of Freiburg (Germany).
In this setup, gmetad crashes with a segfault:
/var/log/messages:
Aug 16 20:07:47 monitor kernel: gmetad[38792]: segfault at 0 ip 00007f9b5122d82c sp 00007f9b38a31af0 error 4 in libganglia.so.0.0.0[7f9b51222000+14000]
Aug 16 20:07:47 monitor systemd: gmetad.service: main process exited, code=killed, status=11/SEGV
System: CentOS Linux release 7.2.1511
Ganglia versions:
ganglia-web-3.7.1-2.el7.x86_64
ganglia-3.7.2-2.el7.x86_64
ganglia-gmond-3.7.2-2.el7.x86_64
ganglia-debuginfo-3.7.2-2.el7.x86_64
ganglia-gmetad-3.7.2-2.el7.x86_64
The Ganglia configuration files are attached to this email.
The crash always happens when gmetad's cleanup thread removes hosts that have disappeared:
$ journalctl -u gmetad.service
...
Aug 23 16:58:30 monitor gmetad: Updating host n4262.nemo.privat, metric disk_total
Aug 23 16:58:30 monitor gmetad: Updating host n4262.nemo.privat, metric mem_shared
Aug 23 16:59:00 monitor journal: Suppressed 48289 messages from /system.slice/gmetad.service
Aug 23 16:59:00 monitor gmetad: Cleanup thread running...
Aug 23 16:59:00 monitor gmetad: Cleanup deleting host "n4385.nemo.privat"
Aug 23 16:59:00 monitor gmetad: Cleanup deleting host "n4385.nemo.privat"
Aug 23 16:59:00 monitor kernel: gmetad[68347]: segfault at 160 ip 00007ffb0e9de9a6 sp 00007ffaf6b3da60 error 4 in libganglia.so.0.0.0[7ffb0e9d2000+16000]
Aug 23 16:59:00 monitor systemd: gmetad.service: main process exited, code=killed, status=11/SEGV
Aug 23 16:59:00 monitor systemd: Unit gmetad.service entered failed state.
I started gmetad with gdb to get more information:
gdb /usr/sbin/gmetad
(gdb) run -d 10 -c /etc/ganglia/gmetad.conf
The crash looks like this:
...
Writing Root Summary data for metric mem_shared
Writing Root Summary data for metric proc_run
Cleanup thread running...
Cleanup deleting host "n4527.nemo.privat"
Cleanup deleting host "n4527.nemo.privat"
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdfb00700 (LWP 175149)]
0x00007ffff799f82c in hash_key (seed=0, len=<optimized out>, key=<optimized out>)
    at hash.c:182
182             seed ^= (uint64_t)*bp++;
Additional gdb information:
(gdb) where
#0  0x00007ffff799f82c in hash_key (seed=0, len=<optimized out>, key=<optimized out>)
    at hash.c:182
#1  hashval (hash=0x7fffd9a5f290, key=<optimized out>) at hash.c:195
#2  hash_delete (key=<optimized out>, hash=hash@entry=0x7fffd9a5f290) at hash.c:335
#3  0x00007ffff799f927 in hash_destroy (hash=0x7fffd9a5f290) at hash.c:145
#4  0x000000000040f35c in cleanup_source (key=0x7fffd92140d0, val=0x7fffd92143d0, arg=0x7fffdfaffc10)
    at cleanup.c:170
#5  0x00007ffff799f9d9 in hash_walkfrom (hash=0x62ffc0, from=<optimized out>, func=0x40f219 <cleanup_source>, arg=0x7fffdfaffc10)
    at hash.c:402
#6  0x000000000040f50b in cleanup_thread (arg=0x0) at cleanup.c:206
#7  0x00007ffff635ddc5 in start_thread (arg=0x7fffdfb00700) at pthread_create.c:308
#8  0x00007ffff608aced in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
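Frames #0 to #3 show that hash_destroy tears the per-source host table down by calling hash_delete for each entry, and hash_delete recomputes the key's hash (hashval, then hash_key). So during teardown every key's bytes are read again, and a key whose data pointer is bad faults inside hash_key rather than somewhere else. For comparison only, here is a sketch of a hypothetical minimal chained hash table (struct table, table_new, table_insert, table_destroy are illustrative names, not Ganglia's API) whose destroy walks the bucket chains directly and frees keys without re-hashing them. It does not explain or fix why a key was invalid in the first place.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical minimal chained hash table for illustration only;
 * this is NOT Ganglia's hash_t from lib/hash.c. */
struct node {
    char *key;
    struct node *next;
};

struct table {
    size_t nbuckets;
    struct node **buckets;
};

/* FNV-1a over a NUL-terminated key, reduced to a bucket index. */
static size_t bucket_of(const struct table *t, const char *key)
{
    uint64_t h = 0xcbf29ce484222325ULL;
    for (const unsigned char *p = (const unsigned char *)key; *p; p++) {
        h ^= *p;
        h *= 0x100000001b3ULL;
    }
    return (size_t)(h % t->nbuckets);
}

static struct table *table_new(size_t nbuckets)
{
    struct table *t = malloc(sizeof *t);
    if (!t)
        return NULL;
    t->nbuckets = nbuckets;
    t->buckets = calloc(nbuckets, sizeof *t->buckets);
    if (!t->buckets) {
        free(t);
        return NULL;
    }
    return t;
}

/* Inserts a private copy of key at the head of its bucket chain. */
static int table_insert(struct table *t, const char *key)
{
    struct node *n = malloc(sizeof *n);
    if (!n)
        return -1;
    n->key = malloc(strlen(key) + 1);
    if (!n->key) {
        free(n);
        return -1;
    }
    strcpy(n->key, key);
    size_t b = bucket_of(t, key);
    n->next = t->buckets[b];
    t->buckets[b] = n;
    return 0;
}

/* Tears the table down by walking each chain directly: keys are freed
 * but never hashed again. Returns the number of nodes freed. */
static size_t table_destroy(struct table *t)
{
    size_t freed = 0;
    for (size_t i = 0; i < t->nbuckets; i++) {
        struct node *n = t->buckets[i];
        while (n) {
            struct node *next = n->next; /* save before freeing n */
            free(n->key);
            free(n);
            freed++;
            n = next;
        }
    }
    free(t->buckets);
    free(t);
    return freed;
}
```

With this layout, destroying a table holding three host names frees exactly three nodes and never calls bucket_of during teardown.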
(gdb) list
177         unsigned char *be = bp + len;   /* beyond end of buffer */
178
179         /* FNV-1a hash; assume we have stdint.h available */
180         while (bp < be) {
181             /* xor the bottom with the current octet */
182             seed ^= (uint64_t)*bp++;
183             /* multiply by the 64 bit FNV magic prime mod 2^64 */
184             seed *= FNV_64_PRIME;
185         }
186
(gdb) print bp
$1 = (unsigned char *) 0x1 <Address 0x1 out of bounds>
(gdb) print be
$2 = (unsigned char *) 0x161 <Address 0x161 out of bounds>
(gdb) print seed
$3 = 0
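The loop at hash.c:180-185 is ordinary FNV-1a and reads every byte in [bp, be). The printed values (bp = 0x1, be = 0x161, i.e. 0x160 = 352 bytes still to read) place the whole range in the unmapped page near NULL, which suggests the key data pointer handed to hash_key was already invalid before hashing started, rather than the loop itself misbehaving. A standalone sketch of the same loop, assuming FNV_64_PRIME is the standard 64-bit FNV prime 0x100000001b3 (its value is not shown in the listing), lets the hashing logic be checked in isolation; fnv1a64 is a hypothetical name, not Ganglia's hash_key.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard 64-bit FNV prime; assumed to match Ganglia's FNV_64_PRIME. */
#define FNV_64_PRIME ((uint64_t)0x100000001b3ULL)

/* Standard FNV-1a 64-bit offset basis. The call in the backtrace
 * passes seed=0 instead of this value. */
#define FNV1A_64_INIT ((uint64_t)0xcbf29ce484222325ULL)

/* Same loop as hash.c:177-185. Every byte in [key, key + len) is
 * dereferenced, so a key pointer near NULL faults on the first read. */
static uint64_t fnv1a64(uint64_t seed, const void *key, size_t len)
{
    const unsigned char *bp = key;      /* start of buffer */
    const unsigned char *be = bp + len; /* beyond end of buffer */

    while (bp < be) {
        seed ^= (uint64_t)*bp++; /* xor the bottom with the current octet */
        seed *= FNV_64_PRIME;    /* multiply by the FNV prime mod 2^64 */
    }
    return seed;
}
```

With the standard offset basis, fnv1a64(FNV1A_64_INIT, "a", 1) gives the published FNV-1a test vector 0xaf63dc4c8601ec8c; with seed 0, as in the crashing call, the same input gives 0x61 * 0x100000001b3 = 0x61000000a4d3.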
best regards,
Konrad Meier
Attachment: ganglia-conf.tar.gz (application/gzip)
------------------------------------------------------------------------------
_______________________________________________
Ganglia-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
