Ganglia 3.1.0 on CentOS 4. Earlier today one of my clusters stopped reporting. The grid server logged these to syslog:

/usr/sbin/gmetad[12271]: poll() timeout for [Web] data source after 0 bytes read

I checked that gmond was running on that host, and it was. However, attempts to connect to its port 8649 would indeed time out. I tried to see what it was doing and got:

# strace -p 16830
Process 16830 attached - interrupt to quit
write(7, "<EXTRA_DATA>\n", 13 <unfinished ...>
Process 16830 detached

... I had to ^C after a minute. I captured lsof output, then restarted gmond, and it started working. Here's the lsof output:

# lsof | grep gmond
gmond 16830 ganglia cwd DIR 8,3 4096 2 /
gmond 16830 ganglia rtd DIR 8,3 4096 2 /
gmond 16830 ganglia txt REG 8,3 62688 3446145 /usr/sbin/gmond
gmond 16830 ganglia mem REG 8,3 48517056 2084173 /usr/lib/locale/locale-archive
gmond 16830 ganglia mem REG 8,3 64872 2279574 /usr/lib64/ganglia/modcpu.so
gmond 16830 ganglia mem REG 8,3 62512 2279575 /usr/lib64/ganglia/moddisk.so
gmond 16830 ganglia mem REG 8,3 62480 2279576 /usr/lib64/ganglia/modload.so
gmond 16830 ganglia mem REG 8,3 63720 2279577 /usr/lib64/ganglia/modmem.so
gmond 16830 ganglia mem REG 8,3 62824 2279579 /usr/lib64/ganglia/modnet.so
gmond 16830 ganglia mem REG 8,3 62224 2279580 /usr/lib64/ganglia/modproc.so
gmond 16830 ganglia mem REG 8,3 63432 2279581 /usr/lib64/ganglia/modsys.so
gmond 16830 ganglia mem REG 8,3 56902 966686 /lib64/libnss_files-2.3.4.so
gmond 16830 ganglia mem REG 8,3 105080 966890 /lib64/ld-2.3.4.so
gmond 16830 ganglia mem REG 8,3 1493409 966891 /lib64/tls/libc-2.3.4.so
gmond 16830 ganglia mem REG 8,3 11784 966731 /lib64/libuuid.so.1.2
gmond 16830 ganglia mem REG 8,3 17943 966893 /lib64/libdl-2.3.4.so
gmond 16830 ganglia mem REG 8,3 106203 966894 /lib64/tls/libpthread-2.3.4.so
gmond 16830 ganglia mem REG 8,3 91412 966660 /lib64/libresolv-2.3.4.so
gmond 16830 ganglia mem REG 8,3 30070 966906 /lib64/libcrypt-2.3.4.so
gmond 16830 ganglia mem REG 8,3 143336 3801241 /usr/lib64/libexpat.so.0.5.0
gmond 16830 ganglia mem REG 8,3 107187 966901 /lib64/libnsl-2.3.4.so
gmond 16830 ganglia mem REG 8,3 88824 3801183 /usr/lib64/libganglia-3.1.0.so.0.0.0
gmond 16830 ganglia mem REG 8,3 171976 3801192 /usr/lib64/libapr-1.so.0.3.2
gmond 16830 ganglia mem REG 8,3 46392 3801185 /usr/lib64/libconfuse.so.0.0.0
gmond 16830 ganglia 0r CHR 1,3 1977 /dev/null
gmond 16830 ganglia 1w CHR 1,3 1977 /dev/null
gmond 16830 ganglia 2w CHR 1,3 1977 /dev/null
gmond 16830 ganglia 3r 0000 0,8 0 2039262 eventpoll
gmond 16830 ganglia 4u IPv4 2039264 UDP 239.192.0.127:8649
gmond 16830 ganglia 5u IPv4 2039266 TCP *:8649 (LISTEN)

Anyone seen this? Any clues as to what might have put it in this state?

P.S. gmetad should've fallen back on another data source, but that's another email thread that we've already had :)

-- 
Cos
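
For anyone who wants to check for the same symptom without waiting on gmetad's syslog messages, here is a rough sketch of a probe that connects to gmond's TCP port and reports whether any data comes back before a timeout. The function name `probe_gmond` and the status strings are made up for illustration, and the 5 second default timeout is an arbitrary assumption. It only assumes that a healthy gmond writes its XML to the connection and then closes it.

```python
import socket


def probe_gmond(host, port=8649, timeout=5.0):
    """Connect to a gmond TCP port and read until the peer closes.

    Returns a (status, nbytes) tuple. The status strings are this
    sketch's own convention, not anything defined by Ganglia:
      "ok"      - data was received and the peer closed the connection
      "empty"   - the peer closed the connection without sending data
      "timeout" - connect or read did not complete within `timeout`
      "error: ..." - any other socket error (for example, refused)
    """
    total = 0
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # Apply the same timeout to every recv() call.
            sock.settimeout(timeout)
            while True:
                chunk = sock.recv(65536)
                if not chunk:
                    # Peer closed the connection: the dump is complete.
                    return ("ok" if total else "empty", total)
                total += len(chunk)
    except socket.timeout:
        # Matches the "timeout ... after N bytes read" style of symptom.
        return ("timeout", total)
    except OSError as exc:
        return ("error: %s" % exc, total)
```

Calling `probe_gmond("web-node-01")` (a hypothetical hostname) against a daemon stuck the way mine was would presumably return `("timeout", 0)`, since the port still accepts connections but nothing is ever written to them.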

