Caught one of these again. Interesting that they usually seem to happen on one cluster, and not on any of our other clusters running the same OS & Ganglia version. Something about the app mix running on this cluster (Apache web servers) may contribute to triggering the problem...
Anyway, on the grid server we get a bunch of:

    Nov 3 13:07:01 gridserver /usr/sbin/gmetad[16299]: poll() timeout for [Web] data source after 0 bytes read

On the first node listed in the data_source line, gmond is running. lsof shows:

    ...
    gmond 3922 ganglia mem REG  8,3 143336 20331194 /usr/lib64/libexpat.so.0.5.0
    gmond 3922 ganglia mem REG  8,3 110248 10371175 /lib64/libnsl-2.3.4.so
    gmond 3922 ganglia 0r  CHR  1,3 2223 /dev/null
    gmond 3922 ganglia 1w  CHR  1,3 2223 /dev/null
    gmond 3922 ganglia 2w  CHR  1,3 2223 /dev/null
    gmond 3922 ganglia 3r  0000 0,8 0 7554 eventpoll
    gmond 3922 ganglia 4u  IPv4 7556 UDP 239.192.0.127:8649
    gmond 3922 ganglia 5u  IPv4 7562 TCP *:8649 (LISTEN)
    gmond 3922 ganglia 6u  IPv4 7564 UDP nodename:32769->239.192.0.127:8649
    gmond 3922 ganglia 7u  IPv4 172429044 TCP nodename:8649->gridserver:48675 (CLOSE_WAIT)

strace shows:

    $ strace -p 3922
    Process 3922 attached - interrupt to quit
    write(7, "<EXTRA_ELEMENT NAME=\"DESC\" VAL=\""..., 59

... and there it waits forever. Certainly longer than the timeout for a CLOSE_WAIT connection.

Probably not useful, but just in case, ps (both formats) shows:

    $ ps uwp 3922
    USER     PID  %CPU %MEM VSZ   RSS  TTY STAT START TIME  COMMAND
    ganglia  3922 0.1  0.0  80108 6348 ?   Ss   Oct23 23:05 /usr/sbin/gmond

    $ ps -lfp 3922
    F S UID     PID  PPID C PRI NI ADDR SZ    WCHAN STIME TTY TIME     CMD
    5 S ganglia 3922 1    0 76  0  -    20027 -     Oct23 ?   00:23:05 /usr/sbin/gmond

While it is "frozen" on TCP port 8649, however, it seems to continue communicating with other nodes by multicast. If I change the data_source line in gmetad.conf and restart gmetad, I start getting metrics for the whole cluster, including the "frozen" host, even though attempts to poll it directly over TCP all hang.

Restarting gmond on the semi-frozen node, as usual, fixes the problem.

Note: This problem would be less serious if gmetad treated this error the same as a connection failure, and moved on to the next host listed in the data_source line.
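
To make the suggestion in the note concrete, here is a rough Python sketch of the failover behaviour I have in mind. It is not gmetad's actual code (gmetad is written in C), and the function names, the tuple-based host list, and the 10-second default are all made up for illustration. The only point is that a read that stalls is handled exactly like a refused connection: give up on that host and try the next one in the list.

```python
import socket


def read_xml_dump(addr, timeout):
    """Connect to (host, port) and read until the peer closes the socket.

    Hypothetical helper: raises OSError (or a subclass) if the connect
    fails, if any read stalls longer than `timeout` seconds, or if the
    peer closes without sending anything.
    """
    chunks = []
    # create_connection() applies `timeout` to the connect and leaves it
    # set on the socket, so each recv() below is bounded as well.
    with socket.create_connection(addr, timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)  # raises a timeout error if the peer stalls
            if not data:
                break
            chunks.append(data)
    if not chunks:
        raise ConnectionError("peer closed without sending data")
    return b"".join(chunks)


def poll_data_source(addrs, timeout=10.0):
    """Try each (host, port) in order; return (addr, xml) from the first
    one that delivers a complete dump.

    A stalled read counts the same as a failed connect: record the error
    and move on to the next address.
    """
    errors = {}
    for addr in addrs:
        try:
            return addr, read_xml_dump(addr, timeout)
        except OSError as exc:  # refused, unreachable, and timed out alike
            errors[addr] = exc
    raise ConnectionError(f"no host in data source answered: {errors}")
```

With something like this, a "frozen" first node costs one timeout per polling cycle instead of blanking out the whole cluster, e.g. `poll_data_source([("node1", 8649), ("node2", 8649)])` (host names hypothetical).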
-- Cos

