Caught one of these again.  Interesting that they usually seem to
happen on one cluster, and not on any of our other clusters running
the same OS & Ganglia version.  Something about the app mix running
on this cluster (Apache web servers) may contribute to triggering
the problem...

Anyway, on the grid server we get a bunch of:
Nov  3 13:07:01 gridserver /usr/sbin/gmetad[16299]: poll() timeout for [Web] data source after 0 bytes read

On the first node listed in the data_source line, gmond is running.

lsof shows:
  ...
gmond   3922 ganglia  mem    REG       8,3   143336 20331194 /usr/lib64/libexpat.so.0.5.0
gmond   3922 ganglia  mem    REG       8,3   110248 10371175 /lib64/libnsl-2.3.4.so
gmond   3922 ganglia    0r   CHR       1,3              2223 /dev/null
gmond   3922 ganglia    1w   CHR       1,3              2223 /dev/null
gmond   3922 ganglia    2w   CHR       1,3              2223 /dev/null
gmond   3922 ganglia    3r  0000       0,8        0     7554 eventpoll
gmond   3922 ganglia    4u  IPv4      7556               UDP 239.192.0.127:8649
gmond   3922 ganglia    5u  IPv4      7562               TCP *:8649 (LISTEN)
gmond   3922 ganglia    6u  IPv4      7564               UDP nodename:32769->239.192.0.127:8649
gmond   3922 ganglia    7u  IPv4 172429044               TCP nodename:8649->gridserver:48675 (CLOSE_WAIT)

strace shows:
$ strace -p 3922
Process 3922 attached - interrupt to quit
write(7, "<EXTRA_ELEMENT NAME=\"DESC\" VAL=\""..., 59

... and there it waits forever.  Certainly longer than the timeout for
a CLOSE_WAIT connection.

Probably not useful, but just in case, ps (both formats) shows:

$ ps uwp 3922
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
ganglia   3922  0.1  0.0 80108 6348 ?        Ss   Oct23  23:05 /usr/sbin/gmond
$ ps -lfp 3922
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
5 S ganglia   3922     1  0  76   0 - 20027 -      Oct23 ?        00:23:05 /usr/sbin/gmond

While it is "frozen" on TCP port 8649, however, it seems to continue
communicating with other nodes by multicast.  If I change the
data_source line in gmetad.conf and restart gmetad, I start getting
metrics for the whole cluster, including the "frozen" host, even
though attempts to poll it directly on TCP all hang.
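
For anyone who wants to check for this state without waiting for the
gmetad log message, here is a rough Python sketch of a probe.  It is
not part of Ganglia; the function name probe_gmond and the 10 second
default timeout are my own assumptions.  It connects to the gmond TCP
port and reports whether any bytes arrive before the timeout, which is
the "0 bytes read" symptom described above.

```python
import socket


def probe_gmond(host, port=8649, timeout=10.0):
    """Classify a gmond TCP port as 'ok', 'hung', 'closed' or 'unreachable'.

    'hung' means the TCP connection was established but no bytes
    arrived before the timeout (hypothetical helper, not a Ganglia API).
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, no route, or the connect itself timed out.
        return "unreachable"
    with sock:
        sock.settimeout(timeout)
        try:
            data = sock.recv(4096)
        except socket.timeout:
            # Connected, but nothing was sent within the timeout.
            return "hung"
        except OSError:
            return "unreachable"
    # An empty read means the peer closed without sending anything.
    return "ok" if data else "closed"
```

Something like probe_gmond("nodename") run from the grid server should
return "hung" for a node in this state and "ok" for a healthy one.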

Restarting gmond on the semi-frozen node, as usual, fixes the problem.

Note: This problem would be less serious if gmetad treated this error
the same as a connection failure, and moved on to the next host
listed in the data_source line.
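
To illustrate the behaviour I mean, here is a small Python sketch.
It is not gmetad's code (gmetad is written in C), and the names
fetch_xml and poll_data_source are made up for the example.  The idea
is only that a read timeout is handled exactly like a refused
connection, so the poller falls through to the next host.

```python
import socket


def fetch_xml(host, port=8649, timeout=10.0):
    """Read everything one gmond sends; raise OSError on any failure."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.settimeout(timeout)
        while True:
            # socket.timeout is a subclass of OSError, so a stalled
            # read propagates like any other connection error.
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    if not chunks:
        raise OSError("connection closed after 0 bytes")
    return b"".join(chunks)


def poll_data_source(hosts, timeout=10.0):
    """Try each (host, port) in order and return the first that answers.

    Returns ((host, port), xml_bytes).  A timeout counts as a failure
    and moves on to the next host in the list.
    """
    errors = []
    for host, port in hosts:
        try:
            return (host, port), fetch_xml(host, port, timeout)
        except OSError as exc:
            errors.append((host, port, exc))
    raise OSError(f"all hosts failed: {errors}")
```

The timeout here applies to each individual read, so a node that
trickles data slowly would not be skipped; only one that stalls
completely would.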
  -- Cos
