[Ganglia-general] gmetad bug when gmond host hangs

Ofer Inbar Fri, 29 Aug 2008 10:18:42 -0700

In my gmetad.conf I have:
data_source "Cluster" host1 host2 host3 host4 host5 host6


host1 froze, but its TCP/IP stack stayed up, so you could still open
TCP connections to ports where something was listening, but the
service behind the port would never respond. This affected ssh,
gmond, and anything else on that host.

gmetad was logging this error every 12-14 seconds:
/usr/sbin/gmetad[24938]: poll() timeout for [Cluster] data source after 0 bytes read

It never fell back to any of the other hosts in the cluster, even
though host2 through host6 were all still up, so it stopped collecting
all metrics for the cluster.

I changed the data_source line to:
data_source "Cluster" host2 host3 host4 host5 host6 host1

restarted gmetad, and it started storing metrics for the cluster right away.

I would expect that if gmetad gets a timeout without getting the XML
data from one host in a data source, it would log an error saying
which host it had a problem with, then fall back and try the next one.
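
To make the behavior I expected concrete, here is a minimal Python
sketch. It is not gmetad's actual code; the function names
(fetch_xml, poll_data_source) are made up for illustration, and it
assumes gmond sends its XML and closes the connection on its default
TCP port, 8649. A timeout or empty read on one host is logged with
that host's name, and the next host in the list is tried.

```python
import socket


def fetch_xml(host, port=8649, timeout=10.0):
    """Read everything a gmond-style TCP port sends until it closes.

    Assumption: the daemon writes its XML and then closes the socket.
    Raises OSError (socket.timeout is a subclass) on a connect or read
    timeout, or if the peer closes after sending nothing.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)  # inherits the timeout set above
            if not data:
                break
            chunks.append(data)
    payload = b"".join(chunks)
    if not payload:
        raise OSError("connection closed after 0 bytes read")
    return payload


def poll_data_source(name, hosts, fetch=fetch_xml, log=print):
    """Try each host in order; return (host, xml) from the first that answers."""
    for host in hosts:
        try:
            return host, fetch(host)
        except OSError as exc:
            # Name the failing host, then fall through to the next one.
            log(f"[{name}] {host}: {exc}; trying next host")
    raise RuntimeError(f"[{name}] no host in data source responded")
```

With the original host order, a call such as
poll_data_source("Cluster", ["host1", "host2", "host3"]) would log one
line naming host1 and then return the data from host2, rather than
retrying host1 forever.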

Is this a bug?  Any known workarounds?  Should I report it on bugzilla?
  -- Cos

