In my gmetad.conf I have:

    data_source "Cluster" host1 host2 host3 host4 host5 host6

host1 froze, but its TCP/IP stack stayed up. You could still open TCP connections to any port where something was listening, but the service behind that port would never respond. This affected ssh, gmond, and everything else on that host.

gmetad was logging this error every 12-14 seconds:

    /usr/sbin/gmetad[24938]: poll() timeout for [Cluster] data source after 0 bytes read

It never fell back to any of the other hosts in the cluster, even though host2 and up were all still running, so it stopped collecting all metrics for the cluster.

I changed the data_source line to:

    data_source "Cluster" host2 host3 host4 host5 host6 host1

and restarted gmetad, and it started storing metrics for the cluster right away.

I would expect that if gmetad times out without getting any XML data from one host in a data source, it would log an error saying which host it had a problem with, then fall back and try the next one.

Is this a bug? Are there any known workarounds? Should I report it on bugzilla?

-- Cos
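
The fallback behaviour expected above can be sketched in a few lines of Python. This is only an illustration of the idea, not gmetad's actual logic: the names `fetch_xml` and `first_responding` are hypothetical, and the port assumes gmond's default TCP accept channel (8649), which may differ on your cluster. A host that accepts the connection but never sends anything is treated the same as one that refuses it: the failure is logged with the host name and the next host is tried.

```python
import socket

# Assumption: gmond listens on its default TCP accept port; adjust if needed.
GMOND_PORT = 8649


def fetch_xml(host, port=GMOND_PORT, timeout=10.0):
    """Read the XML dump from one gmond, raising OSError on any failure."""
    chunks = []
    # The timeout applies to the connect and to each recv() call,
    # so a frozen host that accepts but never answers raises a timeout.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # peer closed the connection
                break
            chunks.append(data)
    if not chunks:
        raise OSError("connection closed after 0 bytes read")
    return b"".join(chunks)


def first_responding(hosts, port=GMOND_PORT, timeout=10.0, fetch=fetch_xml):
    """Try hosts in order; return (host, xml) for the first one that answers."""
    for host in hosts:
        try:
            return host, fetch(host, port, timeout)
        except OSError as exc:  # includes timeouts and refused connections
            print(f"{host}: {exc}; trying next host")
    return None, None
```

As a stopgap, something like `first_responding(["host1", "host2", "host3"])` could be run from cron to find a live host before rewriting the data_source line, though that is a workaround rather than a fix.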

