On Feb 1, 2006, at 12:23 PM, Y. Huang wrote:
I installed Ganglia 3.0.2 on a dual Xeon EM64T cluster running Red Hat
Enterprise Linux 4.0. The Ganglia web has been running OK for about 2
months. However, a problem suddenly came up this morning:
The remote nodes appeared to be down on the Ganglia web page (they were
actually up). I restarted the gmond on the head node, then the Ganglia web
page showed these remote nodes were up, but exactly after 5 minutes, the
Ganglia web page said these nodes were down again.
Anyone know what was the problem? Thanks a lot for your help.
Yiye: We've seen a similar thing here at Tufts: "up" nodes being
reported as "down" in Ganglia. The problem appears to be with the
function "host_alive($host, $cluster)" (starting on line 43 of
ganglia.php in the web docs). The PHP front-end shows a host as
"down" if the TN value in the XML report is more than four times
the TMAX value:
    if ($host['TN'] > $host['TMAX'] * 4)
        return FALSE;
    $host_up = FALSE;
So in this snippet from our XML, this host would be marked "down":
<HOST NAME="obscured" IP="192.168.4.104" REPORTED="1139073925"
TN="135" TMAX="20" DMAX="3600" LOCATION="unspecified"
GMOND_STARTED="1131659803">
Since 135 > 20 * 4 = 80, the host is marked down.
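
If you want to check which hosts trip this rule without going through the web page, here is a rough Python sketch of the same comparison. It is not the Ganglia code, just my re-statement of the test quoted above. The function names and the bare CLUSTER wrapper around the sample are made up for illustration, and hosts with no TN/TMAX attributes are simply skipped here.

```python
import xml.etree.ElementTree as ET


def host_alive(tn, tmax, factor=4):
    """Re-statement of the quoted rule: a host is down when TN > TMAX * factor."""
    return not (tn > tmax * factor)


def hosts_marked_down(xml_text, factor=4):
    """Return the NAME of every HOST element that the rule would mark down."""
    root = ET.fromstring(xml_text)
    down = []
    for host in root.iter("HOST"):
        tn = int(host.get("TN", 0))
        tmax = int(host.get("TMAX", 0))
        # Assumption: hosts lacking TN or TMAX are skipped rather than judged.
        if tn and tmax and not host_alive(tn, tmax, factor):
            down.append(host.get("NAME"))
    return down


# The HOST element from this message, wrapped in a minimal CLUSTER element
# (hypothetical wrapper, only so the fragment parses as a document).
SAMPLE = """<CLUSTER NAME="example">
<HOST NAME="obscured" IP="192.168.4.104" REPORTED="1139073925"
 TN="135" TMAX="20" DMAX="3600" LOCATION="unspecified"
 GMOND_STARTED="1131659803"/>
</CLUSTER>"""

if __name__ == "__main__":
    print(hosts_marked_down(SAMPLE))
```

Feeding it the XML your own gmond reports should list the same nodes the web page shows as down, which at least confirms that TN versus TMAX is the cause.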
The trick is figuring out where, in gmond.conf, to change the value
of TMAX! I'm betting here:
/* This collection group will cause a heartbeat (or beacon) to be sent every
   20 seconds.  In the heartbeat is the GMOND_STARTED data which expresses
   the age of the running gmond. */
collection_group {
  collect_once = yes
  time_threshold = 20
  metric {
    name = "heartbeat"
  }
}
Hope this is helpful...
pjm