On Feb 1, 2006, at 12:23 PM, Y. Huang wrote:
I installed Ganglia 3.0.2 on a dual Xeon EM64T cluster running Red Hat
Enterprise Linux 4.0. The Ganglia web has been running OK for about 2
months. However, a problem suddenly came up this morning:

The remote nodes appeared to be down on the Ganglia web page (they were
actually up). I restarted gmond on the head node, and the Ganglia web
page then showed these remote nodes as up, but exactly 5 minutes later
the page reported them as down again.

Does anyone know what the problem might be? Thanks a lot for your help.

Yiye: We've seen a similar thing here at Tufts: "up" nodes being reported as "down" in Ganglia. The problem appears to be with the function "host_alive($host, $cluster)" (starting on line 43 of ganglia.php in the web docs). The PHP front-end shows a host as "down" if the TN value in the XML report is more than four times the TMAX value:

      if ($host['TN'] > $host['TMAX'] * 4)
         return FALSE;
         $host_up = FALSE;

So in this snippet from our XML, this host would be marked "down":

<HOST NAME="obscured" IP="192.168.4.104" REPORTED="1139073925" TN="135" TMAX="20" DMAX="3600" LOCATION="unspecified" GMOND_STARTED="1131659803">

Since 135 > 20 * 4 = 80, the front-end marks the host as down.
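
The TN/TMAX comparison can be checked outside the web front-end with a short script. What follows is only a minimal Python sketch of the same rule, not the Ganglia code itself: the names DOWN_FACTOR, host_alive, hosts_marked_down and SAMPLE are made up for illustration, and the sample wraps the HOST element in a bare CLUSTER element and self-closes it so the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

# Assumption: mirrors the factor of 4 in the PHP check quoted above.
DOWN_FACTOR = 4


def host_alive(tn, tmax, factor=DOWN_FACTOR):
    """Return False when TN exceeds factor * TMAX, True otherwise."""
    return not (tn > tmax * factor)


def hosts_marked_down(xml_text):
    """Return the NAME of every HOST element that the TN/TMAX rule flags as down."""
    root = ET.fromstring(xml_text)
    down = []
    for host in root.iter("HOST"):
        tn, tmax = host.get("TN"), host.get("TMAX")
        if tn is None or tmax is None:
            continue  # this sketch skips hosts that lack either attribute
        if not host_alive(int(tn), int(tmax)):
            down.append(host.get("NAME"))
    return down


# Hypothetical wrapper around the HOST line from the XML snippet above.
SAMPLE = """<CLUSTER NAME="example">
<HOST NAME="obscured" IP="192.168.4.104" REPORTED="1139073925" TN="135" TMAX="20" DMAX="3600" LOCATION="unspecified" GMOND_STARTED="1131659803"/>
</CLUSTER>"""

if __name__ == "__main__":
    print(hosts_marked_down(SAMPLE))  # prints ['obscured']
```

Running the same function against the XML that gmond serves on its TCP port (typically 8649) should list the hosts the web page would show as down, which helps separate a front-end threshold issue from nodes that have really stopped reporting.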

The trick is figuring out where, in gmond.conf, to change the value of TMAX! I'm betting here:

/* This collection group will cause a heartbeat (or beacon) to be sent
   every 20 seconds. In the heartbeat is the GMOND_STARTED data which
   expresses the age of the running gmond. */
collection_group {
  collect_once = yes
  time_threshold = 20
  metric {
    name = "heartbeat"
  }
}

Hope this is helpful...

pjm
