On Feb 9, 2006, at 6:44 PM, Y. Huang wrote:
I changed
  if ($host['TN'] > $host['TMAX'] * 4)
to
  if ($host['TN'] > $host['TMAX'] * 8)
for a test.

I watched the status of these remote nodes on the web interface from the moment gmond was restarted until the nodes appeared to be down:

With a multiplier of 4, these nodes' heartbeats cycle every 20s for up to 180s, then the head node stops receiving heartbeats from these remote
nodes. The nodes appear to be down after 80s (they become
'dead' at the point "Last heartbeat received 80 seconds ago").

With a multiplier of 8, these nodes' heartbeats cycle every 20s for up to 180s, then the head node stops receiving heartbeats from these remote
nodes. The nodes appear to be down after 160s (they become
'dead' at the point "Last heartbeat received 160 seconds ago").

It seems that these nodes stop communicating with the head node after
180s. Increasing the multiplier just postpones the time when these nodes
become 'dead'.
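
The arithmetic behind those two observations can be sketched as follows. This is only an illustration, not Ganglia's code: the function names are made up, and TMAX = 20 is an assumption inferred from the 20s heartbeat cycle and the 80s/160s cutoffs reported above.

```python
def down_after_seconds(tmax: int, multiplier: int) -> int:
    """Seconds since the last heartbeat at which a host starts to show as down."""
    return tmax * multiplier


def is_down(tn: int, tmax: int, multiplier: int) -> bool:
    # Mirrors the PHP test quoted above: $host['TN'] > $host['TMAX'] * multiplier
    return tn > tmax * multiplier


TMAX = 20  # assumed value, see the note above
for multiplier in (4, 8):
    print(f"multiplier {multiplier}: down after {down_after_seconds(TMAX, multiplier)}s")
```

Under that assumption the thresholds come out at 80s and 160s, which matches the "Last heartbeat received" values seen on the web page. A larger multiplier only moves the cutoff; it does nothing about heartbeats that stop arriving.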

You're right: what worked for us (I haven't seen a node "down" since we set the multiplier to 10) isn't working for you. It sounds like you're getting different data from your clients.

Have you tried 'telnet localhost 8651' to look at the data coming from the nodes? The information Ganglia's PHP interface uses to decide whether a host is down comes in the HOST tags, and they'll look something like this:

<HOST NAME="node12" IP="192.168.4.94" REPORTED="1139073927" TN="132" TMAX="20" DMAX="3600" LOCATION="unspecified" GMOND_STARTED="1131659804">

What are the TN and TMAX values for one of your hosts?
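
If it helps, here is a rough sketch of pulling TN and TMAX out of a HOST tag and applying the same comparison the PHP interface makes. It is not part of Ganglia: host_status is a hypothetical helper, and the closing </HOST> is added only so that the single sample line parses as standalone XML.

```python
import xml.etree.ElementTree as ET

# Sample HOST tag from this message, with a closing tag added so it is
# well-formed on its own.
SAMPLE = (
    '<HOST NAME="node12" IP="192.168.4.94" REPORTED="1139073927" '
    'TN="132" TMAX="20" DMAX="3600" LOCATION="unspecified" '
    'GMOND_STARTED="1131659804"></HOST>'
)


def host_status(host_xml: str, multiplier: int = 4) -> dict:
    """Extract TN/TMAX from one HOST element and apply TN > TMAX * multiplier."""
    host = ET.fromstring(host_xml)
    tn = int(host.get("TN"))
    tmax = int(host.get("TMAX"))
    return {
        "name": host.get("NAME"),
        "tn": tn,
        "tmax": tmax,
        "down": tn > tmax * multiplier,
    }


for multiplier in (4, 8):
    print(multiplier, host_status(SAMPLE, multiplier))
```

For the sample host, TN=132 exceeds 20 * 4 = 80 but not 20 * 8 = 160, so it would be flagged down with a multiplier of 4 and still shown as up with 8.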

Of course, this is data the head node has already received, and your problem might be "upstream," so to speak, in which case I'm not really sure what to tell you.

pjm
