Bernard Li wrote:
Hi Cameron:

On Thu, Feb 25, 2010 at 1:48 PM, Cameron Spitzer <[email protected]> wrote:

  
We've had Ganglia running for several weeks on about 25 hosts.  One
gmetad and 25 gmonds.
Two weeks ago we had to replace one of the hosts.  The new one has the
same gmond and IP address.
    

Same hostname too I presume?  On gmetad, your hosts show up with
hostnames, correct?
  
Yes, same hostname.
  
Telnet from the master to the new host gives an XML document, same as
the old one.
    

What I would test is telnet (or nc) from master to _another_ host and
make sure that it has metrics from the "new" host.
  
I don't understand that at all.  Host A is running gmetad.
Host B (gmond)  is not getting graphed, even though it sends XML.
Hosts C through W are working fine.

How would telnet from A to C tell me what's wrong with B?
Why would host C know anything about host B?
Should any gmond host have information about all the other gmond hosts?
In any case, the telnet output is the same from B and from C.
There is no reference to any hosts in it.
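
For reference, the telnet check can be scripted so the host list is explicit. This is a minimal sketch, not a description of this setup: it assumes gmond answers on its default TCP port 8649 and that the dump contains HOST elements with NAME and REPORTED attributes; the function names are made up for illustration. In a unicast setup, the gmond that gmetad polls (the one in its data_source line) is the one that would be expected to list host B.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the XML dump gmond writes on its TCP port until it closes the socket.

    Port 8649 is gmond's default; adjust if tcp_accept_channel differs.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def hosts_in_gmond_xml(xml_bytes):
    """Map each HOST NAME in the dump to its REPORTED timestamp (epoch seconds)."""
    root = ET.fromstring(xml_bytes)
    return {
        host.get("NAME"): int(host.get("REPORTED", "0"))
        for host in root.iter("HOST")
    }
```

Calling hosts_in_gmond_xml(fetch_gmond_xml("hostC")) from the gmetad machine would return an empty dict if the dump really has no HOST entries, and otherwise shows whether host B is among them and when it last reported.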

Are you using multicast (default) or unicast?
  
Unicast.

The web frontend to gmetad reports the new host is down.  No combination
of restarts brings it back.
    

Just FYI, the recommended combination should be:

1) stop gmetad
2) stop all gmond
3) start all gmond
4) start gmetad
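
The ordering above can be written down as a small plan builder. This is only a sketch: the init-script path, the host names, and the function name are assumptions for illustration, and it only builds the list of commands rather than running anything.

```python
def restart_plan(gmetad_host, gmond_hosts,
                 service_cmd="/etc/init.d/{name} {action}"):
    """Return (host, command) pairs in the recommended restart order.

    service_cmd is a hypothetical init-script template; substitute whatever
    your distribution uses to stop and start the daemons.
    """
    def cmd(name, action):
        return service_cmd.format(name=name, action=action)

    plan = [(gmetad_host, cmd("gmetad", "stop"))]               # 1) stop gmetad
    plan += [(h, cmd("gmond", "stop")) for h in gmond_hosts]    # 2) stop all gmond
    plan += [(h, cmd("gmond", "start")) for h in gmond_hosts]   # 3) start all gmond
    plan.append((gmetad_host, cmd("gmetad", "start")))          # 4) start gmetad
    return plan


if __name__ == "__main__":
    # Hypothetical host names, printed for review instead of executed.
    for host, command in restart_plan("hostA", ["hostB", "hostC"]):
        print(f"{host}: {command}")
```

Keeping the plan as data makes it easy to review the order before feeding it to ssh or a similar tool.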

  
I did that, and the replaced host is still showing as dead.

Is there some known procedure for replacing a Ganglia-monitored host?
Do I have to remove the old one's rrd files or something?  Is this
documented anywhere?
    

You shouldn't have to.  Curious if there are any errors in your httpd
logs?  Also, I assume you're not having time synchronization issues?
  
Time sync is pretty good here.  We use ntp everywhere.

httpd throws errors every time one of the graphs refreshes.

ERROR: opening '/var/lib/ganglia/rrds/gangliatest/p4-icmse-01-node1.nvidia.com/bytes_in.rrd': No such file or directory
[Thu Feb 25 17:01:11 2010] [error] [client 172.17.129.212] PHP Notice:  Undefined index:  cpu_num in /p4/www/htdocs/ganglia/functions.php on line 179
[Thu Feb 25 17:01:11 2010] [error] [client 172.17.129.212] PHP Notice:  Undefined index:  load_one in /p4/www/htdocs/ganglia/functions.php on line 184
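
The first error points at a missing RRD file. A rough sketch for listing which host directories under a cluster's rrds directory lack a given metric file follows. It assumes the rrds/<cluster>/<host>/<metric>.rrd layout seen in the path above; the function name is made up, and directories starting with "__" (such as gmetad's summary directory) are skipped on the assumption that they are not hosts.

```python
from pathlib import Path


def hosts_missing_metric(cluster_dir, metric="bytes_in"):
    """Return names of host directories that have no <metric>.rrd file."""
    missing = []
    for host_dir in sorted(Path(cluster_dir).iterdir()):
        # Skip plain files and summary-style directories like __SummaryInfo__.
        if not host_dir.is_dir() or host_dir.name.startswith("__"):
            continue
        if not (host_dir / f"{metric}.rrd").is_file():
            missing.append(host_dir.name)
    return missing
```

For example, hosts_missing_metric("/var/lib/ganglia/rrds/gangliatest") would list every host directory that the frontend will complain about for bytes_in.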


That's from a host we stopped monitoring.  I removed its polluted rrd files.
Maybe I should have removed its whole gmetad directory.
(Incidentally, the REMOVE_BOGUS_SPIKES hack doesn't work, as shipped,
because its thresholds are way too high.  I reduced them to twice what the
physical link can actually do.  I'm logging one of the hosts, and it discards
about twenty bogus data points per day now.)
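
To make the "twice the physical link" rule concrete, here is the arithmetic as a tiny sketch. The 1 Gbit/s link speed is a made-up example, not the speed of the hosts discussed here, and the function name is hypothetical. It assumes the metric is reported in bytes per second.

```python
def spike_threshold_bytes_per_sec(link_bits_per_sec, factor=2):
    """Largest plausible byte rate: link capacity in bytes/sec times a safety factor."""
    return link_bits_per_sec / 8 * factor


# Hypothetical 1 Gbit/s link: 125,000,000 bytes/sec, doubled.
threshold = spike_threshold_bytes_per_sec(1_000_000_000)
```

Any sample above that value cannot have come from real traffic on such a link, so it is safe to discard.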


-Cameron




