I currently have 4 clusters all running 3.0.0.
I can monitor 3 of my 4 clusters in my grid just fine (i.e. I can see
and monitor the machines from my gmetad server), but these clusters
which work have at most 15 machines in them.
The 4th cluster, which has about 250 machines in it, does not show up on
the web front end.
I ran gmetad in debug mode and got this error:
Process XML (Athlon 2.0 ghz Cluster): XML_ParseBuffer() error at line 146:
not well-formed (invalid token)
When I telnet from my gmetad host to one of the hosts to check the XML
output, the entries for a few machines come out corrupted, like this
for example:
....
<HOST NAME="renderserver672.george.com" IP="10.5.8.53"
REPORTED="1108599752" TN="1" TMAX="20" DMAX="0" L<METRIC NAME="pkts_out"
VAL="1.35" TYPE="float" UNITS="packets/sec" TN="151" TMAX="300" DMAX="0"
SLOPE="both" SOURCE="gmond"/>
.....
Notice that the <METRIC tag starts before the <HOST tag is closed. The
error seems to happen in random locations. Here's another example:
....
<METRIC NAME="pkts_out" VAL="4.47" TYPE="float" UNITS="packets/sec"
TN="107" TMAX="300" DMAX="0" SLOPE="both" S<METRIC NAME="cpu_num"
VAL="2" TYPE="uint16" UNITS="CPUs" TN="107" TMAX="1200" DMAX="0"
SLOPE="zero" SOURCE="gmond"/>
.....
This seems to happen randomly to different machines, but it occurs
about every 200-500 XML lines (about 20 machines' worth of stats).
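For anyone who wants to reproduce the check without eyeballing telnet
output, here is a rough Python sketch that pulls the XML dump from a
gmond TCP port and reports where the parser first chokes. The function
names (fetch_gmond_xml, find_xml_error, show_context) are made up for
this illustration, and it assumes gmond closes the connection once the
dump is complete.

```python
import socket
from xml.parsers import expat


def fetch_gmond_xml(host, port, timeout=10.0):
    """Read everything gmond writes to its TCP port until it closes."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            chunk = sock.recv(65536)
            if not chunk:  # assumption: gmond closes after the full dump
                break
            chunks.append(chunk)
    return b"".join(chunks)


def find_xml_error(data):
    """Return (line, column, message) for the first parse error, or None."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, True)
    except expat.ExpatError as err:
        return err.lineno, err.offset, expat.ErrorString(err.code)
    return None


def show_context(data, line, before=2):
    """Return (line_number, text) pairs leading up to and including `line`."""
    lines = data.decode("latin-1").split("\n")
    start = max(line - 1 - before, 0)
    return [(n + 1, lines[n]) for n in range(start, min(line, len(lines)))]
```

Something like find_xml_error(fetch_gmond_xml("10.5.8.73", 8640)) run
from the gmetad host should then give the line number of the first bad
token, and show_context() prints the lines just before it. Running it
several times in a row would show whether the bad spot moves around.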
Each cluster is set to multicast on a unique multicast address and
port.
The gmond.conf files for each cluster were generated using gmond
--convert, since they were running 2.5 before.
I think the confs are all OK, since I use the same conf file on the other
clusters and they work just fine. It seems to be a problem when you load
up more than 20 or so machines in a cluster.
As an example, a snippet from a gmond.conf file on the cluster that
doesn't work is included below.
Any help is very much appreciated.
Thanks,
-ERIC
-----------gmond.conf
/* global variables */
globals {
mute = "no"
deaf = "no"
debug_level = "0"
setuid = "yes"
user="nobody"
gexec = "yes"
host_dmax = "0"
}
/* info about your identity */
cluster {
name = "Athlon 2.0 ghz Cluster"
owner = "unspecified"
latlong = "unspecified"
url="unspecified"
}
/* channel to send multicast on mcast_channel:mcast_port */
udp_send_channel {
mcast_join = "224.0.19.81"
port = "8640"
ttl="5"
}
/* channel to receive multicast from mcast_channel:mcast_port */
udp_recv_channel {
mcast_join = "224.0.19.81"
port = "8640"
bind = "224.0.19.81"
}
/* channel to export xml on xml_port */
tcp_accept_channel {
port = "8640"
/* your trusted_hosts assuming ipv4 mask*/
acl{
default="deny"
access {
ip="10.5.45.25"
mask = 24
action = "allow"
}
access {
ip="10.5.45.2"
mask = 24
action = "allow"
}
}
}
.....all the metric definitions after this........
----------------------
---gmetad.conf---
data_source "Athlon 1.4 ghz Cluster" 15 10.5.8.20:8639
data_source "Athlon 2.0 ghz Cluster" 15 10.5.8.73:8640 10.5.8.5:8640
data_source "Opteron 2.0 ghz Cluster" 15 10.5.9.109:8641 10.5.9.122:8641
data_source "Other Cluster" 15 10.5.9.107:8642
....
----------------------
*IPs and cluster names have been changed slightly in these examples, but
I've triple checked that what's in the gmetad file matches what's in the
gmond file on the clients.
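
In case it helps anyone double-check the same thing on their own setup,
below is a small Python sketch of that cross-check: it compares the
data_source entry in gmetad.conf against the cluster name and
tcp_accept_channel port in a gmond.conf. The helper names are invented
for this example, the regexes are deliberately naive (they assume the
simple layout shown above, with no nested braces before the value), and
8649 is assumed as the port when a data_source entry gives none.

```python
import re
import shlex


def parse_data_sources(text):
    """Map cluster name -> list of (host, port) from gmetad.conf text."""
    sources = {}
    for raw in text.splitlines():
        fields = shlex.split(raw, comments=True)
        if len(fields) < 2 or fields[0] != "data_source":
            continue
        name, rest = fields[1], fields[2:]
        if rest and rest[0].isdigit():  # optional polling interval
            rest = rest[1:]
        endpoints = []
        for item in rest:
            host, _, port = item.partition(":")
            endpoints.append((host, int(port) if port else 8649))
        sources[name] = endpoints
    return sources


def parse_gmond(text):
    """Return (cluster name, XML export port) from gmond.conf text."""
    name = re.search(r'\bcluster\s*\{[^}]*?name\s*=\s*"([^"]*)"', text)
    port = re.search(r'tcp_accept_channel\s*\{[^}]*?port\s*=\s*"?(\d+)', text)
    return (name.group(1) if name else None,
            int(port.group(1)) if port else None)


def check_match(gmetad_text, gmond_text):
    """Return "ok" or a short description of the first mismatch found."""
    name, port = parse_gmond(gmond_text)
    endpoints = parse_data_sources(gmetad_text).get(name)
    if endpoints is None:
        return "no data_source named %r" % (name,)
    wrong = ["%s:%d" % (h, p) for h, p in endpoints if p != port]
    if wrong:
        return "port mismatch (gmond exports on %s): %s" % (port, ", ".join(wrong))
    return "ok"
```

Feeding it the contents of the two files, e.g.
check_match(open("gmetad.conf").read(), open("gmond.conf").read()),
returns "ok" only when the quoted cluster name and every listed port
agree. It does not check that the listed hosts are reachable.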