Hi all,
I've been testing Ganglia for a while on a cluster of approximately 280
hosts. Approximately 260 of these are of one class -- data hosts -- and
another 18-20 are of another class -- compute hosts.
I had numerous issues getting Ganglia to work reliably while it was
configured to use multicast. While I'm unsure if these were directly
caused by multicast, turning off multicast and moving to a unicast model
has improved reliability considerably.
However, I'm still experiencing some issues that I have not been able to
explain, and I would love any feedback anyone has. I have a deadline of
this Friday to make a "go/no-go" decision on
whether to use Ganglia as my cluster monitoring and reporting package. I
would like to use Ganglia rather than another package (given the
features Ganglia has and the time invested in it thus far), but only if
I can be comfortable with its reliability, either by having an
explanation for the aberrant behavior and/or for why I should be using a
different configuration (which I can then take into account and work
around), or by having these issues fixed, and preferably both over time.
1) When starting gmond on all the hosts and gmetad on the monitor
host/head node (from a complete shutdown of gmond on all the hosts, a
shutdown of gmetad on the monitor host/head node, and a purge of the
RRDs on the gmetad host), gmond will start up fine but gmetad seems to
lag behind in reporting even when all hosts are up and gmond is running.
I've found that initiating a network connection (e.g. ssh, or simply
using netcat (nc) against the TCP port gmond listens on) to each of
these hosts from the monitor host/head node will then prod gmetad into
reporting for them. This seems odd. I would hope that gmetad would start
aggregating stats for each host once gmond was back up and running.
Having to make a network connection to each host in order to get gmetad
to see them is bad, since if gmond were to go down on a host (e.g.
because the host itself went down) and come back up, gmetad may not see
it until something else connected to it. However, if gmetad doesn't see
it come back up, the cluster software I'm using will mark it as down and
will intentionally not spawn connections to it. Has anyone else
encountered this or a similar problem?
2) Early last week I noticed that three compute hosts stopped reporting
in gmetad, though those hosts were physically up and alive on the
network and gmond was running. Using nc on the gmond TCP port returned
the usual XML feed. Stopping and then starting gmond on these hosts
seemed to do the trick, but there is no clear reason why gmetad lost
track of them in the first place.
3) Load on the monitor host/head node seems higher than it should be. It
hovers around 2.6 - 3.0. While other software is running on this host,
shutting down gmetad results in load falling back down to levels similar
to other compute hosts (since the monitor host/head node is currently
also a host in the Compute Hosts cluster). Also, in concert with the
higher than expected load, ssh sessions to the monitor host/head node
seem to take a long time to establish. Again, shutting down gmetad seems
to alleviate these problems. While neither of these issues prevents
work from being done or gmetad from working (in the current
configuration), the load does seem abnormally high and is something of
an annoyance.
4) Snippets of my current configuration are provided below. I can
provide more information if needed. Host names and the like have been
changed, but the particulars are the same. Should I be doing something
other than what I am doing below? Anything not specified is left as the
default setting.
gmetad.conf excerpt:
--------------------
data_source "Compute Hosts" headnode:8649
data_source "Data Hosts" datahost2020:8649
scalable off
gridname "Our Cluster"
authority "http://headnode/ganglia/"
all_trusted on
gmond.conf (for data nodes) excerpt:
------------------------------------
globals {
setuid = yes
user = nobody
cleanup_threshold = 300 /*secs */
}
cluster {
name = "Data Hosts"
owner = "Company"
url = "http://www.example.com/"
}
udp_send_channel {
host = datahost2020
port = 8649
}
/* [ We set udp_recv_channel to be 8649 mostly just so
* that the host specified in udp_send_channel above
* (datahost2020 for the Data Hosts) can receive XDR
* from other Data Hosts. ]
*/
udp_recv_channel {
port = 8649
}
/* [ "timeout = -1" turns on blocking I/O, which should
* alleviate XML corruption issues. ]
*/
tcp_accept_channel {
port = 8649
timeout = -1
}
The gmond.conf excerpt for my compute hosts is exactly the same as the
one shown above for the data hosts, except for the following change,
which specifies the host the compute hosts report to. That host is the
monitor host/head node (i.e. it's also running gmetad and the web
frontend):
udp_send_channel {
host = headnode
port = 8649
}
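In case it helps anyone reproduce issue 1), here is a minimal Python
sketch of the manual "prod" I described, i.e. the same thing I was doing
by hand with nc against the gmond TCP port. The function names
(read_gmond_xml, probe_hosts) are my own hypothetical helpers, not part
of Ganglia. I'm assuming that gmond closes the connection once it has
sent its XML dump and that a complete dump ends with </GANGLIA_XML>.

```python
import socket


def read_gmond_xml(host, port=8649, timeout=5.0):
    """Connect to a gmond tcp_accept_channel and return everything it sends.

    Assumption: gmond writes its XML dump and then closes the connection,
    so we simply read until recv() returns an empty bytes object.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def probe_hosts(hosts, port=8649, timeout=5.0):
    """Return {host: bool}; True means a complete-looking XML dump came back.

    Connection errors and timeouts (both OSError subclasses) count as False.
    """
    results = {}
    for host in hosts:
        try:
            xml = read_gmond_xml(host, port, timeout)
            results[host] = xml.rstrip().endswith(b"</GANGLIA_XML>")
        except OSError:
            results[host] = False
    return results
```

Running probe_hosts(["headnode", "datahost2020"]) from the monitor
host/head node would touch both collectors from the excerpts above. It
only automates the workaround; it does not explain why gmetad needs the
prod in the first place.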
One thought I had was to add another layer of gmetad reporting: one
host running gmetad dedicated to the data hosts, another dedicated to
the compute hosts, and then *another* independent machine running
gmetad which polls both of those cluster-specific gmetad aggregators.
This seems like it shouldn't be necessary, though.
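Separately, to catch the situation in 2) without eyeballing the web
frontend, something like the following Python sketch could compare the
XML feed from a collector against the list of hosts I expect to see. It
is only a sketch under assumptions: stale_hosts is a hypothetical helper
of my own, it assumes each HOST element carries NAME, TN (seconds since
the host was last heard from) and TMAX attributes, and the factor of 4
used to call a host stale is a rule of thumb I picked for illustration,
not a documented Ganglia constant.

```python
import xml.etree.ElementTree as ET


def stale_hosts(xml_bytes, expected, factor=4):
    """Return (missing, stale) host-name lists for one Ganglia XML dump.

    missing: names in `expected` with no HOST element in the dump.
    stale:   hosts present in the dump whose TN exceeds factor * TMAX.
    The factor is an assumption; tune it to taste.
    """
    root = ET.fromstring(xml_bytes)
    seen = {}
    for host in root.iter("HOST"):
        tn = int(host.get("TN", "0"))
        tmax = int(host.get("TMAX", "0"))
        seen[host.get("NAME")] = (tn, tmax)
    missing = sorted(set(expected) - set(seen))
    stale = sorted(
        name for name, (tn, tmax) in seen.items()
        if tmax and tn > factor * tmax
    )
    return missing, stale
```

The input would be the bytes read from port 8649 on headnode or
datahost2020 (e.g. what nc returns), and the expected list would come
from wherever the cluster's host inventory lives.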
This is a lot for anyone to read, so if you've gotten this far, thanks
for reading! If anyone has any feedback, I'd love to hear from you. I'd
really like to make a "go" decision on Ganglia if I can figure out
what's causing these issues and work on solutions or workarounds that
still let me benefit from the rest of Ganglia's functionality.
Again, thanks in advance!
--
Dan Moniz <[EMAIL PROTECTED]> [http://www.pobox.com/~dnm/]