[Ganglia-general] Ganglia issues I've been experiencing

Dan Moniz Wed, 09 Mar 2005 15:23:27 -0800

Hi all,

I've been testing Ganglia for a while on a cluster of approximately 280 hosts. Approximately 260 of these are of one class -- data hosts -- and another 18-20 are of another class -- compute hosts.

I had numerous issues getting Ganglia to work reliably while it was configured to use multicast. While I'm unsure if these were directly caused by multicast, turning off multicast and moving to a unicast model has improved reliability considerably.

However, I'm still experiencing some issues that I have not been able to explain, and would love any feedback anyone has. I have a deadline of this Friday to make a "go/no-go" decision on whether to use Ganglia as my cluster monitoring and reporting package. I would like to use Ganglia rather than another package (given the features Ganglia has and the time invested in it thus far), but only if I can be comfortable about its reliability: either by having explanations for the aberrant behavior or reasons why I should be using a different configuration (which I can then take into account and work around), or by having these issues fixed, and preferably both over time.

1) When starting gmond on all the hosts and gmetad on the monitor host/head node (from a complete shutdown of gmond on all the hosts, a shutdown of gmetad on the monitor host/head node, and a purge of the RRDs on the gmetad host), gmond will start up fine but gmetad seems to lag behind in reporting even when all hosts are up and gmond is running. I've found that initiating a network connection to each of these hosts from the monitor host/head node (e.g. ssh, or simply using netcat (nc) to the TCP port on each host running gmond) will then prod gmetad into reporting for them. This seems odd. I would hope that gmetad would start aggregating stats for each host once gmond was back up and running. Having to make a network connection to each host in order to get gmetad to see them is bad, since if gmond were to go down on a host (e.g. because the host itself went down) and come back up, gmetad may not see it until something else connected to it. However, if gmetad doesn't see it come back up, the cluster software I'm using will mark it as down and will intentionally not spawn connections to it. Has anyone else encountered this or a similar problem?
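For anyone who wants to reproduce the nc workaround described in 1), here is a minimal Python sketch of the same idea: open a TCP connection to each gmond, read the XML dump until the peer closes the connection, and record which hosts answered. The function names (fetch_gmond_xml, poke_hosts) and the 5-second timeout are made up for illustration, and the check for a "<GANGLIA_XML" root tag is only a rough sanity test, not a validation of the feed.

```python
import socket


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Connect to a gmond TCP port and return whatever it sends before closing."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # peer closed the connection: the dump is complete
                break
            chunks.append(data)
    # latin-1 never fails to decode, so a damaged feed still comes back as text
    return b"".join(chunks).decode("latin-1")


def poke_hosts(hosts, port=8649, timeout=5.0):
    """Return {host: bool}: True if the host answered with something like Ganglia XML."""
    results = {}
    for host in hosts:
        try:
            xml = fetch_gmond_xml(host, port, timeout)
            results[host] = "<GANGLIA_XML" in xml
        except OSError:  # refused, unreachable, or timed out
            results[host] = False
    return results
```

Calling poke_hosts() with the list of data or compute host names would give a quick map of which gmond daemons are answering on TCP, which can then be compared against what gmetad is reporting.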

2) Early last week I noticed that three compute hosts stopped reporting in gmetad, though those hosts were physically up and alive on the network and gmond was running. Using nc on the gmond TCP port returned the usual XML feed. Stopping and then starting gmond on these hosts seemed to do the trick, but there is no clear reason why gmetad lost track of them in the first place.
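One way to catch the situation in 2) early would be to compare the hosts that appear in the aggregated XML against a list of hosts that are expected to be there. The sketch below is only an illustration under some assumptions: that the XML text has already been fetched (for example from gmetad's XML port, 8651 by default, or from a gmond aggregator), that each HOST element carries NAME and TN attributes (TN being the seconds since the host was last heard from), and that 60 seconds is a reasonable staleness threshold. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def host_ages(xml_text):
    """Map each HOST NAME in a Ganglia XML dump to its TN value (seconds since last report)."""
    root = ET.fromstring(xml_text)
    ages = {}
    for host in root.iter("HOST"):
        # A HOST without a TN attribute is treated as freshly reported (assumption)
        ages[host.get("NAME")] = int(host.get("TN", "0"))
    return ages


def missing_or_stale(xml_text, expected_hosts, max_age=60):
    """Return (missing, stale): expected hosts absent from the XML, and those older than max_age."""
    ages = host_ages(xml_text)
    missing = sorted(h for h in expected_hosts if h not in ages)
    stale = sorted(h for h in expected_hosts if h in ages and ages[h] > max_age)
    return missing, stale
```

Run periodically from the head node, something like this would flag hosts that gmetad has lost track of even though their gmond still answers on TCP.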

3) Load on the monitor host/head node seems higher than it should be. It hovers around 2.6 - 3.0. While other software is running on this host, shutting down gmetad results in load falling back down to levels similar to other compute hosts (since the monitor host/head node is currently also a host in the Compute Hosts cluster). Also, in concert with the higher than expected load, ssh sessions to the monitor host/head node seem to take a long time to establish. Again, shutting down gmetad seems to alleviate these problems. While neither of these issues prevents work from being done or gmetad from working (in the current configuration), the load does seem abnormally high and is something of an annoyance.

4) Snippets of my current configuration are provided below. I can provide more information if needed. Host names and whatnot are changed, but the particulars are the same. Should I be doing something other than what I am doing below? Anything not specified is left as the default setting.


gmetad.conf excerpt:
--------------------

data_source "Compute Hosts" headnode:8649
data_source "Data Hosts" datahost2020:8649
scalable off
gridname "Our Cluster"
authority "http://headnode/ganglia/";
all_trusted on



gmond.conf (for data nodes) excerpt:
------------------------------------

globals {
  setuid = yes
  user = nobody
  cleanup_threshold = 300 /*secs */
}

cluster {
  name = "Data Hosts"
  owner ="Company"
  url = "http://www.example.com/";
}

udp_send_channel {
  host = datahost2020
  port = 8649
}

/* [ We set udp_recv_channel to be 8649 mostly just so
 *   that the host specified in udp_send_channel above
 *   (datahost2020 for the Data Hosts) can receive XDR
 *   from other Data Hosts. ]
 */

udp_recv_channel {
  port = 8649
}

/* [ "timeout = -1" turns on blocking I/O, which should
 *    alleviate XML corruption issues. ]
 */

tcp_accept_channel {
  port = 8649
  timeout = -1
}

The gmond.conf excerpt for my compute hosts is exactly the same as the one shown above for the data hosts, except for the following change, which just specifies the host for compute hosts to report to. That host is the monitor host/head node (i.e. it's also running gmetad and the web frontend):


udp_send_channel {
  host = headnode
  port = 8649
}

One thought I had was to add another layer of gmetad reporting, and put a host running gmetad dedicated for the data hosts, another dedicated for the compute hosts, and then *another* independent machine running gmetad which will poll both of those cluster-specific gmetad aggregators. This seems like it shouldn't be necessary though.

This is a lot for anyone to read, so if you've gotten this far, thanks for reading! If anyone has any feedback, I'd love to hear from you. I'd really like to make a "go" decision on Ganglia if I can figure out what's causing these issues and work on solutions or workarounds that still let me benefit from the rest of Ganglia's functionality.


Again, thanks in advance!


--
Dan Moniz <[EMAIL PROTECTED]> [http://www.pobox.com/~dnm/]
