Hello,
I am fairly new to Ganglia, and have a problem with Ganglia
3.6.0 / Ganglia Web 3.5.10: In one of my clusters, after I
add approximately 55 hosts, the graphs go blank. I think
it's very similar to http://www.mail-archive.com/ganglia-general%40lists.sourceforge.net/msg07852.html,
but that thread is old and does not seem to have been
resolved.
I have five clusters -- four are small (up to 40 hosts) and
are working fine. The last one, "HPC Cluster", is where I am
having trouble. I start gmond on 3-5 servers at a
time, and after approximately 55 hosts the web site starts
showing blackouts. If I add more hosts, the blackouts become
permanent. They present as follows:
* In the Grid report, the number of hosts and cores for all
clusters is correct and all hosts are marked "up".
* In the Cluster report, the number of cores is correct but
some hosts are marked "down". Actually, most of the time
they're all marked "down".
* In both the Grid report and the Cluster report the graphs
for HPC Cluster are blank.
* None of this affects any of the other clusters.
I've read a number of threads and tried to anticipate the
most common questions.
* All servers use NTP, and running "date" on all of them
shows they're synchronized to within a second or so.
* I do not see the message "illegal attempt to update using
time X when last update time is X".
* I moved gmetad to a bigger box (16 cores, 256 GB RAM,
negligible prior usage). This didn't even increase the
number of hosts I can add before the blackouts start.
* All data sources use the default polling interval, 15 seconds.
* Tried adding the servers in different orders.
* No errors in the logs, and no errors when I start gmond
and gmetad with -d.
* I ran "netstat -su" on all boxes, and there were no
packets dropped anywhere.
* I ran "netstat -au" on the gmetad box and "Recv-Q" and
"Send-Q" were always 0.
* The server where gmetad runs has a UDP buffer
(/proc/sys/net/core/rmem_max) of 4194304.
* I dumped the RRDs and in these blank areas the metrics are
"NaN".
* During blackouts, I tried "telnet <gmetad servicing
node> <port>" and always got an immediate and
apparently full response.
* The cluster's gmonds are all multicast, and all listen
and send. I tried unicast, and I tried a deaf/mute
multicast, without any improvement.
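For completeness, this is roughly how I ran the checks above. The hostname, the RRD path (Ganglia's default rrd_rootdir layout), and the 16 MB buffer value are examples rather than exact copies of my commands:

```shell
#!/bin/sh
# 1. NTP: confirm each host is locked to a peer (the line starting with '*')
ntpq -p | awk '/^\*/ {print "synced to", $1, "offset", $9, "ms"}'

# 2. UDP statistics: look for receive errors / dropped packets
#    (counters are cumulative since boot)
netstat -su | grep -iE 'errors|dropped'

# 3. UDP sockets on the gmetad box: print the header plus any socket
#    whose Recv-Q or Send-Q is non-zero
netstat -au | awk 'NR <= 2 || $2 > 0 || $3 > 0'

# 4. Kernel receive-buffer ceiling; raising it, e.g. to 16 MB, is a
#    cheap experiment in case bursts from ~100 hosts overflow 4 MB
sysctl net.core.rmem_max
# sysctl -w net.core.rmem_max=16777216   # persist via /etc/sysctl.conf

# 5. Ask an aggregator for its XML dump and count the hosts it knows
#    about; gmond answers with the full cluster state on its TCP port
nc n800 8650 | grep -c '<HOST '

# 6. Inspect one metric's RRD; rows in the blackout window print "nan"
rrdtool fetch "/var/lib/ganglia/rrds/HPC Cluster/n800/load_one.rrd" \
    AVERAGE --start -1h | head -20
```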
I guess that the fact that none of the other clusters is
impacted means it's not a resource issue. I therefore
assume it's a configuration or architecture issue. I can
post the configuration files of gmetad and gmond, but this
post is pretty long as it is. So, in short:
* I am using one gmetad for all clusters.
* The data source for "HPC Cluster" has 6 nodes servicing
it (n800, n816, n832, n848, n864, n880) and uses port 8650.
I originally had only two nodes; when the blackouts started,
I added more.
* Gmond on all hosts uses six multicast channels, the same
6 nodes (n800, n816, n832, n848, n864, n880) on port 8650.
I originally had a single multicast channel; when the
blackouts started, I added more.
* Gmond on all hosts listens on UDP and TCP on port 8650.
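For reference, here is a trimmed sketch of the relevant fragments. The data_source line matches my setup; the gmond channels shown are Ganglia's stock multicast defaults (group 239.2.11.71), not a copy of my actual files:

```
# gmetad.conf -- one data_source line per cluster; 15 is the
# polling interval in seconds
data_source "HPC Cluster" 15 n800:8650 n816:8650 n832:8650 n848:8650 n864:8650 n880:8650

/* gmond.conf -- with multicast, every member already sees the
   whole cluster's traffic, so a single channel is the usual setup */
udp_send_channel {
  mcast_join = 239.2.11.71
  port = 8650
  ttl = 1
}
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8650
  bind = 239.2.11.71
}
tcp_accept_channel {
  port = 8650
}
```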
Since none of the other clusters is impacted, I could
probably split this cluster into smaller clusters and those
would work, but that would make reporting the full cluster
usage more painful.
Any suggestions or ideas would be welcome.