We are using Ganglia to monitor our cloud infrastructure on Amazon AWS.
Everything is working correctly (metrics are flowing, etc.), except that
occasionally the gmetad process segfaults out of the blue. The gmetad
process runs on an m3.medium EC2 instance and monitors about 50 servers.
The servers are arranged into groups, each with a bastion EC2 instance
where the metrics are gathered. gmetad is configured to pull the metrics
from those bastions - about 10 of them.
Some useful facts:
- We are running Debian Wheezy on all the EC2s
- Sometimes the crash will happen multiple times in a day, sometimes
it'll be a day or two before it crashes
- In normal operation the crash leaves no logs other than a kernel
segfault message along the lines of "gmetad[11291]: segfault at 71 ip
000000000040547c sp 00007ff2d6572260 error 4 in gmetad[400000+e000]". If
we run gmetad manually with debug logging, the crash appears to happen
while gmetad is doing a cleanup.
- When we realised that the cleanup process might be to blame, we dug
into it further and found that our disk IO was far too high, so we added
rrdcached to reduce it. Disk IO is now much lower and the crash occurs
less often, but still about once a day on average.
- We have two systems (dev and production). Both exhibit this crash, but
the dev system, which monitors a much smaller group of servers, crashes
significantly less often.
- The production system is running ganglia 3.3.8-1+nmu1/rrdtool 1.4.7-2.
We've upgraded ganglia in the dev systems to ganglia
3.6.0-2~bpo70+1/rrdtool 1.4.7-2. That doesn't seem to have helped with the
crash.
- We have monit running on both systems configured to restart gmetad if
it dies. It restarts immediately with no issues.
- The production system stores its data on a magnetic disk; the dev
system uses an SSD. That doesn't appear to have changed the frequency of
the crash.
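For completeness, the monit watchdog that restarts gmetad is a stanza along these lines (the pidfile and init script paths are our Debian defaults; yours may differ):

```
check process gmetad with pidfile /var/run/gmetad.pid
  start program = "/etc/init.d/gmetad start"
  stop program  = "/etc/init.d/gmetad stop"
```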
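To pin down where in the cleanup the crash happens, our next step is to capture a core dump and get a backtrace. This is a rough sketch of what we're trying; the core pattern, paths, and the debug-symbol package name are assumptions for our Debian Wheezy boxes, not verified specifics:

```shell
# Allow core dumps and send them somewhere predictable.
ulimit -c unlimited
echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern

# Debug symbols make the backtrace readable; the package name here
# is a guess -- check what your Debian mirror actually ships.
apt-get install gdb gmetad-dbg

# Run gmetad in the foreground with debug output until it segfaults.
gmetad -d 2

# Once it crashes, open the core file and grab a full backtrace:
#   gdb /usr/sbin/gmetad /tmp/core.gmetad.<PID>
#   (gdb) bt full
```

If anyone has a backtrace from a similar crash, it would be very useful for comparison.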
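For anyone else hitting high disk IO from gmetad's RRD writes, this is roughly how we wired in rrdcached. The socket path, journal directory, and timer values below are illustrative rather than our exact production settings, and the RRDCACHED_ADDRESS environment variable is what we believe gmetad/librrd honours - correct us if that's wrong:

```shell
# Start rrdcached: batch RRD updates in memory, journal them, and
# flush to disk every 30 minutes instead of on every metric.
rrdcached -l unix:/var/run/rrdcached.sock \
          -j /var/lib/rrdcached/journal \
          -w 1800 -z 1800 -f 3600 \
          -b /var/lib/ganglia/rrds -B

# Point gmetad at the daemon (e.g. exported from /etc/default/gmetad).
export RRDCACHED_ADDRESS=unix:/var/run/rrdcached.sock
```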
Has anyone experienced this kind of crash, especially on Amazon hardware?
We're at our wits' end trying to find a solution!
_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general