Are you able to share a core file?

2014-09-11 14:32 GMT-07:00 Sam Barham <s.bar...@adinstruments.com>:
> We are using Ganglia to monitor our cloud infrastructure on Amazon AWS.
> Everything is working correctly (metrics are flowing, etc.), except that
> occasionally the gmetad process segfaults out of the blue. The gmetad
> process runs on an m3.medium EC2 instance and monitors about 50 servers.
> The servers are arranged into groups, each with a bastion EC2 instance
> where the metrics are gathered. gmetad is configured to grab the metrics
> from those bastions - about 10 of them.
>
> Some useful facts:
>
> We are running Debian Wheezy on all the EC2 instances.
> Sometimes the crash happens multiple times in a day; sometimes it is a
> day or two before it crashes.
> The crash produces no logs in normal operation other than a segfault
> entry such as "gmetad[11291]: segfault at 71 ip 000000000040547c sp
> 00007ff2d6572260 error 4 in gmetad[400000+e000]". If we run gmetad
> manually with debug logging, the crash appears to be related to gmetad
> doing a cleanup.
> Once we realised that the cleanup process might be to blame, we did more
> research around it. Our disk IO was far too high, so we added rrdcached
> to reduce it. The disk IO is now much lower and the crash occurs less
> often, but still about once a day on average.
> We have two systems (dev and production). Both exhibit this crash, but
> the dev system, which monitors a much smaller group of servers, crashes
> significantly less often.
> The production system runs ganglia 3.3.8-1+nmu1 / rrdtool 1.4.7-2. We
> have upgraded ganglia on the dev system to ganglia 3.6.0-2~bpo70+1 /
> rrdtool 1.4.7-2; that does not seem to have helped with the crash.
> We have monit running on both systems, configured to restart gmetad if
> it dies. It restarts immediately with no issues.
> The production system stores its data on a magnetic disk; the dev
> system uses SSD. That does not appear to have changed the frequency of
> the crash.
>
> Has anyone experienced this kind of crash, especially on Amazon
> hardware? We're at our wits' end trying to find a solution!
>
> ------------------------------------------------------------------------------
> Want excitement?
> Manually upgrade your production database.
> When you want reliability, choose Perforce
> Perforce version control. Predictably reliable.
> http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
> _______________________________________________
> Ganglia-general mailing list
> Ganglia-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/ganglia-general
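To get a core file from a daemon like gmetad on Debian Wheezy, something
along these lines could work. This is a sketch, not a definitive recipe:
the core_pattern value, the gmetad binary path, and the example core file
name are all assumptions that may need adjusting for your system.

```shell
#!/bin/sh
# Enable core dumps for processes started from this shell session
ulimit -c unlimited
echo "core limit: $(ulimit -c)"

# Send cores to a predictable path (needs root; the pattern is an example:
# %e = executable name, %p = pid)
if [ -w /proc/sys/kernel/core_pattern ]; then
    echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern
fi

# Run gmetad in the foreground with debug output until it segfaults;
# -d sets the debug level
if command -v gmetad >/dev/null 2>&1; then
    gmetad -d 2
fi

# After a crash, load the core into gdb for a backtrace, e.g.:
#   gdb /usr/sbin/gmetad /tmp/core.gmetad.12345
#   (gdb) bt full
```

Installing the relevant debug symbol packages before taking the backtrace
makes the resulting trace far more useful to the developers.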