Are you able to share a core file?

2014-09-11 14:32 GMT-07:00 Sam Barham <s.bar...@adinstruments.com>:
> We are using Ganglia to monitor our cloud infrastructure on Amazon AWS.
> Everything is working correctly (metrics are flowing, etc.), except that
> occasionally the gmetad process segfaults out of the blue. The gmetad
> process runs on an m3.medium EC2 instance and monitors about 50 servers.
> The servers are arranged into groups, each with a bastion EC2 instance
> where the metrics are gathered. gmetad is configured to grab the metrics
> from those bastions - about 10 of them.
>
> Some useful facts:
>
> We are running Debian Wheezy on all the EC2 instances.
> Sometimes the crash happens multiple times in a day; sometimes it is a
> day or two before it crashes.
> The crash produces no logs in normal operation other than a segfault
> entry such as "gmetad[11291]: segfault at 71 ip 000000000040547c sp
> 00007ff2d6572260 error 4 in gmetad[400000+e000]". If we run gmetad
> manually with debug logging, the crash appears to be related to gmetad
> doing a cleanup.
> Once we realised that the cleanup process might be to blame, we did more
> research around it. Our disk IO was far too high, so we added rrdcached
> to reduce it. The disk IO is now much lower and the crash occurs less
> often, but still about once a day on average.
> We have two systems (dev and production). Both exhibit this crash, but
> the dev system, which monitors a much smaller group of servers, crashes
> significantly less often.
> The production system runs ganglia 3.3.8-1+nmu1 / rrdtool 1.4.7-2. We
> have upgraded ganglia on the dev system to ganglia 3.6.0-2~bpo70+1 /
> rrdtool 1.4.7-2; that does not seem to have helped with the crash.
> We have monit running on both systems, configured to restart gmetad if
> it dies. It restarts immediately with no issues.
> The production system stores its data on a magnetic disk; the dev
> system uses SSD. That does not appear to have changed the frequency of
> the crash.
>
> Has anyone experienced this kind of crash, especially on Amazon
> hardware? We're at our wits' end trying to find a solution!
>
> ------------------------------------------------------------------------------
> Want excitement?
> Manually upgrade your production database.
> When you want reliability, choose Perforce
> Perforce version control. Predictably reliable.
> http://pubads.g.doubleclick.net/gampad/clk?id=157508191&iu=/4140/ostg.clktrk
> _______________________________________________
> Ganglia-general mailing list
> Ganglia-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/ganglia-general
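To get a core file from a daemon like gmetad on Debian Wheezy, something
along these lines could work. This is a sketch, not a definitive recipe:
the core_pattern value, the gmetad binary path, and the example core file
name are all assumptions that may need adjusting for your system.

```shell
#!/bin/sh
# Enable core dumps for processes started from this shell session
ulimit -c unlimited
echo "core limit: $(ulimit -c)"

# Send cores to a predictable path (needs root; the pattern is an example:
# %e = executable name, %p = pid)
if [ -w /proc/sys/kernel/core_pattern ]; then
    echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern
fi

# Run gmetad in the foreground with debug output until it segfaults;
# -d sets the debug level
if command -v gmetad >/dev/null 2>&1; then
    gmetad -d 2
fi

# After a crash, load the core into gdb for a backtrace, e.g.:
#   gdb /usr/sbin/gmetad /tmp/core.gmetad.12345
#   (gdb) bt full
```

Installing the relevant debug symbol packages before taking the backtrace
makes the resulting trace far more useful to the developers.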