If you can install the dbg or dbgsym package for gmetad, you can get
more information. If you cannot do that, run:

objdump -d `which gmetad` | less

then, in less, search for the faulting address from the segfault log:

/40547c

Paste a little of the disassembly from before and after that address,
then scroll up, note which function it's in, and paste that too. (That
might still be too little information, or even bad information if the
binary is stripped. But it's something.)
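
If you still have the core file around, running gdb against the
installed binary should give the same context. The paths below are
guesses (Debian usually installs gmetad as /usr/sbin/gmetad, and the
core file name depends on your core_pattern), so adjust as needed:

gdb /usr/sbin/gmetad /path/to/core
(gdb) info symbol 0x40547c
(gdb) x/20i 0x40547c

info symbol names the enclosing function if any symbols survive, and
x/20i disassembles forward from the faulting instruction. Since the
segfault log shows gmetad mapped at 0x400000, something like
addr2line -f -e /usr/sbin/gmetad 0x40547c may name the function as
well, assuming the binary keeps any symbols at all.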

--dho

2014-09-14 18:09 GMT-07:00 Sam Barham <s.bar...@adinstruments.com>:
> I've finally managed to generate a core dump (the VM wasn't set up to do it
> yet), but it's 214Mb and doesn't seem to contain anything helpful -
> especially as I don't have debug symbols.  The backtrace shows:
> #0  0x000000000040547c in ?? ()
> #1  0x00007f600a49a245 in hash_foreach () from /usr/lib/libganglia-3.3.8.so.0
> #2  0x00000000004054e1 in ?? ()
> #3  0x00007f600a49a245 in hash_foreach () from /usr/lib/libganglia-3.3.8.so.0
> #4  0x00000000004054e1 in ?? ()
> #5  0x00007f600a49a245 in hash_foreach () from /usr/lib/libganglia-3.3.8.so.0
> #6  0x0000000000405436 in ?? ()
> #7  0x000000000040530d in ?? ()
> #8  0x00000000004058fa in ?? ()
> #9  0x00007f6008ef9b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
> #10 0x00007f6008c43e6d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #11 0x0000000000000000 in ?? ()
>
> Is there a way for me to get more useful information out of it?
>
> On Fri, Sep 12, 2014 at 10:11 AM, Devon H. O'Dell <devon.od...@gmail.com>
> wrote:
>>
>> Are you able to share a core file?
>>
>> 2014-09-11 14:32 GMT-07:00 Sam Barham <s.bar...@adinstruments.com>:
>> > We are using Ganglia to monitor our cloud infrastructure on Amazon AWS.
>> > Everything is working correctly (metrics are flowing, etc.), except that
>> > occasionally the gmetad process will segfault out of the blue. The gmetad
>> > process is running on an m3.medium EC2 instance and is monitoring about
>> > 50 servers. The servers are arranged into groups, each with a bastion EC2
>> > instance where the metrics are gathered. gmetad is configured to grab the
>> > metrics from those bastions - about 10 of them.
>> >
>> > Some useful facts:
>> >
>> > - We are running Debian Wheezy on all the EC2 instances.
>> > - Sometimes the crash happens multiple times in a day; sometimes it's
>> >   a day or two before it crashes.
>> > - The crash creates no logs in normal operation other than a segfault
>> >   line like "gmetad[11291]: segfault at 71 ip 000000000040547c sp
>> >   00007ff2d6572260 error 4 in gmetad[400000+e000]". If we run gmetad
>> >   manually with debug logging, the crash appears to be related to
>> >   gmetad doing a cleanup.
>> > - When we realised the cleanup process might be to blame, we did more
>> >   research around that. Our disk IO was way too high, so we added
>> >   rrdcached to reduce it. The disk IO is now much lower and the crash
>> >   occurs less often, but still about once a day on average.
>> > - We have two systems (dev and production). Both exhibit the crash,
>> >   but the dev system, which monitors a much smaller group of servers,
>> >   crashes significantly less often.
>> > - The production system is running ganglia 3.3.8-1+nmu1 / rrdtool
>> >   1.4.7-2. We've upgraded the dev system to ganglia 3.6.0-2~bpo70+1 /
>> >   rrdtool 1.4.7-2. That doesn't seem to have helped with the crash.
>> > - We have monit running on both systems, configured to restart gmetad
>> >   if it dies. It restarts immediately with no issues.
>> > - The production system stores its data on a magnetic disk, the dev
>> >   system on SSD. That doesn't appear to have changed the frequency of
>> >   the crash.
>> >
>> > Has anyone experienced this kind of crash, especially on Amazon
>> > hardware? We're at our wits' end trying to find a solution!
>> >
