I can't read assembly, so this doesn't mean much to me, but hopefully it'll
mean something to you :)
40540e: e9 fc fe ff ff jmpq 40530f <openlog@plt+0x242f>
405413: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
405418: 48 89 de mov %rbx,%rsi
40541b: e8 b0 fd ff ff callq 4051d0 <openlog@plt+0x22f0>
405420: 48 8b 7b 18 mov 0x18(%rbx),%rdi
405424: 48 85 ff test %rdi,%rdi
405427: 74 0d je 405436 <openlog@plt+0x2556>
405429: 4c 89 e2 mov %r12,%rdx
40542c: be 60 54 40 00 mov $0x405460,%esi
405431: e8 ca d3 ff ff callq 402800 <hash_foreach@plt>
405436: 31 c0 xor %eax,%eax
405438: e9 f8 fe ff ff jmpq 405335 <openlog@plt+0x2455>
40543d: 0f 1f 00 nopl (%rax)
405440: 31 c9 xor %ecx,%ecx
405442: 4c 89 ea mov %r13,%rdx
405445: 31 f6 xor %esi,%esi
405447: 4c 89 e7 mov %r12,%rdi
40544a: 4c 89 04 24 mov %r8,(%rsp)
40544e: e8 3d fe ff ff callq 405290 <openlog@plt+0x23b0>
405453: 4c 8b 04 24 mov (%rsp),%r8
405457: 89 c5 mov %eax,%ebp
405459: eb ab jmp 405406 <openlog@plt+0x2526>
40545b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
405460: 48 89 6c 24 f0 mov %rbp,-0x10(%rsp)
405465: 4c 89 64 24 f8 mov %r12,-0x8(%rsp)
40546a: 49 89 fc mov %rdi,%r12
40546d: 48 89 5c 24 e8 mov %rbx,-0x18(%rsp)
405472: 48 83 ec 18 sub $0x18,%rsp
405476: 8b 7a 18 mov 0x18(%rdx),%edi
405479: 48 89 d5 mov %rdx,%rbp
40547c: 48 8b 1e mov (%rsi),%rbx
40547f: 85 ff test %edi,%edi
405481: 74 0c je 40548f <openlog@plt+0x25af>
405483: 48 89 de mov %rbx,%rsi
405486: e8 15 fd ff ff callq 4051a0 <openlog@plt+0x22c0>
40548b: 85 c0 test %eax,%eax
40548d: 74 12 je 4054a1 <openlog@plt+0x25c1>
40548f: 31 c9 xor %ecx,%ecx
405491: 48 89 ea mov %rbp,%rdx
405494: 4c 89 e6 mov %r12,%rsi
405497: 48 89 df mov %rbx,%rdi
40549a: ff 53 08 callq *0x8(%rbx)
40549d: 85 c0 test %eax,%eax
40549f: 74 1f je 4054c0 <openlog@plt+0x25e0>
4054a1: b8 01 00 00 00 mov $0x1,%eax
4054a6: 48 8b 1c 24 mov (%rsp),%rbx
4054aa: 48 8b 6c 24 08 mov 0x8(%rsp),%rbp
4054af: 4c 8b 64 24 10 mov 0x10(%rsp),%r12
4054b4: 48 83 c4 18 add $0x18,%rsp
4054b8: c3 retq
4054b9: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
4054c0: 48 89 ef mov %rbp,%rdi
4054c3: 48 89 de mov %rbx,%rsi
4054c6: e8 05 fd ff ff callq 4051d0 <openlog@plt+0x22f0>
4054cb: 48 8b 7b 18 mov 0x18(%rbx),%rdi
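The ip from the segfault log (40547c) falls in the middle of the block above. In case it's useful, I can also try pulling the same context straight out of the core with gdb - a rough sketch of what I'd run, assuming the binary lives at /usr/sbin/gmetad (I'm not certain of the path on our boxes):

  gdb /usr/sbin/gmetad /path/to/core
  (gdb) info symbol 0x40547c
  (gdb) x/30i 0x405460
  (gdb) info registers

As I understand it, "info symbol" prints the nearest symbol to the faulting address, "x/30i" disassembles 30 instructions starting at 0x405460 (just above the faulting instruction in the dump above), and "info registers" shows the register values at the time of the crash. If I can get the dbgsym package installed as you suggested, the same session should show real function names instead of the openlog@plt offsets.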
On Tue, Sep 16, 2014 at 12:45 PM, Devon H. O'Dell <devon.od...@gmail.com>
wrote:
> If you can install the dbg or dbgsym package for this, you can get
> more information. If you cannot do this, running:
>
> objdump -d `which gmetad` | less
>
> in less:
>
> /40547c
>
> Paste a little context of the disassembly before and after that
> address, then scroll up and paste which function it's in. (That might
> still be too little information or even bad information if the binary
> is stripped. But it's something.)
>
> --dho
>
> 2014-09-14 18:09 GMT-07:00 Sam Barham <s.bar...@adinstruments.com>:
> > I've finally managed to generate a core dump (the VM wasn't set up to do it
> > yet), but it's 214 MB and doesn't seem to contain anything helpful -
> > especially as I don't have debug symbols. The backtrace shows:
> > #0 0x000000000040547c in ?? ()
> > #1 0x00007f600a49a245 in hash_foreach () from
> > /usr/lib/libganglia-3.3.8.so.0
> > #2 0x00000000004054e1 in ?? ()
> > #3 0x00007f600a49a245 in hash_foreach () from
> > /usr/lib/libganglia-3.3.8.so.0
> > #4 0x00000000004054e1 in ?? ()
> > #5 0x00007f600a49a245 in hash_foreach () from
> > /usr/lib/libganglia-3.3.8.so.0
> > #6 0x0000000000405436 in ?? ()
> > #7 0x000000000040530d in ?? ()
> > #8 0x00000000004058fa in ?? ()
> > #9 0x00007f6008ef9b50 in start_thread () from
> > /lib/x86_64-linux-gnu/libpthread.so.0
> > #10 0x00007f6008c43e6d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> > #11 0x0000000000000000 in ?? ()
> >
> > Is there a way for me to get more useful information out of it?
> >
> > On Fri, Sep 12, 2014 at 10:11 AM, Devon H. O'Dell <devon.od...@gmail.com>
> > wrote:
> >>
> >> Are you able to share a core file?
> >>
> >> 2014-09-11 14:32 GMT-07:00 Sam Barham <s.bar...@adinstruments.com>:
> >> > We are using Ganglia to monitor our cloud infrastructure on Amazon AWS.
> >> > Everything is working correctly (metrics are flowing etc), except that
> >> > occasionally the gmetad process will segfault out of the blue. The gmetad
> >> > process is running on an m3.medium EC2, and is monitoring about 50 servers.
> >> > The servers are arranged into groups, each one having a bastion EC2 where
> >> > the metrics are gathered. gmetad is configured to grab the metrics from
> >> > those bastions - about 10 of them.
> >> >
> >> > Some useful facts:
> >> >
> >> > - We are running Debian Wheezy on all the EC2s.
> >> > - Sometimes the crash will happen multiple times in a day; sometimes it'll
> >> >   be a day or two before it crashes.
> >> > - The crash creates no logs in normal operation other than a segfault log
> >> >   something like "gmetad[11291]: segfault at 71 ip 000000000040547c sp
> >> >   00007ff2d6572260 error 4 in gmetad[400000+e000]". If we run gmetad
> >> >   manually with debug logging, it appears that the crash is related to
> >> >   gmetad doing a cleanup.
> >> > - When we realised that the cleanup process might be to blame, we did more
> >> >   research around that. We realised that our disk IO was way too high and
> >> >   added rrdcached in order to reduce it. The disk IO is now much lower, and
> >> >   the crash is occurring less often, but still an average of once a day or so.
> >> > - We have two systems (dev and production). Both exhibit this crash, but the
> >> >   dev system, which is monitoring a much smaller group of servers, crashes
> >> >   significantly less often.
> >> > - The production system is running ganglia 3.3.8-1+nmu1/rrdtool 1.4.7-2.
> >> >   We've upgraded ganglia in the dev system to ganglia 3.6.0-2~bpo70+1/rrdtool
> >> >   1.4.7-2. That doesn't seem to have helped with the crash.
> >> > - We have monit running on both systems, configured to restart gmetad if it
> >> >   dies. It restarts immediately with no issues.
> >> > - The production system is storing its data on a magnetic disk; the dev
> >> >   system is using an SSD. That doesn't appear to have changed the frequency
> >> >   of the crash.
> >> >
> >> > Has anyone experienced this kind of crash, especially on Amazon hardware?
> >> > We're at our wits' end trying to find a solution!