Re: [Ganglia-general] gmetad segfaults after running for a while (on AWS EC2)

2014-09-21 Thread Sam Barham
The debug build of 3.6.0 finally crashed over the weekend.  The backtrace
is:
#0  0x7f042e4ba38c in hash_insert (key=0x7f0425bcc440,
val=0x7f0425bcc430, hash=0x7239d0) at hash.c:233
#1  0x00408551 in startElement_METRIC (data=0x7f0425bcc770,
el=0x733930 "METRIC", attr=0x709270) at process_xml.c:677
#2  0x004092b2 in start (data=0x7f0425bcc770, el=0x733930 "METRIC",
attr=0x709270) at process_xml.c:1036
#3  0x7f042d55b5fb in ?? () from /lib/x86_64-linux-gnu/libexpat.so.1
#4  0x7f042d55c84e in ?? () from /lib/x86_64-linux-gnu/libexpat.so.1
#5  0x7f042d55e36e in ?? () from /lib/x86_64-linux-gnu/libexpat.so.1
#6  0x7f042d55eb1b in ?? () from /lib/x86_64-linux-gnu/libexpat.so.1
#7  0x7f042d560b5d in XML_ParseBuffer () from
/lib/x86_64-linux-gnu/libexpat.so.1
#8  0x00409953 in process_xml (d=0x618900,
buf=0x792360 "\n\n  \n  
wrote:

> Regardless of whether this is 3.3.8 or 3.6.0, the offending line is:
>
> WRITE_LOCK(hash, i);
>
> I was going to guess this was 3.6.0 because it's a different
> backtrace; however, the line number in process_xml.c doesn't make sense
> unless it is 3.3.8. What this implies is that the hash table is not
> properly protected by its mutex.
>
> There are 339 commits between 3.3.8 and the current master branch. I'd
> like to strongly suggest updating because I unfortunately do not have
> time to look through all the commit messages to see if this has been
> solved by work others have done.
>
> --dho
>
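
To see that failure mode in isolation, here is a toy, self-contained C program. It is
purely illustrative (hypothetical data structures, not gmetad's code): one thread keeps
inserting into a shared chain, the way the XML-processing thread inserts metrics, while
another thread walks and frees it, the way a cleanup pass would, with no locking at all.
Build with cc -pthread; on any given run it may crash inside the insert, crash inside the
walk, or appear to work, which matches how the real crash only shows up occasionally.

/* Toy reproduction of an unprotected insert racing a concurrent walk.
 * Hypothetical code, not gmetad's: "head" stands in for a hash bucket. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct node { int val; struct node *next; };

static struct node *head;            /* shared, intentionally unlocked */

static void *inserter(void *arg)     /* like the thread calling hash_insert */
{
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        struct node *n = malloc(sizeof(*n));
        if (n == NULL)
            break;
        n->val = i;
        n->next = head;              /* racy read of the bucket head */
        head = n;                    /* racy publish of a half-built chain */
    }
    return NULL;
}

static void *cleaner(void *arg)      /* like a cleanup pass walking and freeing */
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        struct node *n = head;       /* racy snapshot of the chain */
        head = NULL;
        while (n != NULL) {
            struct node *next = n->next;   /* may read freed or torn memory */
            free(n);
            n = next;
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, inserter, NULL);
    pthread_create(&b, NULL, cleaner, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    puts("survived this run; the race is still there");
    return 0;
}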


Re: [Ganglia-general] gmetad segfaults after running for a while (on AWS EC2)

2014-09-16 Thread Devon H. O'Dell
Regardless of whether this is 3.3.8 or 3.6.0, the offending line is:

WRITE_LOCK(hash, i);

I was going to guess this was 3.6.0 because it's a different
backtrace; however, the line number in process_xml.c doesn't make sense
unless it is 3.3.8. What this implies is that the hash table is not
properly protected by its mutex.
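
For readers who have not looked at hash.c: below is a minimal sketch of the per-bucket
locking that WRITE_LOCK(hash, i) implies, with simplified, hypothetical types and names
(this is not the real Ganglia hash.c). The point is that the insert path must hold the
bucket's write lock for the whole chain update, and every reader must hold the read lock;
if any path skips the lock, a concurrent walk can see a half-linked bucket and crash
exactly where the backtrace above does.

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

typedef struct node {
    char *key;
    void *val;
    struct node *next;
} node_t;

typedef struct {
    size_t size;                     /* number of buckets */
    node_t **buckets;
    pthread_rwlock_t *locks;         /* one rwlock per bucket */
} hash_t;

static size_t bucket_of(const char *key, size_t size)
{
    size_t h = 5381;                 /* simple djb2-style string hash */
    while (*key)
        h = h * 33 + (unsigned char)*key++;
    return h % size;
}

/* Insert under the bucket's write lock -- the WRITE_LOCK(hash, i) idea. */
int hash_insert_sketch(hash_t *hash, const char *key, void *val)
{
    size_t i = bucket_of(key, hash->size);

    pthread_rwlock_wrlock(&hash->locks[i]);       /* "WRITE_LOCK(hash, i)" */
    node_t *n = malloc(sizeof(*n));
    if (n == NULL) {
        pthread_rwlock_unlock(&hash->locks[i]);
        return -1;
    }
    n->key = strdup(key);
    n->val = val;
    n->next = hash->buckets[i];                   /* link into the chain ...    */
    hash->buckets[i] = n;                         /* ... while holding the lock */
    pthread_rwlock_unlock(&hash->locks[i]);       /* "WRITE_UNLOCK(hash, i)" */
    return 0;
}

/* Readers hold the read lock for the whole walk of the bucket. */
void *hash_lookup_sketch(hash_t *hash, const char *key)
{
    size_t i = bucket_of(key, hash->size);
    void *found = NULL;

    pthread_rwlock_rdlock(&hash->locks[i]);       /* "READ_LOCK(hash, i)" */
    for (node_t *n = hash->buckets[i]; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0) {
            found = n->val;
            break;
        }
    pthread_rwlock_unlock(&hash->locks[i]);
    return found;
}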

There are 339 commits between 3.3.8 and the current master branch. I'd
like to strongly suggest updating because I unfortunately do not have
time to look through all the commit messages to see if this has been
solved by work others have done.

--dho



Re: [Ganglia-general] gmetad segfaults after running for a while (on AWS EC2)

2014-09-15 Thread Devon H. O'Dell
This is the prologue of some function and the second argument is NULL when
it shouldn't be. Unfortunately, the binary does appear to be stripped, so
it will be slightly hard to figure out which function it is. Your previous
email with the backtrace shows that it is walking the hash tree (probably
to aggregate), so it's possible that some probe is returning data that
can't be parsed or meaningfully interpreted. However, since it is a nested
walk, it might be possible to guess which metric is that deeply nested.

But not easily.
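
As a rough picture of that nested walk, here is a self-contained sketch using
hypothetical, heavily simplified stand-ins for datum_t and hash_foreach (not the real
libganglia signatures): an outer walk over hosts whose callback walks each host's
metrics, with every callback dereferencing its second (value) argument. The faulting
instruction in the dump, mov (%rsi),%rbx, is that kind of load of the second argument,
so a NULL value handed to one of these callbacks would give exactly a segfault at a
small offset.

#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins, just enough to show the shape of the walk. */
typedef struct { void *data; size_t size; } datum_t;
typedef struct entry { datum_t key; datum_t val; struct entry *next; } entry_t;
typedef struct { entry_t *head; } hash_t;

typedef int (*hash_cb)(datum_t *key, datum_t *val, void *arg);

static int hash_foreach_sketch(hash_t *h, hash_cb cb, void *arg)
{
    for (entry_t *e = h->head; e != NULL; e = e->next) {
        int rc = cb(&e->key, &e->val, arg);
        if (rc != 0)
            return rc;               /* callbacks can stop the walk early */
    }
    return 0;
}

struct host { hash_t metrics; };

/* Innermost callback: dereferences its second argument with no NULL check,
 * which is the hazard suggested by the crash at mov (%rsi),%rbx. */
static int visit_metric(datum_t *key, datum_t *val, void *arg)
{
    (void)key; (void)arg;
    printf("metric payload at %p, %zu bytes\n", val->data, val->size);
    return 0;
}

/* Middle callback: each host value is itself a hash of metrics, so the walk
 * nests -- hence the repeated hash_foreach frames in the backtrace. */
static int visit_host(datum_t *key, datum_t *val, void *arg)
{
    (void)key;
    struct host *h = val->data;
    return hash_foreach_sketch(&h->metrics, visit_metric, arg);
}

int main(void)
{
    double sample = 42.0;
    entry_t metric  = { .key = { "load_one", 9 },
                        .val = { &sample, sizeof sample }, .next = NULL };
    struct host web = { .metrics = { &metric } };
    entry_t hostent = { .key = { "web01", 6 },
                        .val = { &web, sizeof web }, .next = NULL };
    hash_t hosts    = { &hostent };

    /* One more level of nesting (sources -> hosts -> metrics) is what the
     * real summary code does; two levels are enough to illustrate. */
    return hash_foreach_sketch(&hosts, visit_host, NULL);
}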

This also means running under gdb is probably pointless. Do you have the
ability to run a version with debugging symbols? If so, that is probably a
faster route to a solution than anything I can surmise from digging through
the assembly.
On Sep 15, 2014 6:57 PM, "Sam Barham" wrote:

> I can't read assembly, so this doesn't mean much to me, but hopefully
> it'll mean something to you :)
>
>
>   40540e:   e9 fc fe ff ff          jmpq   40530f
>   405413:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>   405418:   48 89 de                mov    %rbx,%rsi
>   40541b:   e8 b0 fd ff ff          callq  4051d0
>   405420:   48 8b 7b 18             mov    0x18(%rbx),%rdi
>   405424:   48 85 ff                test   %rdi,%rdi
>   405427:   74 0d                   je     405436
>   405429:   4c 89 e2                mov    %r12,%rdx
>   40542c:   be 60 54 40 00          mov    $0x405460,%esi
>   405431:   e8 ca d3 ff ff          callq  402800
>   405436:   31 c0                   xor    %eax,%eax
>   405438:   e9 f8 fe ff ff          jmpq   405335
>   40543d:   0f 1f 00                nopl   (%rax)
>   405440:   31 c9                   xor    %ecx,%ecx
>   405442:   4c 89 ea                mov    %r13,%rdx
>   405445:   31 f6                   xor    %esi,%esi
>   405447:   4c 89 e7                mov    %r12,%rdi
>   40544a:   4c 89 04 24             mov    %r8,(%rsp)
>   40544e:   e8 3d fe ff ff          callq  405290
>   405453:   4c 8b 04 24             mov    (%rsp),%r8
>   405457:   89 c5                   mov    %eax,%ebp
>   405459:   eb ab                   jmp    405406
>   40545b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>   405460:   48 89 6c 24 f0          mov    %rbp,-0x10(%rsp)
>   405465:   4c 89 64 24 f8          mov    %r12,-0x8(%rsp)
>   40546a:   49 89 fc                mov    %rdi,%r12
>   40546d:   48 89 5c 24 e8          mov    %rbx,-0x18(%rsp)
>   405472:   48 83 ec 18             sub    $0x18,%rsp
>   405476:   8b 7a 18                mov    0x18(%rdx),%edi
>   405479:   48 89 d5                mov    %rdx,%rbp
>   40547c:   48 8b 1e                mov    (%rsi),%rbx
>   40547f:   85 ff                   test   %edi,%edi
>   405481:   74 0c                   je     40548f
>   405483:   48 89 de                mov    %rbx,%rsi
>   405486:   e8 15 fd ff ff          callq  4051a0
>   40548b:   85 c0                   test   %eax,%eax
>   40548d:   74 12                   je     4054a1
>   40548f:   31 c9                   xor    %ecx,%ecx
>   405491:   48 89 ea                mov    %rbp,%rdx
>   405494:   4c 89 e6                mov    %r12,%rsi
>   405497:   48 89 df                mov    %rbx,%rdi
>   40549a:   ff 53 08                callq  *0x8(%rbx)
>   40549d:   85 c0                   test   %eax,%eax
>   40549f:   74 1f                   je     4054c0
>   4054a1:   b8 01 00 00 00          mov    $0x1,%eax
>   4054a6:   48 8b 1c 24             mov    (%rsp),%rbx
>   4054aa:   48 8b 6c 24 08          mov    0x8(%rsp),%rbp
>   4054af:   4c 8b 64 24 10          mov    0x10(%rsp),%r12
>   4054b4:   48 83 c4 18             add    $0x18,%rsp
>   4054b8:   c3                      retq
>   4054b9:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
>   4054c0:   48 89 ef                mov    %rbp,%rdi
>   4054c3:   48 89 de                mov    %rbx,%rsi
>   4054c6:   e8 05 fd ff ff          callq  4051d0
>   4054cb:   48 8b 7b 18             mov    0x18(%rbx),%rdi
>
>
> On Tue, Sep 16, 2014 at 12:45 PM, Devon H. O'Dell wrote:
>
>> If you can install the dbg or dbgsym package for this, you can get
>> more information. If you cannot do this, running:
>>
>> objdump -d `which gmetad` | less
>>
>> in less:
>>
>> /40547c
>>
>> Paste a little context of the disassembly before and after that
>> address, then scroll up and paste which function it's in. (That might
>> still be too little information or even bad information if the binary
>> is stripped. But it's something.)
>>
>> --dho
>>
>> 2014-09-14 18:09 GMT-07:00 Sam Barham :
>> > I've finally managed to generate a core dump (the VM wasn't set up to
>> do it
>> > yet), but it's 214Mb and doesn't seem to contain anything helpful -
>> > especially as I don't have debug symbols.  The backtrace shows:
>> > 

Re: [Ganglia-general] gmetad segfaults after running for a while (on AWS EC2)

2014-09-15 Thread Sam Barham
I can't read assembly, so this doesn't mean much to me, but hopefully it'll
mean something to you :)


  40540e:   e9 fc fe ff ff          jmpq   40530f
  405413:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  405418:   48 89 de                mov    %rbx,%rsi
  40541b:   e8 b0 fd ff ff          callq  4051d0
  405420:   48 8b 7b 18             mov    0x18(%rbx),%rdi
  405424:   48 85 ff                test   %rdi,%rdi
  405427:   74 0d                   je     405436
  405429:   4c 89 e2                mov    %r12,%rdx
  40542c:   be 60 54 40 00          mov    $0x405460,%esi
  405431:   e8 ca d3 ff ff          callq  402800
  405436:   31 c0                   xor    %eax,%eax
  405438:   e9 f8 fe ff ff          jmpq   405335
  40543d:   0f 1f 00                nopl   (%rax)
  405440:   31 c9                   xor    %ecx,%ecx
  405442:   4c 89 ea                mov    %r13,%rdx
  405445:   31 f6                   xor    %esi,%esi
  405447:   4c 89 e7                mov    %r12,%rdi
  40544a:   4c 89 04 24             mov    %r8,(%rsp)
  40544e:   e8 3d fe ff ff          callq  405290
  405453:   4c 8b 04 24             mov    (%rsp),%r8
  405457:   89 c5                   mov    %eax,%ebp
  405459:   eb ab                   jmp    405406
  40545b:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  405460:   48 89 6c 24 f0          mov    %rbp,-0x10(%rsp)
  405465:   4c 89 64 24 f8          mov    %r12,-0x8(%rsp)
  40546a:   49 89 fc                mov    %rdi,%r12
  40546d:   48 89 5c 24 e8          mov    %rbx,-0x18(%rsp)
  405472:   48 83 ec 18             sub    $0x18,%rsp
  405476:   8b 7a 18                mov    0x18(%rdx),%edi
  405479:   48 89 d5                mov    %rdx,%rbp
  40547c:   48 8b 1e                mov    (%rsi),%rbx
  40547f:   85 ff                   test   %edi,%edi
  405481:   74 0c                   je     40548f
  405483:   48 89 de                mov    %rbx,%rsi
  405486:   e8 15 fd ff ff          callq  4051a0
  40548b:   85 c0                   test   %eax,%eax
  40548d:   74 12                   je     4054a1
  40548f:   31 c9                   xor    %ecx,%ecx
  405491:   48 89 ea                mov    %rbp,%rdx
  405494:   4c 89 e6                mov    %r12,%rsi
  405497:   48 89 df                mov    %rbx,%rdi
  40549a:   ff 53 08                callq  *0x8(%rbx)
  40549d:   85 c0                   test   %eax,%eax
  40549f:   74 1f                   je     4054c0
  4054a1:   b8 01 00 00 00          mov    $0x1,%eax
  4054a6:   48 8b 1c 24             mov    (%rsp),%rbx
  4054aa:   48 8b 6c 24 08          mov    0x8(%rsp),%rbp
  4054af:   4c 8b 64 24 10          mov    0x10(%rsp),%r12
  4054b4:   48 83 c4 18             add    $0x18,%rsp
  4054b8:   c3                      retq
  4054b9:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  4054c0:   48 89 ef                mov    %rbp,%rdi
  4054c3:   48 89 de                mov    %rbx,%rsi
  4054c6:   e8 05 fd ff ff          callq  4051d0
  4054cb:   48 8b 7b 18             mov    0x18(%rbx),%rdi


On Tue, Sep 16, 2014 at 12:45 PM, Devon H. O'Dell wrote:

> If you can install the dbg or dbgsym package for this, you can get
> more information. If you cannot do this, running:
>
> objdump -d `which gmetad` | less
>
> in less:
>
> /40547c
>
> Paste a little context of the disassembly before and after that
> address, then scroll up and paste which function it's in. (That might
> still be too little information or even bad information if the binary
> is stripped. But it's something.)
>
> --dho
>
> 2014-09-14 18:09 GMT-07:00 Sam Barham :
> > I've finally managed to generate a core dump (the VM wasn't set up to do
> it
> > yet), but it's 214Mb and doesn't seem to contain anything helpful -
> > especially as I don't have debug symbols.  The backtrace shows:
> > #0  0x0040547c in ?? ()
> > #1  0x7f600a49a245 in hash_foreach () from
> > /usr/lib/libganglia-3.3.8.so.0
> > #2  0x004054e1 in ?? ()
> > #3  0x7f600a49a245 in hash_foreach () from
> > /usr/lib/libganglia-3.3.8.so.0
> > #4  0x004054e1 in ?? ()
> > #5  0x7f600a49a245 in hash_foreach () from
> > /usr/lib/libganglia-3.3.8.so.0
> > #6  0x00405436 in ?? ()
> > #7  0x0040530d in ?? ()
> > #8  0x004058fa in ?? ()
> > #9  0x7f6008ef9b50 in start_thread () from
> > /lib/x86_64-linux-gnu/libpthread.so.0
> > #10 0x7f6008c43e6d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> > #11 0x0000000000000000 in ?? ()
> >
> > Is there a way for me to get more useful information out of it?
> >
> > On Fri, Sep 12, 2014 at 10:11 AM, Devon H. O'Dell wrote:
> >>
> >> Are you able to share a core file?
> >>
> >> 2014-09-11 14:32 GMT-07:00 Sam Barham :
> >> > We are using Ganglia to monito

Re: [Ganglia-general] gmetad segfaults after running for a while (on AWS EC2)

2014-09-15 Thread Devon H. O'Dell
If you can install the dbg or dbgsym package for this, you can get
more information. If you cannot do this, running:

objdump -d `which gmetad` | less

in less:

/40547c

Paste a little context of the disassembly before and after that
address, then scroll up and paste which function it's in. (That might
still be too little information or even bad information if the binary
is stripped. But it's something.)

--dho

2014-09-14 18:09 GMT-07:00 Sam Barham :
> I've finally managed to generate a core dump (the VM wasn't set up to do it
> yet), but it's 214Mb and doesn't seem to contain anything helpful -
> especially as I don't have debug symbols.  The backtrace shows:
> #0  0x0040547c in ?? ()
> #1  0x7f600a49a245 in hash_foreach () from
> /usr/lib/libganglia-3.3.8.so.0
> #2  0x004054e1 in ?? ()
> #3  0x7f600a49a245 in hash_foreach () from
> /usr/lib/libganglia-3.3.8.so.0
> #4  0x004054e1 in ?? ()
> #5  0x7f600a49a245 in hash_foreach () from
> /usr/lib/libganglia-3.3.8.so.0
> #6  0x00405436 in ?? ()
> #7  0x0040530d in ?? ()
> #8  0x004058fa in ?? ()
> #9  0x7f6008ef9b50 in start_thread () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #10 0x7f6008c43e6d in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #11 0x0000000000000000 in ?? ()
>
> Is there a way for me to get more useful information out of it?
>
> On Fri, Sep 12, 2014 at 10:11 AM, Devon H. O'Dell wrote:
>>
>> Are you able to share a core file?
>>
>> 2014-09-11 14:32 GMT-07:00 Sam Barham :
>> > We are using Ganglia to monitor our cloud infrastructure on Amazon
>> > AWS.
>> > Everything is working correctly (metrics are flowing etc), except that
>> > occasionally the gmetad process will segfault out of the blue. The
>> > gmetad
>> > process is running on an m3.medium EC2, and is monitoring about 50
>> > servers.
>> > The servers are arranged into groups, each one having a bastion EC2
>> > where
>> > the metrics are gathered. gmetad is configured to grab the metrics from
>> > those bastions - about 10 of them.
>> >
>> > Some useful facts:
>> >
>> > We are running Debian Wheezy on all the EC2s
>> > Sometimes the crash will happen multiple times in a day, sometimes it'll
>> > be
>> > a day or two before it crashes
>> > The crash creates no logs in normal operation other than a segfault log
>> > something like "gmetad[11291]: segfault at 71 ip 0040547c sp
>> > 7ff2d6572260 error 4 in gmetad[40+e000]". If we run gmetad
>> > manually
>> > with debug logging, it appears that the crash is related to gmetad doing
>> > a
>> > cleanup.
>> > When we realised that the cleanup process might be to blame we did more
>> > research around that. We realised that our disk IO was way too high and
>> > added rrdcached in order to reduce it. The disk IO is now much lower,
>> > and
>> > the crash is occurring less often, but still an average of once a day or
>> > so.
>> > We have two systems (dev and production). Both exhibit this crash, but
>> > the
>> > dev system, which is monitoring a much smaller group of servers, crashes
>> > significantly less often.
>> > The production system is running ganglia 3.3.8-1+nmu1/rrdtool 1.4.7-2.
>> > We've
>> > upgraded ganglia in the dev systems to ganglia 3.6.0-2~bpo70+1/rrdtool
>> > 1.4.7-2. That doesn't seem to have helped with the crash.
>> > We have monit running on both systems configured to restart gmetad if it
>> > dies. It restarts immediately with no issues.
>> > The production system is storing its data on a magnetic disk, and the dev
>> > system is using an SSD. That doesn't appear to have changed the frequency
>> > of
>> > the crash.
>> >
>> > Has anyone experienced this kind of crash, especially on Amazon
>> > hardware?
>> > We're at our wits' end trying to find a solution!
>> >
>> >
>> >

Re: [Ganglia-general] gmetad segfaults after running for a while (on AWS EC2)

2014-09-14 Thread Sam Barham
I've finally managed to generate a core dump (the VM wasn't set up to do it
yet), but it's 214Mb and doesn't seem to contain anything helpful -
especially as I don't have debug symbols.  The backtrace shows:
#0  0x0040547c in ?? ()
#1  0x7f600a49a245 in hash_foreach () from
/usr/lib/libganglia-3.3.8.so.0
#2  0x004054e1 in ?? ()
#3  0x7f600a49a245 in hash_foreach () from
/usr/lib/libganglia-3.3.8.so.0
#4  0x004054e1 in ?? ()
#5  0x7f600a49a245 in hash_foreach () from
/usr/lib/libganglia-3.3.8.so.0
#6  0x00405436 in ?? ()
#7  0x0040530d in ?? ()
#8  0x004058fa in ?? ()
#9  0x7f6008ef9b50 in start_thread () from
/lib/x86_64-linux-gnu/libpthread.so.0
#10 0x7f6008c43e6d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x0000000000000000 in ?? ()

Is there a way for me to get more useful information out of it?

On Fri, Sep 12, 2014 at 10:11 AM, Devon H. O'Dell wrote:

> Are you able to share a core file?
>
> 2014-09-11 14:32 GMT-07:00 Sam Barham :
> > We are using Ganglia to monitor our cloud infrastructure on Amazon
> AWS.
> > Everything is working correctly (metrics are flowing etc), except that
> > occasionally the gmetad process will segfault out of the blue. The gmetad
> > process is running on an m3.medium EC2, and is monitoring about 50
> servers.
> > The servers are arranged into groups, each one having a bastion EC2 where
> > the metrics are gathered. gmetad is configured to grab the metrics from
> > those bastions - about 10 of them.
> >
> > Some useful facts:
> >
> > We are running Debian Wheezy on all the EC2s
> > Sometimes the crash will happen multiple times in a day, sometimes it'll
> be
> > a day or two before it crashes
> > The crash creates no logs in normal operation other than a segfault log
> > something like "gmetad[11291]: segfault at 71 ip 0040547c sp
> > 7ff2d6572260 error 4 in gmetad[40+e000]". If we run gmetad
> manually
> > with debug logging, it appears that the crash is related to gmetad doing
> a
> > cleanup.
> > When we realised that the cleanup process might be to blame we did more
> > research around that. We realised that our disk IO was way too high and
> > added rrdcached in order to reduce it. The disk IO is now much lower, and
> > the crash is occurring less often, but still an average of once a day or
> so.
> > We have two systems (dev and production). Both exhibit this crash, but
> the
> > dev system, which is monitoring a much smaller group of servers, crashes
> > significantly less often.
> > The production system is running ganglia 3.3.8-1+nmu1/rrdtool 1.4.7-2.
> We've
> > upgraded ganglia in the dev systems to ganglia 3.6.0-2~bpo70+1/rrdtool
> > 1.4.7-2. That doesn't seem to have helped with the crash.
> > We have monit running on both systems configured to restart gmetad if it
> > dies. It restarts immediately with no issues.
> > The production system is storing its data on a magnetic disk, and the dev
> > system is using an SSD. That doesn't appear to have changed the frequency
> of
> > the crash.
> >
> > Has anyone experienced this kind of crash, especially on Amazon hardware?
> > We're at our wits' end trying to find a solution!
> >
> >
> >


Re: [Ganglia-general] gmetad segfaults after running for a while (on AWS EC2)

2014-09-11 Thread Devon H. O'Dell
Are you able to share a core file?

2014-09-11 14:32 GMT-07:00 Sam Barham :
> We are using Ganglia to monitor our cloud infrastructure on Amazon AWS.
> Everything is working correctly (metrics are flowing etc), except that
> occasionally the gmetad process will segfault out of the blue. The gmetad
> process is running on an m3.medium EC2, and is monitoring about 50 servers.
> The servers are arranged into groups, each one having a bastion EC2 where
> the metrics are gathered. gmetad is configured to grab the metrics from
> those bastions - about 10 of them.
>
> Some useful facts:
>
> We are running Debian Wheezy on all the EC2s
> Sometimes the crash will happen multiple times in a day, sometimes it'll be
> a day or two before it crashes
> The crash creates no logs in normal operation other than a segfault log
> something like "gmetad[11291]: segfault at 71 ip 0040547c sp
> 7ff2d6572260 error 4 in gmetad[40+e000]". If we run gmetad manually
> with debug logging, it appears that the crash is related to gmetad doing a
> cleanup.
> When we realised that the cleanup process might be to blame we did more
> research around that. We realised that our disk IO was way too high and
> added rrdcached in order to reduce it. The disk IO is now much lower, and
> the crash is occurring less often, but still an average of once a day or so.
> We have two systems (dev and production). Both exhibit this crash, but the
> dev system, which is monitoring a much smaller group of servers, crashes
> significantly less often.
> The production system is running ganglia 3.3.8-1+nmu1/rrdtool 1.4.7-2. We've
> upgraded ganglia in the dev systems to ganglia 3.6.0-2~bpo70+1/rrdtool
> 1.4.7-2. That doesn't seem to have helped with the crash.
> We have monit running on both systems configured to restart gmetad if it
> dies. It restarts immediately with no issues.
> The production system is storing its data on a magnetic disk, and the dev
> system is using an SSD. That doesn't appear to have changed the frequency of
> the crash.
>
> Has anyone experienced this kind of crash, especially on Amazon hardware?
> We're at our wits' end trying to find a solution!
>
>