Unfortunately, without a coredump or backtrace where debug symbols are present, I'm not going to be able to offer any additional insight.
Are you running any C and / or Python modules with gmetad?

--dho

2015-12-11 5:54 GMT-08:00 Cristovao Cordeiro <[email protected]>:
> Hi guys,
>
> just to update on this:
> - I've removed my ganglia-gmetad/gmond and libganglia from everywhere and
> installed the most recent versions from the epel repository. The error is
> still there.
>
> Cumprimentos / Best regards,
> Cristóvão José Domingues Cordeiro
>
> ________________________________
> From: Cristovao Cordeiro
> Sent: 08 December 2015 11:49
> To: Marcello Morgotti
> Cc: [email protected]
>
> Subject: Re: [Ganglia-general] gmetad segmentation fault
>
> Hi everyone,
>
> sorry for the late reply.
> @Devon
> thanks for looking into it.
> I do have .so.0 and .so.0.0.0 in my system and I am not using any custom
> modules. The Ganglia deployment is however a bit different from the
> standard:
> - in one single VM, gmetad is running (always) and several gmond daemons
> are running in the background (daemon gmond -c /etc/ganglia/gmond_N.conf),
> all receiving metrics through unicast.
> The Ganglia package is built by me as well, from the source code. I am
> currently building and using Ganglia 3.7.1 (taken from
> http://sourceforge.net/projects/ganglia/files/ganglia%20monitoring%20core/3.7.1/).
> I build the Ganglia RPM myself for 2 reasons:
> 1 - have Ganglia available in YUM
> 2 - minor changes to ganglia-web's apache.conf
>
> I have other monitors running 3.6.0 and no errors there. But on those I have
> installed Ganglia manually and directly, without building an RPM.
>
> I also see 3.7.2 already available in the epel repository, so I might try
> this.
>
> Regarding the compilation with debug symbols…
>
> @Marcello
> did you get a chance to do it?
>
>
> Best regards,
> Cristóvão José Domingues Cordeiro
>
>
>
>
> On 24 Nov 2015, at 18:51, Marcello Morgotti <[email protected]> wrote:
>
> Hello,
>
> I'd like to join the discussion because this problem is affecting us as
> well.
> We have the problem on two different installations:
>
> 2 servers in active-active HA configuration, each with CentOS 7.1 +
> ganglia 3.7.2 + rrdcached monitoring systems A,B,C,D
> 2 servers in active-active HA configuration, each with RedHat 6.5 +
> ganglia 3.7.2 + rrdcached monitoring systems E,F,G,H
>
> In both cases the ganglia rpm packages are taken from the EPEL repository.
> The curious thing is that every time the segfault happens, it happens
> at almost the same time on both servers.
> I.e. for the CentOS 7 systems:
>
> Nov 15 12:27:35 rp02 kernel: traps: gmetad[2620] general protection ip:7fd70d62f82c sp:7fd6fdcb3af0 error:0 in libganglia.so.0.0.0[7fd70d624000+14000]
> Nov 15 12:27:35 rp02 systemd: gmetad.service: main process exited, code=killed, status=11/SEGV
> Nov 15 12:27:35 rp02 systemd: Unit gmetad.service entered failed state.
> Nov 15 12:27:41 rp01 kernel: traps: gmetad[6977] general protection ip:7fc1bdde582c sp:7fc1ae469af0 error:0 in libganglia.so.0.0.0[7fc1bddda000+14000]
> Nov 15 12:27:41 rp01 systemd: gmetad.service: main process exited, code=killed, status=11/SEGV
> Nov 15 12:27:41 rp01 systemd: Unit gmetad.service entered failed state.
>
>
> Hope this helps and adds information. I will try to build a debug
> version of gmetad to see if it's possible to generate a core dump.
>
> Best Regards,
> Marcello
>
> On 23/11/2015 17:30, Devon H. O'Dell wrote:
>
> It's just a system versioning thing for shared libraries. Usually .so
> is a soft link to .so.0 which is a soft link to .so.0.0.0. This is
> intended to be an ABI versioning interface, but it's not super
> frequently used. Are these legitimately different files on your
> system?
>
> The crash is in hash_delete:
>
> 0000003b2c00b780 <hash_delete>:
> ...
> 3b2c00b797: 48 8b 07              mov (%rdi),%rax
> 3b2c00b79a: 48 8d 34 30           lea (%rax,%rsi,1),%rsi
> 3b2c00b79e: 48 39 f0              cmp %rsi,%rax
> 3b2c00b7a1: 73 37                 jae 3b2c00b7da <hash_delete+0x5a>
> 3b2c00b7a3: 48 bf b3 01 00 00 00  movabs $0x100000001b3,%rdi
> 3b2c00b7aa: 01 00 00
> 3b2c00b7ad: 0f 1f 00              nopl (%rax)
>
> 3b2c00b7b0: 0f b6 08              movzbl (%rax),%ecx
>
> 3b2c00b7b3: 48 83 c0 01           add $0x1,%rax
> 3b2c00b7b7: 48 31 ca              xor %rcx,%rdx
> 3b2c00b7ba: 48 0f af d7           imul %rdi,%rdx
> 3b2c00b7be: 48 39 c6              cmp %rax,%rsi
> 3b2c00b7c1: 77 ed                 ja 3b2c00b7b0 <hash_delete+0x30>
> ...
>
> %rdi is the first argument to the function, so %rax is the datum_t
> *key, and (%rax) is key->data. hash_key has been inlined here.
> Unfortunately, what appears to be happening is that some key has
> already been removed from the hash table and freed, and based on your
> description of the problem, that was attempted concurrently. Your
> kernel crash shows that we were trying to dereference a NULL pointer,
> so it would appear that key->data is NULL.
>
> Unfortunately, it is not clear without a backtrace what sort of key
> specifically is in question here, but perhaps someone else might have
> some context based on recent changes. (I don't think this is related
> to my work on the hashes).
>
> Are you running any custom modules (either in C or Python)? Would it
> be possible for you to build gmond and libganglia with debugging
> symbols and generate a core dump?
>
> --dho
>
>
> 2015-11-23 1:29 GMT-08:00 Cristovao Cordeiro <[email protected]>:
>
> Hi Devon,
>
> thanks for the help.
> Attached follows the binary file.
>
> btw, what is the difference between so.0 and so.0.0.0?
>
> Cumprimentos / Best regards,
> Cristóvão José Domingues Cordeiro
>
>
> On 17 Nov 2015, at 19:16, Devon H. O'Dell <[email protected]> wrote:
>
> Hi! Very sorry about this, I had a draft that I thought I had sent.
>
> Could you email me your libganglia.so binary off-list?
> Alternatively,
> do you have the ability to compile libganglia with debugging symbols?
>
> 2015-11-17 1:56 GMT-08:00 Cristovao Cordeiro <[email protected]>:
>
> Hi everyone,
>
> any news on this?
> Another symptom is that this happens about as often as the cluster changes,
> meaning that the more activity there is in the cluster (delete machines,
> create...) the more this issue happens. Could it be related to the
> deletion of old hosts by gmond causing gmetad to try to access files that
> are already gone?
>
> Cumprimentos / Best regards,
> Cristóvão José Domingues Cordeiro
>
>
> ________________________________________
> From: Cristovao Cordeiro [[email protected]]
> Sent: 09 November 2015 13:40
> To: Devon H. O'Dell
> Cc: [email protected]
> Subject: Re: [Ganglia-general] gmetad segmentation fault
>
> Hi Devon,
>
> thanks!
>
> * I don't think there was a core dump. At least that is not stated in
> /var/log/messages and I don't find anything relevant in /var/spool/abrt/
> * I am running 3.7.1
> * The addr2line returns ??:0. Also with gdb:
>
> gdb /usr/lib64/libganglia.so.0.0.0
>
> ...
> Reading symbols from /usr/lib64/libganglia.so.0.0.0...(no debugging symbols found)...done.
>
> Some more information about my setup:
> - I am running several gmonds in the same machine, so all my data_sources
> are to localhost.
>
> Cumprimentos / Best regards,
> Cristóvão José Domingues Cordeiro
>
>
> ________________________________________
> From: Devon H. O'Dell [[email protected]]
> Sent: 09 November 2015 13:12
> To: Cristovao Cordeiro
> Cc: [email protected]
> Subject: Re: [Ganglia-general] gmetad segmentation fault
>
> Hi!
>
> I have a couple of initial questions that might help figure out the problem:
>
> * Did you get a core dump?
> * What version of ganglia are you running?
> * This crash happened within libganglia.so at offset 0xb7b0. Can you run:
>
> $ addr2line -e /path/to/libganglia.so.0.0.0 0xb7b0
>
> and paste the output?
> If that does not work, there are a couple of other
> things we can try to get information about the fault, but hopefully we
> can just work from there.
>
> Kind regards,
>
> Devon H. O'Dell
>
> 2015-11-09 0:13 GMT-08:00 Cristovao Cordeiro <[email protected]>:
>
> Dear all,
>
> I have several Ganglia monitors running with similar configurations in
> different machines (VMs) and for a long time now I have been experiencing
> segmentation faults at random times. It seems to happen more on gmetads that
> are monitoring a larger number of nodes.
>
> In /var/log/messages I see:
>
> kernel: gmetad[3948]: segfault at 0 ip 0000003630c0b7b0 sp 00007f0ecbffebc0 error 4 in libganglia.so.0.0.0[3630c00000+15000]
>
>
> and in the console output there's only this:
>
> /bin/bash: line 1: 30375 Terminated /usr/sbin/gmetad
>
> [FAILED]
>
>
> gmetad does not have any special configuration besides the RRD location,
> which is on a 4Gb ramdisk.
>
>
> Cumprimentos / Best regards,
> Cristóvão José Domingues Cordeiro
>
> --
> Marcello Morgotti
> System and Technologies Department
> CINECA - Via Magnanelli 6/3, 40033 Casalecchio di Reno (Bologna)-Italy
> Tel: +39 051 6171589 Fax: +39 051 6132198
> email: [email protected]
> http://www.cineca.it
------------------------------------------------------------------------------
_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general
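
For readers following the thread: the offset 0xb7b0 passed to addr2line is the faulting instruction pointer minus the base address of the libganglia mapping, both of which appear in the kernel log line. Below is a minimal Python sketch of that arithmetic. The function name fault_offset and the regular expression are illustrative assumptions, not part of Ganglia or of any tool mentioned in the thread, and the resulting offset is only meaningful to addr2line when the library's first segment is mapped from file offset 0 (the usual case, but not guaranteed).

```python
import re

# Hypothetical helper, not part of Ganglia. It accepts both kernel formats
# quoted in this thread:
#   "segfault at 0 ip 0000003630c0b7b0 ... in lib.so[3630c00000+15000]"
#   "general protection ip:7fd70d62f82c ... in lib.so[7fd70d624000+14000]"
FAULT_RE = re.compile(
    r"\bip[: ](?P<ip>[0-9a-f]+)"        # faulting instruction pointer
    r".*?\bin (?P<obj>\S+?)"            # name of the mapped object
    r"\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"  # mapping base + size
)


def fault_offset(log_line):
    """Return (object_name, offset) where offset = ip - mapping base."""
    match = FAULT_RE.search(log_line)
    if match is None:
        raise ValueError("no kernel fault record found in line")
    ip = int(match.group("ip"), 16)
    base = int(match.group("base"), 16)
    size = int(match.group("size"), 16)
    offset = ip - base
    # Sanity check: the faulting address must lie inside the mapping.
    if not 0 <= offset < size:
        raise ValueError("instruction pointer is outside the mapping")
    return match.group("obj"), offset


if __name__ == "__main__":
    line = ("kernel: gmetad[3948]: segfault at 0 ip 0000003630c0b7b0 "
            "sp 00007f0ecbffebc0 error 4 in "
            "libganglia.so.0.0.0[3630c00000+15000]")
    name, off = fault_offset(line)
    print(name, hex(off))
```

Applied to the two CentOS 7 lines quoted above, the same subtraction gives offset 0xb82c on both rp01 and rp02, so both hosts faulted at the same instruction of their libganglia build.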

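
The loop in the quoted disassembly (load one byte, xor it into an accumulator, multiply by 0x100000001b3) has the shape of the 64-bit FNV-1a hash, which is presumably what the inlined hash_key computes. The Python sketch below mirrors that loop to show why a NULL key->data faults on the very first byte read. The starting value of the accumulator is elided in the listing, so the standard FNV-1a offset basis is an assumption here, and None merely stands in for a NULL pointer; this is not Ganglia's code.

```python
FNV64_PRIME = 0x100000001B3              # constant loaded by movabs in the listing
FNV64_OFFSET_BASIS = 0xCBF29CE484222325  # assumption: standard FNV-1a 64-bit basis
MASK64 = (1 << 64) - 1                   # emulate 64-bit register wraparound


def fnv1a_64(data):
    """Hash a bytes object the way the disassembled loop appears to."""
    if data is None:
        # In the C code there is no such check: the first movzbl (%rax),%ecx
        # reads address 0 and the kernel reports "segfault at 0".
        raise ValueError("key data is NULL")
    h = FNV64_OFFSET_BASIS
    for byte in data:                    # movzbl (%rax),%ecx ; add $0x1,%rax
        h ^= byte                        # xor %rcx,%rdx
        h = (h * FNV64_PRIME) & MASK64   # imul %rdi,%rdx
    return h


if __name__ == "__main__":
    print(hex(fnv1a_64(b"example-host")))
```

The explicit None check is the Python analogue of a guard the C loop does not have, which is consistent with the reading that a key was freed or cleared by another thread before hash_delete ran.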
