Unfortunately, without a coredump or backtrace where debug symbols are
present, I'm not going to be able to offer any additional insight.

Are you running any C and/or Python modules with gmetad?

--dho

2015-12-11 5:54 GMT-08:00 Cristovao Cordeiro <[email protected]>:
> Hi guys,
>
> just to update on this:
>  - I've removed my ganglia-gmetad/gmond and libganglia from everywhere and
> installed the most recent versions from the EPEL repository. The error is
> still there.
>
> Cumprimentos / Best regards,
> Cristóvão José Domingues Cordeiro
>
> ________________________________
> From: Cristovao Cordeiro
> Sent: 08 December 2015 11:49
> To: Marcello Morgotti
> Cc: [email protected]
>
> Subject: Re: [Ganglia-general] gmetad segmentation fault
>
> Hi everyone,
>
> sorry for the late reply.
> @Devon
> thanks for looking into it.
> I do have .so.0 and .so.0.0.0 on my system and I am not using any custom
> modules. The Ganglia deployment is, however, a bit different from the
> standard:
>   - in one single VM, gmetad is always running, and several gmond daemons
> run in the background (daemon gmond -c /etc/ganglia/gmond_N.conf),
> all receiving metrics via unicast.
> The Ganglia package is built by me as well, from the source code. I am
> currently building and using Ganglia 3.7.1 (taken from
> http://sourceforge.net/projects/ganglia/files/ganglia%20monitoring%20core/3.7.1/).
> I build the Ganglia RPM myself for two reasons:
> 1 - to have Ganglia available in YUM
> 2 - to make minor changes to ganglia-web's apache.conf
>
> I have other monitors running 3.6.0 with no errors there. But on those I
> installed Ganglia manually and directly, without building an RPM.
>
> I also see 3.7.2 already available in the EPEL repository, so I might try
> that.
>
> Regarding the compilation with debug symbols…
>
> @Marcello
> did you get a chance to do it?
>
>
> Best regards,
> Cristóvão José Domingues Cordeiro
>
>
>
>
> On 24 Nov 2015, at 18:51, Marcello Morgotti <[email protected]> wrote:
>
> Hello,
>
> I'd like to join the discussion because this problem is affecting us as
> well. We have the problem on two different installations:
>
> 2 servers in an active-active HA configuration, each with CentOS 7.1 +
> ganglia 3.7.2 + rrdcached, monitoring systems A,B,C,D
> 2 servers in an active-active HA configuration, each with RedHat 6.5 +
> ganglia 3.7.2 + rrdcached, monitoring systems E,F,G,H
>
> In both cases the ganglia RPM packages are taken from the EPEL repository.
> The curious thing is that whenever the segfault happens, it happens on both
> servers at almost the same time. E.g. for the CentOS 7 systems:
>
> Nov 15 12:27:35 rp02 kernel: traps: gmetad[2620] general protection
> ip:7fd70d62f82c sp:7fd6fdcb3af0 error:0 in
> libganglia.so.0.0.0[7fd70d624000+14000]
> Nov 15 12:27:35 rp02 systemd: gmetad.service: main process exited,
> code=killed, status=11/SEGV
> Nov 15 12:27:35 rp02 systemd: Unit gmetad.service entered failed state.
> Nov 15 12:27:41 rp01 kernel: traps: gmetad[6977] general protection
> ip:7fc1bdde582c sp:7fc1ae469af0 error:0 in
> libganglia.so.0.0.0[7fc1bddda000+14000]
> Nov 15 12:27:41 rp01 systemd: gmetad.service: main process exited,
> code=killed, status=11/SEGV
> Nov 15 12:27:41 rp01 systemd: Unit gmetad.service entered failed state.
>
>
> Hope this helps and adds information. I will try to build a debug
> version of gmetad to see if it's possible to generate a core dump.
>
> Best Regards,
> Marcello
>
> On 23/11/2015 17:30, Devon H. O'Dell wrote:
>
> It's just a system versioning thing for shared libraries. Usually .so
> is a soft link to .so.0 which is a soft link to .so.0.0.0. This is
> intended to be an ABI versioning interface, but it's not super
> frequently used. Are these legitimately different files on your
> system?
>
> The crash is in hash_delete:
>
> 0000003b2c00b780 <hash_delete>:
> ...
>   3b2c00b797:   48 8b 07                mov    (%rdi),%rax
>   3b2c00b79a:   48 8d 34 30             lea    (%rax,%rsi,1),%rsi
>   3b2c00b79e:   48 39 f0                cmp    %rsi,%rax
>   3b2c00b7a1:   73 37                   jae    3b2c00b7da <hash_delete+0x5a>
>   3b2c00b7a3:   48 bf b3 01 00 00 00    movabs $0x100000001b3,%rdi
>   3b2c00b7aa:   01 00 00
>   3b2c00b7ad:   0f 1f 00                nopl   (%rax)
>   3b2c00b7b0:   0f b6 08                movzbl (%rax),%ecx   <-- faulting instruction (0xb7b0)
>   3b2c00b7b3:   48 83 c0 01             add    $0x1,%rax
>   3b2c00b7b7:   48 31 ca                xor    %rcx,%rdx
>   3b2c00b7ba:   48 0f af d7             imul   %rdi,%rdx
>   3b2c00b7be:   48 39 c6                cmp    %rax,%rsi
>   3b2c00b7c1:   77 ed                   ja     3b2c00b7b0 <hash_delete+0x30>
> ...
>
> %rdi is the first argument to the function, the datum_t *key; the initial
> mov loads key->data into %rax, and the movzbl reads the data byte by byte.
> hash_key has been inlined here.
> Unfortunately, what appears to be happening is that some key has
> already been removed from the hash table and freed, and based on your
> description of the problem, that removal happened concurrently. Your
> kernel log shows that we were trying to dereference a NULL pointer,
> so it would appear that key->data is NULL.
>
> Unfortunately, it is not clear without a backtrace what sort of key
> specifically is in question here, but perhaps someone else might have
> some context based on recent changes. (I don't think this is related
> to my work on the hashes).
>
> Are you running any custom modules (either in C or Python)? Would it
> be possible for you to build gmond and libganglia with debugging
> symbols and generate a core dump?
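> In case it helps, on Linux generating a core dump usually just requires
> raising the soft core-file limit in the shell that launches the daemon
> (paths below are illustrative; where the core lands depends on
> kernel.core_pattern, and on systemd hosts coredumpctl may capture it instead):

```shell
# Allow core files in this shell before launching gmetad
# (soft limit; the hard limit must also permit it).
ulimit -c unlimited
ulimit -c

# After a crash, a backtrace can then be taken with, e.g.:
#   gdb /usr/sbin/gmetad /path/to/core
#   (gdb) bt full
```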
>
> --dho
>
>
> 2015-11-23 1:29 GMT-08:00 Cristovao Cordeiro <[email protected]>:
>
> Hi Devon,
>
> thanks for the help.
> Attached follows the binary file.
>
> By the way, what is the difference between .so.0 and .so.0.0.0?
>
> Cumprimentos / Best regards,
> Cristóvão José Domingues Cordeiro
>
>
> On 17 Nov 2015, at 19:16, Devon H. O'Dell <[email protected]> wrote:
>
> Hi! Very sorry about this, I had a draft that I thought I had sent.
>
> Could you email me your libganglia.so binary off-list? Alternatively,
> do you have the ability to compile libganglia with debugging symbols?
>
> 2015-11-17 1:56 GMT-08:00 Cristovao Cordeiro <[email protected]>:
>
> Hi everyone,
>
> any news on this?
> Another symptom is that the frequency seems to track cluster churn: the
> more activity there is in the cluster (machines deleted, created, ...), the
> more often this issue happens. Could it be related to the deletion of old
> hosts by gmond causing gmetad to try to access files that are already gone?
>
> Cumprimentos / Best regards,
> Cristóvão José Domingues Cordeiro
>
>
> ________________________________________
> From: Cristovao Cordeiro [[email protected]]
> Sent: 09 November 2015 13:40
> To: Devon H. O'Dell
> Cc: [email protected]
> Subject: Re: [Ganglia-general] gmetad segmentation fault
>
> Hi Devon,
>
> thanks!
>
> * I don't think there was a core dump. At least none is mentioned in
> /var/log/messages and I don't find anything relevant in /var/spool/abrt/
> * I am running 3.7.1
> * addr2line returns ??:0. Also with gdb:
>
> gdb /usr/lib64/libganglia.so.0.0.0
>
>   ...
>   Reading symbols from /usr/lib64/libganglia.so.0.0.0...(no debugging
> symbols found)...done.
>
> Some more information about my setup:
> - I am running several gmonds on the same machine, so all my data_sources
> point to localhost.
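> For reference, that layout corresponds to gmetad.conf data_source lines that
> all point at localhost on different ports, along these lines (cluster names
> and ports here are made up for illustration):

```
data_source "cluster_1" localhost:8649
data_source "cluster_2" localhost:8650
```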
>
> Cumprimentos / Best regards,
> Cristóvão José Domingues Cordeiro
>
>
> ________________________________________
> From: Devon H. O'Dell [[email protected]]
> Sent: 09 November 2015 13:12
> To: Cristovao Cordeiro
> Cc: [email protected]
> Subject: Re: [Ganglia-general] gmetad segmentation fault
>
> Hi!
>
> I have a couple of initial questions that might help figure out the problem:
>
> * Did you get a core dump?
> * What version of ganglia are you running?
> * This crash happened within libganglia.so at offset 0xb7b0. Can you run:
>
> $ addr2line -e /path/to/libganglia.so.0.0.0 0xb7b0
>
> and paste the output? If that does not work, there are a couple other
> things we can try to get information about the fault, but hopefully we
> can just work from there.
>
> Kind regards,
>
> Devon H. O'Dell
>
> 2015-11-09 0:13 GMT-08:00 Cristovao Cordeiro <[email protected]>:
>
> Dear all,
>
> I have several Ganglia monitors running with similar configurations on
> different machines (VMs), and for a long time now I have been experiencing
> segmentation faults at random times. It seems to happen more on gmetads that
> monitor a larger number of nodes.
>
> In /var/log/messages I see:
>
> kernel: gmetad[3948]: segfault at 0 ip 0000003630c0b7b0 sp 00007f0ecbffebc0
> error 4 in libganglia.so.0.0.0[3630c00000+15000]
>
>
> and in the console output there's only this:
>
> /bin/bash: line 1: 30375 Terminated              /usr/sbin/gmetad
>
>                                                           [FAILED]
>
>
> gmetad does not have any special configuration besides the RRD location,
> which is on a 4 GB ramdisk.
>
>
> Cumprimentos / Best regards,
> Cristóvão José Domingues Cordeiro
>
>
> _______________________________________________
> Ganglia-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/ganglia-general
>
>
> --
> Marcello Morgotti
> System and Technologies Department
> CINECA - Via Magnanelli 6/3, 40033 Casalecchio di Reno (Bologna)-Italy
> Tel:  +39 051 6171589 Fax: +39 051 6132198
> email: [email protected]
> http://www.cineca.it
>
>
>
>
>
>

