It's just the usual versioning scheme for shared libraries. Usually
.so is a symlink to .so.0, which is in turn a symlink to .so.0.0.0.
The scheme is intended for ABI versioning, but it's not used all that
often. Are these actually different files on your system?
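
One way to check whether two of these names are really the same file is
to compare device and inode numbers, since stat() follows symlinks. A
minimal sketch in C; same_file is a hypothetical helper name used only
for illustration, not something from ganglia:

```c
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if both paths resolve to the same file,
 * 0 if they resolve to different files, and -1 if either path cannot be
 * stat()ed. stat() follows symlinks, so a .so -> .so.0 -> .so.0.0.0
 * chain compares equal to its final target (as would a hard link). */
int same_file(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Calling it with the .so.0 and .so.0.0.0 paths should return 1 if the
first is only a link to the second, and 0 if they are separate files.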

The crash is in hash_delete:

0000003b2c00b780 <hash_delete>:
...
  3b2c00b797:   48 8b 07                mov    (%rdi),%rax
  3b2c00b79a:   48 8d 34 30             lea    (%rax,%rsi,1),%rsi
  3b2c00b79e:   48 39 f0                cmp    %rsi,%rax
  3b2c00b7a1:   73 37                   jae    3b2c00b7da <hash_delete+0x5a>
  3b2c00b7a3:   48 bf b3 01 00 00 00    movabs $0x100000001b3,%rdi
  3b2c00b7aa:   01 00 00
  3b2c00b7ad:   0f 1f 00                nopl   (%rax)
>>>  3b2c00b7b0:   0f b6 08                movzbl (%rax),%ecx
  3b2c00b7b3:   48 83 c0 01             add    $0x1,%rax
  3b2c00b7b7:   48 31 ca                xor    %rcx,%rdx
  3b2c00b7ba:   48 0f af d7             imul   %rdi,%rdx
  3b2c00b7be:   48 39 c6                cmp    %rax,%rsi
  3b2c00b7c1:   77 ed                   ja     3b2c00b7b0 <hash_delete+0x30>
...

%rdi holds the first argument to the function, the datum_t *key, so
the mov at the top loads key->data into %rax, and the marked movzbl
reads the first byte of that data. hash_key has been inlined here.
What appears to be happening is that some key has already been removed
from the hash table and freed, and based on your description of the
problem, that removal was attempted concurrently. Your kernel log
shows a segfault at address 0, i.e. a NULL pointer dereference, so it
would appear that key->data is NULL.
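
For reference, the loop in the disassembly above looks like a 64-bit
FNV-1a hash: the movabs constant is the 64-bit FNV prime, and each byte
is XORed in before the multiply. Below is a rough C sketch of that loop.
The struct layout, the "size" member name, the names datum_sketch_t and
fnv1a_64_sketch, the offset basis and the NULL guard are all assumptions
for illustration, not the libganglia source:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: the data pointer comes first, which is why
 * "mov (%rdi),%rax" can load key->data directly from the key pointer. */
typedef struct {
    void  *data;
    size_t size;
} datum_sketch_t;

#define FNV1A_64_OFFSET_BASIS 0xcbf29ce484222325ULL /* standard value, assumed */
#define FNV1A_64_PRIME        0x100000001b3ULL      /* the movabs constant */

/* Stores the hash in *out and returns 0, or returns -1 if the key cannot
 * be hashed. The guard is not in the disassembly; it marks the case that
 * faults there (NULL data with a non-zero size). */
int fnv1a_64_sketch(const datum_sketch_t *key, uint64_t *out)
{
    if (key == NULL || out == NULL || (key->data == NULL && key->size != 0))
        return -1;

    const unsigned char *p = key->data;
    uint64_t h = FNV1A_64_OFFSET_BASIS;

    for (size_t i = 0; i < key->size; i++) {
        h ^= p[i];            /* movzbl (%rax),%ecx ; xor %rcx,%rdx */
        h *= FNV1A_64_PRIME;  /* imul %rdi,%rdx */
    }
    *out = h;
    return 0;
}
```

Without the guard, the first p[i] read is the load at the marked
instruction, which is where a NULL key->data would fault.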

Without a backtrace it is not clear which kind of key is involved
here, but perhaps someone else has some context based on recent
changes. (I don't think this is related to my work on the hashes.)

Are you running any custom modules (either in C or Python)? Would it
be possible for you to build gmond and libganglia with debugging
symbols and generate a core dump?

--dho


2015-11-23 1:29 GMT-08:00 Cristovao Cordeiro <cristovao.corde...@cern.ch>:
> Hi Devon,
>
> thanks for the help.
> Attached follows the binary file.
>
> btw, what is the difference between so.0 and so.0.0.0?
>
> Cumprimentos / Best regards,
> Cristóvão José Domingues Cordeiro
>
>
> On 17 Nov 2015, at 19:16, Devon H. O'Dell <devon.od...@gmail.com> wrote:
>
> Hi! Very sorry about this, I had a draft that I thought I had sent.
>
> Could you email me your libganglia.so binary off-list? Alternatively,
> do you have the ability to compile libganglia with debugging symbols?
>
> 2015-11-17 1:56 GMT-08:00 Cristovao Cordeiro <cristovao.corde...@cern.ch>:
>
> Hi everyone,
>
> any news on this?
> Another symptom is that this happens about as often as the cluster changes,
> meaning that the more activity there is in the cluster (machines deleted,
> created...) the more this issue happens. Could it be related to the
> deletion of old hosts by gmond causing gmetad to try to access files that
> are already gone?
>
> Cumprimentos / Best regards,
> Cristóvão José Domingues Cordeiro
>
>
> ________________________________________
> From: Cristovao Cordeiro [cristovao.corde...@cern.ch]
> Sent: 09 November 2015 13:40
> To: Devon H. O'Dell
> Cc: Ganglia-general@lists.sourceforge.net
> Subject: Re: [Ganglia-general] gmetad segmentation fault
>
> Hi Devon,
>
> thanks!
>
> * I don't think there was a core dump. At least that is not stated in
> /var/log/messages and I don't find anything relevant in /var/spool/abrt/
> * I am running 3.7.1
> * The addr2line returns ??:0. Also with gdb:
>
> gdb /usr/lib64/libganglia.so.0.0.0
>
>   ...
>   Reading symbols from /usr/lib64/libganglia.so.0.0.0...(no debugging symbols found)...done.
>
> Some more information about my setup:
> - I am running several gmonds on the same machine, so all my data_sources
> point to localhost.
>
> Cumprimentos / Best regards,
> Cristóvão José Domingues Cordeiro
>
>
> ________________________________________
> From: Devon H. O'Dell [devon.od...@gmail.com]
> Sent: 09 November 2015 13:12
> To: Cristovao Cordeiro
> Cc: Ganglia-general@lists.sourceforge.net
> Subject: Re: [Ganglia-general] gmetad segmentation fault
>
> Hi!
>
> I have a couple of initial questions that might help figure out the problem:
>
> * Did you get a core dump?
> * What version of ganglia are you running?
> * This crash happened within libganglia.so at offset 0xb7b0. Can you run:
>
> $ addr2line -e /path/to/libganglia.so.0.0.0 0xb7b0
>
> and paste the output? If that does not work, there are a couple other
> things we can try to get information about the fault, but hopefully we
> can just work from there.
>
> Kind regards,
>
> Devon H. O'Dell
>
> 2015-11-09 0:13 GMT-08:00 Cristovao Cordeiro <cristovao.corde...@cern.ch>:
>
> Dear all,
>
> I have several Ganglia monitors running with similar configurations on
> different machines (VMs), and for a long time now I have been experiencing
> segmentation faults at random times. It seems to happen more on gmetads that
> are monitoring a larger number of nodes.
>
> In /var/log/messages I see:
>
> kernel: gmetad[3948]: segfault at 0 ip 0000003630c0b7b0 sp 00007f0ecbffebc0 error 4 in libganglia.so.0.0.0[3630c00000+15000]
>
>
> and in the console output there's only this:
>
> /bin/bash: line 1: 30375 Terminated              /usr/sbin/gmetad
>
>                                                           [FAILED]
>
>
> gmetad does not have any special configuration besides the RRD location,
> which is on a 4 GB ramdisk.
>
>
> Cumprimentos / Best regards,
> Cristóvão José Domingues Cordeiro
>
>
> ------------------------------------------------------------------------------
> Presto, an open source distributed SQL query engine for big data, initially
> developed by Facebook, enables you to easily query your data on Hadoop in a
> more interactive manner. Teradata is also now providing full enterprise
> support for Presto. Download a free open source copy now.
> http://pubads.g.doubleclick.net/gampad/clk?id=250295911&iu=/4140
> _______________________________________________
> Ganglia-general mailing list
> Ganglia-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/ganglia-general
>
>
