Re: [Ganglia-general] gmetad segmentation fault
Unfortunately, without a core dump or a backtrace where debug symbols are present, I'm not going to be able to offer any additional insight. Are you running any C and/or Python modules with gmetad?

--dho

2015-12-11 5:54 GMT-08:00 Cristovao Cordeiro:
> Hi guys,
>
> just to update on this:
> - I've removed my ganglia-gmetad/gmond and libganglia from everywhere and
> installed the most recent versions from the EPEL repository. The error is
> still there.
>
> Cumprimentos / Best regards,
> Cristóvão José Domingues Cordeiro
>
> From: Cristovao Cordeiro
> Sent: 08 December 2015 11:49
> To: Marcello Morgotti
> Cc: ganglia-general@lists.sourceforge.net
> Subject: Re: [Ganglia-general] gmetad segmentation fault
>
> Hi everyone,
>
> sorry for the late reply.
> @Devon
> thanks for looking into it.
> I do have .so.0 and .so.0.0.0 on my system and I am not using any custom
> modules. The Ganglia deployment is, however, a bit different from the
> standard:
> - in one single VM, gmetad is running (always) and several gmond daemons
> are running in the background (daemon gmond -c /etc/ganglia/gmond_N.conf),
> all receiving metrics through unicast.
> The Ganglia package is built by me as well, from the source code. I am
> currently building and using Ganglia 3.7.1 (taken from
> http://sourceforge.net/projects/ganglia/files/ganglia%20monitoring%20core/3.7.1/).
> I build the Ganglia RPM myself for 2 reasons:
> 1 - to have Ganglia available in YUM
> 2 - minor changes to ganglia-web's apache.conf
>
> I have other monitors running 3.6.0 and no errors there. But on those I
> have installed Ganglia manually and directly, without building an RPM.
>
> I also see 3.7.2 already available in the EPEL repository, so I might try
> this.
>
> Regarding the compilation with debug symbols…
>
> @Marcello
> did you get a chance to do it?
> Best regards,
> Cristóvão José Domingues Cordeiro
>
> On 24 Nov 2015, at 18:51, Marcello Morgotti wrote:
>
> Hello,
>
> I'd like to join the discussion because this problem is affecting us as
> well. We have the problem on two different installations:
>
> 2 servers in active-active HA configuration, each with CentOS 7.1 +
> ganglia 3.7.2 + rrdcached, monitoring systems A, B, C, D
> 2 servers in active-active HA configuration, each with RedHat 6.5 +
> ganglia 3.7.2 + rrdcached, monitoring systems E, F, G, H
>
> In both cases the ganglia RPM packages are taken from the EPEL repository.
> The curious thing is that every time the segfault happens, it happens at
> almost the same time on both servers.
> I.e. for the CentOS 7 systems:
>
> Nov 15 12:27:35 rp02 kernel: traps: gmetad[2620] general protection
> ip:7fd70d62f82c sp:7fd6fdcb3af0 error:0 in
> libganglia.so.0.0.0[7fd70d624000+14000]
> Nov 15 12:27:35 rp02 systemd: gmetad.service: main process exited,
> code=killed, status=11/SEGV
> Nov 15 12:27:35 rp02 systemd: Unit gmetad.service entered failed state.
> Nov 15 12:27:41 rp01 kernel: traps: gmetad[6977] general protection
> ip:7fc1bdde582c sp:7fc1ae469af0 error:0 in
> libganglia.so.0.0.0[7fc1bddda000+14000]
> Nov 15 12:27:41 rp01 systemd: gmetad.service: main process exited,
> code=killed, status=11/SEGV
> Nov 15 12:27:41 rp01 systemd: Unit gmetad.service entered failed state.
>
> Hope this helps and adds information. I will try to build a debug
> version of gmetad to see if it's possible to generate a core dump.
>
> Best Regards,
> Marcello
>
> On 23/11/2015 17:30, Devon H. O'Dell wrote:
>
> It's just a system versioning thing for shared libraries. Usually .so
> is a soft link to .so.0, which is a soft link to .so.0.0.0. This is
> intended to be an ABI versioning interface, but it's not super
> frequently used. Are these legitimately different files on your
> system?
>
> The crash is in hash_delete:
>
> 003b2c00b780 <hash_delete>:
> ...
> 3b2c00b797: 48 8b 07                mov    (%rdi),%rax
> 3b2c00b79a: 48 8d 34 30             lea    (%rax,%rsi,1),%rsi
> 3b2c00b79e: 48 39 f0                cmp    %rsi,%rax
> 3b2c00b7a1: 73 37                   jae    3b2c00b7da
> 3b2c00b7a3: 48 bf b3 01 00 00 00    movabs $0x100000001b3,%rdi
> 3b2c00b7aa: 01 00 00
> 3b2c00b7ad: 0f 1f 00                nopl   (%rax)
> >>> 3b2c00b7b0: 0f b6 08            movzbl (%rax),%ecx
> 3b2c00b7b3: 48 83 c0 01             add    $0x1,%rax
> 3b2c00b7b7: 48 31 ca                xor    %rcx,%rdx
> 3b2c00b7ba: 48 0f af d7             imul   %rdi,%rdx
> 3b2c00b7be: 48 39 c6                cmp    %rax,%rsi
> 3b2c00b7c1: 77 ed                   ja     3b2c00b7b0
> ...
>
> %rdi is the first argument to the function, so %rax is the datum_t
> *key, and (%rax) is key->data. hash_key has been inlined here.
> Unfortunately, what appears to be happening is that some key has
> already been removed from the hash table and freed, and based on your
> description of the problem, that removal was attempted concurrently. Your
> kernel crash shows that we were trying to dereference a NULL pointer,
> so it would appear that key->data is NULL.
>
> Unfortunately, it is not clear without a backtrace what sort of key
> specifically is in question here, but perhaps someone else might have
> some context based on recent changes. (I don't think this is related
> to my work on the hashes.)
>
> Are you running any custom modules (either in C or Python)? Would it
> be possible for you to build gmond and libganglia with debugging
> symbols and generate a core dump?
>
> --dho
Re: [Ganglia-general] gmetad segmentation fault
Hi! Very sorry about this, I had a draft that I thought I had sent.

Could you email me your libganglia.so binary off-list? Alternatively, do you have the ability to compile libganglia with debugging symbols?

2015-11-17 1:56 GMT-08:00 Cristovao Cordeiro:

Hi everyone,

any news on this?
Another symptom is that this happens about as often as the cluster changes, meaning that the more activity there is in the cluster (deleting machines, creating...) the more often this issue happens. Could it be related to the deletion of old hosts by gmond causing gmetad to try to access files that are already gone?

Cumprimentos / Best regards,
Cristóvão José Domingues Cordeiro

From: Cristovao Cordeiro [cristovao.corde...@cern.ch]
Sent: 09 November 2015 13:40
To: Devon H. O'Dell
Cc: Ganglia-general@lists.sourceforge.net
Subject: Re: [Ganglia-general] gmetad segmentation fault

Hi Devon,

thanks!

* I don't think there was a core dump. At least it is not mentioned in /var/log/messages and I don't find anything relevant in /var/spool/abrt/
* I am running 3.7.1
* addr2line returns ??:0. Also with gdb:

gdb /usr/lib64/libganglia.so.0.0.0
...
Reading symbols from /usr/lib64/libganglia.so.0.0.0...(no debugging symbols found)...done.

Some more information about my setup:
- I am running several gmonds on the same machine, so all my data_sources point to localhost.

Cumprimentos / Best regards,
Cristóvão José Domingues Cordeiro

From: Devon H. O'Dell [devon.od...@gmail.com]
Sent: 09 November 2015 13:12
To: Cristovao Cordeiro
Cc: Ganglia-general@lists.sourceforge.net
Subject: Re: [Ganglia-general] gmetad segmentation fault

Hi!

I have a couple of initial questions that might help figure out the problem:

* Did you get a core dump?
* What version of ganglia are you running?
* This crash happened within libganglia.so at offset 0xb7b0.
Can you run:

$ addr2line -e /path/to/libganglia.so.0.0.0 0xb7b0

and paste the output? If that does not work, there are a couple of other things we can try to get information about the fault, but hopefully we can just work from there.

Kind regards,

Devon H. O'Dell

2015-11-09 0:13 GMT-08:00 Cristovao Cordeiro:

Dear all,

I have several Ganglia monitors running with similar configurations on different machines (VMs), and for a long time now I have been experiencing segmentation faults at random times. It seems to happen more on gmetads that are monitoring a larger number of nodes.

In /var/log/messages I see:

kernel: gmetad[3948]: segfault at 0 ip 003630c0b7b0 sp 7f0ecbffebc0 error 4 in libganglia.so.0.0.0[3630c0+15000]

and in the console output there's only this:

/bin/bash: line 1: 30375 Terminated /usr/sbin/gmetad [FAILED]

gmetad does not have any special configuration besides the RRD location, which is on a 4 GB ramdisk.

Cumprimentos / Best regards,
Cristóvão José Domingues Cordeiro

___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] gmetad segmentation fault
It's just a system versioning thing for shared libraries. Usually .so is a soft link to .so.0, which is a soft link to .so.0.0.0. This is intended to be an ABI versioning interface, but it's not super frequently used. Are these legitimately different files on your system?

The crash is in hash_delete:

003b2c00b780 <hash_delete>:
...
3b2c00b797: 48 8b 07                mov    (%rdi),%rax
3b2c00b79a: 48 8d 34 30             lea    (%rax,%rsi,1),%rsi
3b2c00b79e: 48 39 f0                cmp    %rsi,%rax
3b2c00b7a1: 73 37                   jae    3b2c00b7da
3b2c00b7a3: 48 bf b3 01 00 00 00    movabs $0x100000001b3,%rdi
3b2c00b7aa: 01 00 00
3b2c00b7ad: 0f 1f 00                nopl   (%rax)
>>> 3b2c00b7b0: 0f b6 08            movzbl (%rax),%ecx
3b2c00b7b3: 48 83 c0 01             add    $0x1,%rax
3b2c00b7b7: 48 31 ca                xor    %rcx,%rdx
3b2c00b7ba: 48 0f af d7             imul   %rdi,%rdx
3b2c00b7be: 48 39 c6                cmp    %rax,%rsi
3b2c00b7c1: 77 ed                   ja     3b2c00b7b0
...

%rdi is the first argument to the function, so %rax is the datum_t *key, and (%rax) is key->data. hash_key has been inlined here. Unfortunately, what appears to be happening is that some key has already been removed from the hash table and freed, and based on your description of the problem, that removal was attempted concurrently. Your kernel crash shows that we were trying to dereference a NULL pointer, so it would appear that key->data is NULL.

Unfortunately, it is not clear without a backtrace what sort of key specifically is in question here, but perhaps someone else might have some context based on recent changes. (I don't think this is related to my work on the hashes.)

Are you running any custom modules (either in C or Python)? Would it be possible for you to build gmond and libganglia with debugging symbols and generate a core dump?

--dho

2015-11-23 1:29 GMT-08:00 Cristovao Cordeiro:
> Hi Devon,
>
> thanks for the help.
> Attached follows the binary file.
>
> btw, what is the difference between .so.0 and .so.0.0.0?
>
> Cumprimentos / Best regards,
> Cristóvão José Domingues Cordeiro
>
> On 17 Nov 2015, at 19:16, Devon H. O'Dell wrote:
>
> Hi!
> Very sorry about this, I had a draft that I thought I had sent.
>
> Could you email me your libganglia.so binary off-list? Alternatively,
> do you have the ability to compile libganglia with debugging symbols?
>
> 2015-11-17 1:56 GMT-08:00 Cristovao Cordeiro:
>
> Hi everyone,
>
> any news on this?
> Another symptom is that this happens about as often as the cluster changes,
> meaning that the more activity there is in the cluster (deleting machines,
> creating...) the more often this issue happens. Could it be related to the
> deletion of old hosts by gmond causing gmetad to try to access files that
> are already gone?
>
> Cumprimentos / Best regards,
> Cristóvão José Domingues Cordeiro
>
> From: Cristovao Cordeiro [cristovao.corde...@cern.ch]
> Sent: 09 November 2015 13:40
> To: Devon H. O'Dell
> Cc: Ganglia-general@lists.sourceforge.net
> Subject: Re: [Ganglia-general] gmetad segmentation fault
>
> Hi Devon,
>
> thanks!
>
> * I don't think there was a core dump. At least it is not mentioned in
> /var/log/messages and I don't find anything relevant in /var/spool/abrt/
> * I am running 3.7.1
> * addr2line returns ??:0. Also with gdb:
>
> gdb /usr/lib64/libganglia.so.0.0.0
> ...
> Reading symbols from /usr/lib64/libganglia.so.0.0.0...(no debugging
> symbols found)...done.
>
> Some more information about my setup:
> - I am running several gmonds on the same machine, so all my data_sources
> point to localhost.
>
> Cumprimentos / Best regards,
> Cristóvão José Domingues Cordeiro
>
> ____
> From: Devon H. O'Dell [devon.od...@gmail.com]
> Sent: 09 November 2015 13:12
> To: Cristovao Cordeiro
> Cc: Ganglia-general@lists.sourceforge.net
> Subject: Re: [Ganglia-general] gmetad segmentation fault
>
> Hi!
>
> I have a couple of initial questions that might help figure out the problem:
>
> * Did you get a core dump?
> * What version of ganglia are you running?
> * This crash happened within libganglia.so at offset 0xb7b0.
> Can you run:
>
> $ addr2line -e /path/to/libganglia.so.0.0.0 0xb7b0
>
> and paste the output? If that does not work, there are a couple of other
> things we can try to get information about the fault, but hopefully we
> can just work from there.
>
> Kind regards,
>
> Devon H. O'Dell
>
> 2015-11-09 0:13 GMT-08:00 Cristovao Cordeiro:
>
> Dear all,
>
> I have several Ganglia monitors running with similar configurations on
> different machines (VMs), and for a long time now I have been experiencing
> segmentation faults at random times. It seems to happen more on gmetads
> that are monitoring a larger number of nodes.