Re: [Ganglia-developers] Gmetad bottlenecks
Hi Devon,

I think now that we have the ability to define exactly which metrics should and should not be summarised, the issue of slow-downs due to metric summarisation can be managed.

If we are to look at redoing the XML parsing next, the two contenders that come to mind are gzipped JSON and Google Protocol Buffers. PB is meant to be very efficient and therefore faster; however, it seems people have gotten comparable results with gzipped JSON. An obvious advantage of gzipped JSON is that it would be simple to make the output human readable, though we could easily develop a CLI tool that allowed us to query and decode ganglia PB data for testing.

What do others think?

--Nick.

On Tue, Jan 14, 2014 at 4:42 PM, Devon H. O'Dell devon.od...@gmail.com wrote:

I don't personally have any objections, but if this remains a pain point, perhaps this is something we can address differently? I think where I left off, XML parsing was taking the most time; is that something that people are comfortable with changing (data format?)

--dho

2014/1/14 Nicholas Satterly nfsatte...@gmail.com:

Given the performance benefits gained by Devon's work I will revert the patch that attempted to speed up metric summaries because it's causing grid-of-grids to fail (unless there are any objections) ...

https://github.com/ganglia/monitor-core/commit/0705a5defa284e289004daf61ea390338719d5fb

--Nick.

On Tue, Dec 10, 2013 at 8:00 PM, Chris Burroughs chris.burrou...@gmail.com wrote:

On 12/08/2013 04:43 PM, Devon H. O'Dell wrote:

This is a simple `perf top -p $PID` on one of our gmetad nodes

    Samples: 1M of event 'cycles', Event count (approx.): 64115959770
     6.59%  libexpat.so.1.5.2          [.] 0x00011b8d
     4.77%  libganglia-3.6.0.so.0.0.0  [.] hashval
     2.62%  [kernel]                   [k] __d_lookup
     2.21%  [kernel]                   [k] _spin_lock
     2.14%  libc-2.12.so               [.] vfprintf
     1.61%  librrd.so.4.2.0            [.] process_arg
     1.54%  libganglia-3.6.0.so.0.0.0  [.] hash_lookup
     1.46%  [kernel]                   [k] __link_path_walk
     1.16%  libc-2.12.so               [.] __GI_strtod_l_internal
     1.11%  libc-2.12.so               [.] memcpy
     1.08%  libc-2.12.so               [.] _int_malloc

So I suppose my intuition about xml parsing expense is off. I have not used perf as much as I should; if we were seeing similar rrd writing contention, should I literally see stat near the top?

Ah, so to see what's really going on:

    perf record -e cpu-clock -g -p $PID

Let that run for a minute or two. Then:

    perf report --sort=comm,dso,symbol -G

If you don't have cpu-clock, cycles is OK, but you definitely are going to want to see the callgraph. The time in XML is mostly writing RRDs and you only see that digging down into the chain.

For the list, Devon and I spoke in #ganglia and the high occurrence of libexpat in this sample seems to be an artifact of missing debug symbols.

--
gpg: using PGP trust model
pub 4096R/1EE38BD9 2013-01-06 [expires: 2018-01-06]
Key fingerprint = 3EE9 550D D9D8 DB65 58C2 B58D CE78 EC6C 1EE3 8BD9
uid Nicholas Satterly (Debian Key) nfsatte...@gmail.com
sub 4096R/23804EE9 2013-01-06 [expires: 2018-01-06]

___
Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
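As a rough illustration of the gzipped JSON option discussed above, here is a minimal sketch of a round trip in Python. The field names (host, name, val, units, tn) and the sample data are assumptions made up for the example; they are not gmetad's actual schema, and nothing here measures real Ganglia payloads.

```python
import gzip
import json


def encode_metrics(metrics):
    """Serialize a list of metric dicts to gzip-compressed JSON bytes."""
    text = json.dumps(metrics, separators=(",", ":"), sort_keys=True)
    return gzip.compress(text.encode("utf-8"))


def decode_metrics(blob):
    """Reverse encode_metrics: gunzip, then parse the JSON text."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Hypothetical sample: 200 hosts each reporting one metric.
sample = [
    {"host": "node%03d" % i, "name": "load_one", "val": 0.25 * i,
     "units": "", "tn": 3}
    for i in range(200)
]

blob = encode_metrics(sample)
raw_size = len(json.dumps(sample).encode("utf-8"))
print("raw JSON bytes:", raw_size, "gzipped bytes:", len(blob))
```

The decompressed text stays human readable, which is the advantage mentioned in the thread; whether it is competitive with Protocol Buffers on real gmetad data would need measuring.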
Re: [Ganglia-developers] Gmetad bottlenecks
Given the performance benefits gained by Devon's work I will revert the patch that attempted to speed up metric summaries because it's causing grid-of-grids to fail (unless there are any objections) ...

https://github.com/ganglia/monitor-core/commit/0705a5defa284e289004daf61ea390338719d5fb

--Nick.
Re: [Ganglia-developers] Gmetad bottlenecks
[...] They are running on (censored) right now, and we'll leave them running for a while to make sure they're good before pushing the patches upstream.

In the process of doing this, I noticed that ganglia used a particularly poor method for reading its XML metrics from gmond: it initialized a 1024-byte buffer, read into it, and if it would overflow, it would realloc the buffer with an additional 1024 bytes and try reading again. When dealing with XML files many megabytes in size, this caused many unnecessary reallocations. I modified this code to start with a 128KB buffer and double the buffer size when it runs out of space. (I made a similar change to the code for decompressing gzip'ed data, which used a similar buffer sizing paradigm.)

After all these changes, both the interactive and RRD-writing processes spend most of their time in the hash table. I can continue improving Ganglia performance, but most of the low hanging fruit is now gone; at some point it will require:

* writing a version of librrd (this probably also means changing the rrd file format),
* replacing the hash table in Ganglia with one that performs better,
* changing the data serialization format from XML to one that is easier / faster to parse,
* using a different data structure than a hash table for metrics hierarchies (probably a tree with metrics stored at each level in contiguous memory and an index describing each metric at each level),
* refactoring gmetad and gmond into a single process that shares memory.

These are all longer-term projects, but I think that they'll probably eventually be useful.
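To put a number on the buffer-growth change described above, here is a small sketch that counts how many times a buffer has to be enlarged under each policy. The 8 MiB payload size is an assumption chosen for illustration (the message only says "many megabytes"), and the function models the growth policy only, not gmetad's actual read loop.

```python
def count_grows(total_bytes, initial, grow):
    """Count how many times a buffer must be enlarged to hold total_bytes."""
    capacity = initial
    grows = 0
    while capacity < total_bytes:
        capacity = grow(capacity)
        grows += 1
    return grows


KB = 1024
payload = 8 * KB * KB  # assumed size of one XML dump: 8 MiB

# Old policy: start at 1024 bytes, add 1024 bytes each time.
linear = count_grows(payload, 1 * KB, lambda c: c + 1 * KB)

# New policy: start at 128 KB, double each time.
doubling = count_grows(payload, 128 * KB, lambda c: c * 2)

print("fixed 1 KB increments:", linear)    # 8191
print("doubling from 128 KB:", doubling)   # 6
```

Each enlargement may copy the whole buffer, so the fixed-increment policy does work that grows quadratically with payload size, while doubling keeps the total copying proportional to the final size.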
Re: [Ganglia-developers] [Ganglia-general] Grid of Grids Broken Again in 3.6.0? Is this a different problem?
Hi Adam,

Our experience was that the summary RRDs were actually generated but then rarely updated. Only very occasionally would we see metrics suddenly get written to the RRD and only for a few intervals and then there would be large gaps again.

Do graphs based on the RRDs you are getting in your tests look right?

Regards,
Nick

On Fri, Nov 15, 2013 at 8:15 PM, Adam Compton acomp...@quantcast.com wrote:

Nicholas,

I'm the person who submitted #92. I've attempted to replicate the problem and I'm still seeing summary RRDs being written for the top grid in a grid-of-grids configuration (assuming you mean /var/lib/ganglia/rrds/__SummaryInfo__/*.rrd). Can you please share the configs you used to reproduce this issue? I'd like to fix the bug and submit a patch, but I don't know how to replicate the problem.

Thanks,
Adam

On 11/3/13 2:04 PM, Nicholas Satterly wrote:

Hi Bernard,

I think this is the bug in federation that you might be thinking of as I've mentioned it before. I don't have a fix for this. It's quite a large patch and I've never looked at this part of the codebase before.

Regards,
Nick

On Sun, Nov 3, 2013 at 5:10 PM, Bernard Li bern...@vanhpc.org wrote:

My $0.02 is that Grid of Grids (federation) is still a widely used feature so we should attempt to fix it.

Nick -- do you still have another outstanding pull request to fix a bug in federation? If so, what's the hold up? Just waiting for someone with authorization to accept it?

Thanks!
Bernard

On Sat, Nov 2, 2013 at 5:14 PM, Nicholas Satterly nfsatte...@gmail.com wrote:

I have confirmed that this patch [1] broke writing of the root summaries for the top-level gmetad when in a grid-of-grids setup.

What should we do? Revert the patch, attempt to debug it, or just log a github issue to track it for now?

Regards,
Nick

[1] https://github.com/ganglia/monitor-core/pull/92

On Tue, Sep 24, 2013 at 12:40 PM, Nicholas Satterly nfsatte...@gmail.com wrote:

Hi Illydth,

You might have missed that the pull request that added the break back also added more logic to the endElement_GRID() function to fix double-writing of the last cluster. So yes, that break is meant to be there again. See https://github.com/ganglia/monitor-core/pull/73

However, what isn't clear is why there is a new grid-of-grids problem. I suspect that it relates to this pull request but I haven't been able to confirm this yet. See https://github.com/ganglia/monitor-core/pull/92

Regards,
Nick

On Fri, Sep 20, 2013 at 7:41 PM, Douglas Wagner dougla...@gmail.com wrote:

So the last time I tried this upgrade thing (3.1.7 -> 3.4.0) I was getting no grid of grids information. Ran across the fix with the help of others on the list and documented it here: http://sourceforge.net/apps/phpbb/ganglia/viewtopic.php?f=4&t=16&p=28

So now I've upgraded from 3.4.0 to 3.6.0. I have 2 new clients (RHEL6) that I'm implementing. Went through the build process and built out RPMs for RHEL6. Turned on GMOND and I'm not seeing either of the two systems reporting into the associated GMETAD. The Web Interface isn't updating with the new boxes.

As I start going back through some of my past issues, I ran back across this where in 3.4.0 Grid of Grids was broken. And when I check the reported file and problem again I see the same old code (the break; at the end of the first switch block).

Is this broken again in 3.6? Or is this the correct code and I should be looking somewhere else for why my new RHEL6 clients aren't reporting to my GMETAD system?

--Illydth

___
Ganglia-general mailing list
ganglia-gene...@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-developers] Riemann pull request for Ganglia
Yep. Done.

On 12 Nov 2013, at 23:32, Bernard Li bern...@vanhpc.org wrote:

Hi Nick:

Cool -- do you think you can add a link to the page you created in the main Trac Wiki page? http://sourceforge.net/apps/trac/ganglia/wiki

Thanks,
Bernard

On Tue, Nov 12, 2013 at 1:40 PM, Nicholas Satterly nfsatte...@gmail.com wrote:

And just to close the loop... Ganglia now gets a mention on the Riemann website http://riemann.io/clients.html

--Nick

On Tue, Nov 12, 2013 at 11:06 AM, Nicholas Satterly nfsatte...@gmail.com wrote:

Thanks. Page added ... http://sourceforge.net/apps/trac/ganglia/wiki/riemann_integration

--Nick.

On Mon, Nov 11, 2013 at 10:50 PM, Bernard Li bern...@vanhpc.org wrote:

Fixed.

Cheers,
Bernard

On Mon, Nov 11, 2013 at 8:26 AM, Nicholas Satterly nfsatte...@gmail.com wrote:

Hi,

I've written a wiki page for trac/sourceforge but don't seem to have edit rights -- I can't see an Edit this Page button on any of the Ganglia wiki pages (eg. https://sourceforge.net/apps/trac/ganglia/wiki) even though I'm logged in as satterly.

If someone could fix this that would be great. If that's too hard feel free to add the page yourself, if you can (file attached). A link to it from the main page would be nice too.

Thanks,
Nick

On Fri, Nov 8, 2013 at 5:29 PM, Jeff Buchbinder rufustfire...@gmail.com wrote:

On Fri, Nov 8, 2013 at 12:27 PM, Bernard Li bern...@vanhpc.org wrote:

Jeff: I'm talking about the Wiki hosted at SourceForge. However I'm uncertain if that has been deprecated in favour of the new one on GitHub. Vlad?

I had been trying to migrate from the Sourceforge one to the Github wiki, but I'm not sure if we're *officially* designating the Github wiki to be the authoritative source of Ganglia knowledge.

Jeff
Re: [Ganglia-developers] Riemann pull request for Ganglia
And just to close the loop... Ganglia now gets a mention on the Riemann website http://riemann.io/clients.html

--Nick
[Ganglia-developers] Riemann pull request for Ganglia
Hi developers,

I've done some work recently to add Riemann support to Ganglia for which I've submitted a pull request [1].

We are currently using this in production at the Guardian to alert in real-time off tens of thousands of metrics. (You can see our config here https://github.com/guardian/riemann-config )

It would be great if this was accepted by upstream as I know there is a lot of interest in alerting off real-time metric data recently and this is a solution that scales and makes use of a lot of the metadata that Ganglia associates with a metric/host.

Feedback welcome.

Regards,
Nick

[1] https://github.com/ganglia/monitor-core/pull/124
Re: [Ganglia-developers] [Ganglia-general] Grid of Grids Broken Again in 3.6.0? Is this a different problem?
Hi Bernard,

I think this is the bug in federation that you might be thinking of as I've mentioned it before. I don't have a fix for this. It's quite a large patch and I've never looked at this part of the codebase before.

Regards,
Nick
Re: [Ganglia-developers] [Ganglia-general] Grid of Grids Broken Again in 3.6.0? Is this a different problem?
I have confirmed that this patch [1] broke writing of the root summaries for the top-level gmetad when in a grid-of-grids setup.

What should we do? Revert the patch, attempt to debug it, or just log a github issue to track it for now?

Regards,
Nick

[1] https://github.com/ganglia/monitor-core/pull/92
Re: [Ganglia-developers] Possibility of using different serialization format than XDR
Further improvements could probably be had in the arena of node multi-tenancy and/or arbitrary node grouping/clustering.

Could you expand on what you mean by multi-tenancy, please? I'm curious.

--Nick.

On 29 Jul 2013, at 19:21, Dave Rawks d...@pandora.com wrote:

I'm still trying to figure out what you're trying to improve here. XDR seems like a fine, standard, lightweight serialization protocol to use. It is already implemented and we've already got some protocol handling for backwards compat for really old ganglia monitor clients. What is there to gain from switching, aside from having something new and shiny that needs to be supported in addition to the existing stuff? We aren't serializing any custom data types or references or anything aside from some floats, ints, and a couple of strings. XDR compute overhead is not hurting performance, especially on modern hardware; the payloads aren't very big, and the tuning of various check timings and metric validity timings further reduces the amount of chatter on the wire.

If you want to introduce some more modern code to ganglia, I'd suggest adding support for pushing gmond communications into a modern pub/sub message queue framework. I've never heard of anybody having problems with our serialization, but there is frequent and often confusing troubleshooting around multicast vs unicast and the various infrastructural/configuration tweaks to make the most out of those. Further improvements could probably be had in the arena of node multi-tenancy and/or arbitrary node grouping/clustering. Maybe I'm missing something that you've said or implied already, but this just seems like change for the sake of change.

-Dave

On 07/28/2013 02:09 PM, Nikhil wrote:

Hi, Thanks for the response. I see there is no aversion to the idea of considering a different serialization format/protocol. Before we have any contribution in terms of code/specifications, what would be the ideal choice among these: http://en.wikipedia.org/wiki/Comparison_of_data_serialization_formats for replacing the current XDR implementation in ganglia? Some things that come to mind for discussion: the current XDR payload size, which we should not exceed; the performance overhead in processing and storing; the availability of libraries; and ease of use. As Dave also mentions, platform independence, portability (endianness?) and efficiency are also critical things to consider. While ASN.1 does offer all of this, some of the others that I wanted to consider are MessagePack and UBJson. Format specs are described here for MessagePack http://wiki.msgpack.org/display/MSGPACK/Format+specification and for UBJson http://ubjson.org/ Let me know what you all think would be the ideal choice. Thanks.

On Sat, Jul 27, 2013 at 6:13 AM, Vladimir Vuksan vli...@veus.hr wrote:

I am not necessarily opposed to it if it's implemented in such a way as not to break backwards compatibility. Someone would need to contribute some code.

Vladimir

On Fri, 26 Jul 2013, Dave Rawks wrote:

I'm curious to hear what you think is going to be more efficient, platform agnostic and portable than XDR. ASN1 would be the only thing I would even consider using instead, but it is arguable whether it would be worth the pain of supporting more than one serialization format, and it certainly doesn't seem sane to break all backwards compatibility to switch to something new unilaterally. ASN1 /might/ be a reasonable alternative to XDR, but I don't see what advantages this could possibly bring.

-Dave

On 7/26/13 10:46 AM, Nikhil wrote:

Hi, Considering that there are better, more compute-efficient open binary serialization formats out there, how hard would it be to make Ganglia use them instead of XDR? Can the serialization format engines be pluggable, instead of being closely integrated with XDR? Is it still worth continuing to stick with XDR? The intention is to understand the possibilities and to discuss what would be best to go with, if it's appropriate. I am really hoping to see a reply from the authors of ganglia core :-)

Thanks, Nikhil
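For readers weighing Dave's point that the payload is only floats, ints and a few strings, here is a minimal sketch of how XDR (RFC 4506) lays out an unsigned integer and a string on the wire. It is hand-rolled purely for illustration and is not Ganglia's code; the function names xdr_put_u32 and xdr_put_string are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: write a 32-bit unsigned integer in XDR form,
 * i.e. 4 bytes, most significant byte first. Returns bytes written. */
size_t xdr_put_u32(unsigned char *out, uint32_t v)
{
    assert(out != NULL);
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
    return 4;
}

/* Hypothetical helper: write a string in XDR form, i.e. a 4-byte length,
 * the bytes themselves, then zero padding up to a multiple of 4.
 * Returns bytes written, or 0 if the buffer (cap bytes) is too small. */
size_t xdr_put_string(unsigned char *out, size_t cap, const char *s)
{
    assert(out != NULL && s != NULL);
    size_t len = strlen(s);
    size_t padded = (len + 3) & ~(size_t)3;   /* round up to 4-byte boundary */

    if (cap < 4 + padded)
        return 0;

    xdr_put_u32(out, (uint32_t)len);
    memcpy(out + 4, s, len);
    memset(out + 4 + len, 0, padded - len);   /* XDR padding bytes are zero */
    return 4 + padded;
}
```

Under these rules a three-character metric name such as "cpu" costs 8 bytes on the wire (4 for the length, 3 for the characters, 1 for padding), which gives a rough feel for the overhead being discussed.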
Re: [Ganglia-developers] coverity
Hi Chris, I think it's a good idea. There are definitely some memory leaks that it would be good to track down. Maybe coverity could help. It's worth a try at least.

--Nick.

On Tue, Jul 23, 2013 at 11:25 AM, Chris Burroughs chris.burrou...@gmail.com wrote:

coverity offers free scanning for open source projects. Is there any interest in adding the ganglia C code there? I think all that's required is one of the developers clicking 'sign up'. http://scan.coverity.com/
Re: [Ganglia-developers] Problems with GMOND leaking memory
Please send a copy of your gmond.conf file to the list. Your explanation of what you changed is difficult to follow.

Regards, Nick

On 3 Jul 2013, at 02:42, Valter Silva valter.si...@movile.com wrote:

I set up *gmond* with rpmbuild ganglia.spec, for CentOS 5.9 and CentOS 6.4 with ganglia.3.6.tar.gz. Everything looks fine, but when I didn't set *deaf=yes* and didn't remove the related configuration like *listen* in *tcp* or *udp*, memory usage jumped from 9MB to 10GB. This crashed many of my servers. Any idea why this happens?

-- Regards, Valter Silva, Infrastructure Analyst, http://www.movile.com/
Re: [Ganglia-developers] Ganglia in the EC2 cloud
Hi Demetri,

Could you try building from my personal development branch? It is an up-to-date merge with Ganglia master with one additional potential bug fix ( https://github.com/satterly/monitor-core/commit/ed3ad9d57b1d582503ef0104e17f7919044c7617 ). If this version runs without segfaulting I'll push it to the ganglia feature/cloud branch.

And thanks for the pull request. It seems that it needs to be rebased on master. However, if your testing of the above branch proves successful we can rebase your patch against that. Let me know how you get on.

Regards, Nick

On Mon, Jun 17, 2013 at 11:53 PM, Demetri Mouratis dmour...@gmail.com wrote:

Nicholas Satterly nfsatterly at gmail.com writes:

[1] https://github.com/ganglia/monitor-core/compare/master...feature/cloud

Nick, Thanks for your work in implementing this feature. I'm in the same boat with a larg(ish) EC2 (VPC) deployment and sorely missing ganglia in this new environment.

I've found and fixed one bug pertaining to localtime versus GMT in the EC2 apr request: https://github.com/ganglia/monitor-core/pull/112 Amazon expects all timestamps to be in GMT. Some of my hosts have non-GMT set localtimes (don't ask).

Now I'm facing a consistent segfault when the number of nodes in the cluster is large (>= 17).
The error looks like:

    [discovery.ec2] Found 17 matching instances
    [discovery.ec2] adding i-10ad3c25, udp send channel private_ip 10.10.1.211:8649
    [discovery.ec2] adding i-34296506, udp send channel private_ip 10.10.1.204:8649
    [discovery.ec2] adding i-1894ff2a, udp send channel private_ip 10.10.1.240:8649
    [discovery.ec2] adding i-1a94ff28, udp send channel private_ip 10.10.1.241:8649
    [discovery.ec2] adding i-cc99f2fe, udp send channel private_ip 10.10.1.214:8649
    [discovery.ec2] adding i-c81c8dfd, udp send channel private_ip 10.10.2.115:8649
    [discovery.ec2] adding i-a2d36990, udp send channel private_ip 10.10.1.116:8649
    [discovery.ec2] adding i-24235016, udp send channel private_ip 10.10.1.234:8649
    [discovery.ec2] adding i-2401bc11, udp send channel private_ip 10.10.2.216:8649
    [discovery.ec2] adding i-2a235018, udp send channel private_ip 10.10.1.235:8649
    [discovery.ec2] adding i-3a01bc0f, udp send channel private_ip 10.10.2.217:8649
    [discovery.ec2] adding i-3801bc0d, udp send channel private_ip 10.10.2.218:8649
    [discovery.ec2] adding i-d27015e7, udp send channel private_ip 10.10.2.164:8649
    [discovery.ec2] adding i-2823501a, udp send channel private_ip 10.10.1.238:8649
    [discovery.ec2] adding i-3a07620f, udp send channel private_ip 10.10.2.177:8649
    [discovery.ec2] adding i-422a4f77, udp send channel private_ip 10.10.2.64:8649
    [discovery.ec2] adding i-3890f10a, udp send channel private_ip 10.10.1.102:8649
    . . .
    [discovery.ec2] Refreshing node list...
    [discovery.cloud] access key=AKIAJNY4GBUKJRXY4JDA, secret key=DxvJ
    [discovery.ec2] using host_type [private_ip], tags [environment= TEST], groups [], availability_zones []
    [discovery.ec2] using endpoint ec2.us-west-2.amazonaws.com - ec2.us-west-2.amazonaws.com
    [discovery.ec2] URL-encoded API request ec2.us-west-2.amazonaws.com?AWSAccessKeyId=AKIAJNY4GBUKJRXY4JDA&Action=DescribeInstances&Filter.1.Name=instance-state-name&Filter.1.Value=running&Filter.2.Name=tag%3Aenvironment&Filter.2.Value=TEST&SignatureMethod=HmacSHA256&SignatureVersion=2&Timestamp=2013-06-17T22%3A41%3A39Z&Version=2012-08-15&Signature=O7qmbgbbZnMk8njNQiEo4YLlDIVhM9NAF4171NoMTj4%3D
    [discovery.ec2] HTTP response code 200, 99664 bytes retrieved
    Segmentation fault

The crash is reproducible, happens about 2 minutes after start, and can be avoided by renaming one of the hosts' environment= tags to remove it from the cluster. I haven't been able to come up with a fix for this issue, and I'm sufficiently out of my depth at this point to ask for help.

Thanks. -D
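The localtime-versus-GMT bug Demetri mentions concerns the Timestamp parameter visible in the request above. As a rough sketch of the idea only, the following formats a timestamp in UTC regardless of the host's local timezone. It uses plain libc rather than APR, so it is not the code from the pull request, and the helper name format_utc_timestamp is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L  /* for gmtime_r under strict compiler modes */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: format t as an ISO 8601 UTC timestamp such as
 * 2013-06-17T22:41:39Z. Returns the number of characters written
 * (excluding the terminating NUL), or 0 on failure. */
size_t format_utc_timestamp(time_t t, char *buf, size_t buflen)
{
    assert(buf != NULL);
    struct tm tm_utc;

    /* gmtime_r() always converts to UTC; localtime_r() would apply the
     * host's timezone, which is the kind of bug described above. */
    if (gmtime_r(&t, &tm_utc) == NULL)
        return 0;

    return strftime(buf, buflen, "%Y-%m-%dT%H:%M:%SZ", &tm_utc);
}
```

The result still has to be URL-encoded before it goes into the query string (the colons become %3A, as in the logged request).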
[Ganglia-developers] Sending aggregated cluster metrics to Graphite
Hi, We're looking at using the support for sending ganglia metrics to graphite; however, I've just worked out that aggregated cluster metrics are not sent. Can anyone explain why this might be the case? Could it be because you would actually need to send two metrics for every cluster metric, i.e. the num and sum? Even so, is that an issue? Thanks, Nick
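To make the "two metrics per cluster metric" idea concrete, here is a small sketch that formats a sum and a num as two lines of Graphite's plaintext protocol (path, value, timestamp, newline). This is not gmetad's Graphite code; the function name and the .sum/.num path suffixes are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: write two Graphite plaintext lines for one
 * cluster summary metric, one carrying the sum and one the number of
 * reporting hosts. The ".sum"/".num" naming is an assumption.
 * Returns the number of characters written, or -1 if buf is too small. */
int format_summary_lines(char *buf, size_t buflen,
                         const char *prefix, const char *metric,
                         double sum, unsigned int num, time_t ts)
{
    assert(buf != NULL && prefix != NULL && metric != NULL);

    int n = snprintf(buf, buflen,
                     "%s.%s.sum %f %ld\n"
                     "%s.%s.num %u %ld\n",
                     prefix, metric, sum, (long)ts,
                     prefix, metric, num, (long)ts);

    if (n < 0 || (size_t)n >= buflen)
        return -1;   /* encoding error or output truncated */
    return n;
}
```

With both values in Graphite, an average could be derived at query time by dividing one series by the other, so sending two lines per summary metric does not look like a fundamental obstacle.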
[Ganglia-developers] Ganglia in the EC2 cloud
Hi,

A few months back I mentioned that I'd modified gmond to dynamically discover its cluster peers by using the EC2 API to update the udp send channel configuration. Well, we've been running this in production at the Guardian for more than 3 months and it's been a great success.

I think this would be a very useful addition to the Ganglia agent so I'm submitting the code to a separate branch called feature/cloud for review and feedback. Changes to gmond.c have been kept to a minimum [1] and it's all conditionally compiled using --enable-cloud at the moment. The cloud.c code, which does most of the work, will need to be refactored to move the EC2-specific code into a separate function so that it can be extended to use other (more standards-based) cloud APIs that are available, e.g. DeltaCloud and CIMI.

I've written a wiki page that explains this stuff in more detail here ... https://github.com/ganglia/monitor-core/wiki/EC2-Discovery

As I said, feedback (and enhancement requests) very welcome.

Regards, Nick

[1] https://github.com/ganglia/monitor-core/compare/master...feature/cloud
Re: [Ganglia-developers] Ganglia gmetad thread stuck at TCP SYN SENT
Thanks Kostas and Jonathan for your suggestions.

I spent quite a few hours on this and in the end decided that gmetad was working as designed and that adding a specific timeout on a socket connection wasn't needed. This is because the kernel already times out socket connections that fail, or rather it times failures out and then retries several times until it finally gives up. The data collection thread then sleeps for a bit before trying again.

My specific problem was that after sleeping, the data thread was just retrying the same host it failed on last time, which was the instance that had been terminated. This would inevitably fail at some point and the data thread would appear to hang. The solution was to modify gmetad to poll the most recently launched instance by looking at the GMOND_STARTED value, which works well. Hopefully I'll find time to submit this code in a branch in the coming days/weeks.

--Nick.

On Tue, Feb 5, 2013 at 5:28 PM, Kostas Georgiou k.georg...@atreides.org.uk wrote:

On Fri, Jan 25, 2013 at 12:45:10PM +0000, Nicholas Satterly wrote:

Does anyone have any ideas of how the connection could at least be timed out? Keep in mind that the gmetad is multi-threaded so I'm pretty sure that rules out the use of SIGALRM. ... How could a 2 second timeout be enforced on this connect()?

You set O_NONBLOCK on the socket before the connect, then run select with a 2 sec timeout on the socket. From there, if you have a connection (depending on whether select hit the timeout or not and what getsockopt for SO_ERROR returns), you set the socket back to blocking.

Did you see any failures when the machine went away after the connect? I can't remember if we time out while we are reading data from the socket.
[Ganglia-developers] Ganglia gmetad thread stuck at TCP SYN SENT
Hi,

We have a situation here where developers deploy a new version of their app in EC2 by spinning up instances running the new version, adding them to the auto-scaling group and, once all looks good, just terminating the instances with the old app version. Works great for them; however, the ganglia gmetads polling that cluster seem to hang in SYN SENT status during the socket connect to the old instances if they are in the middle of establishing the TCP connection just as the instance is being terminated.

Does anyone have any ideas of how the connection could at least be timed out? Keep in mind that the gmetad is multi-threaded so I'm pretty sure that rules out the use of SIGALRM.

I think the relevant code block is in the g_tcp_socket_new() function in lib/tcp.c here...

    /* Connect */
    rv = connect(sockfd, &s->sa, sizeof(s->sa));
    if (rv != 0)
      {
        close (sockfd);
        free (s);
        return NULL;
      }

How could a 2 second timeout be enforced on this connect()?

Thanks in advance.

--Nick.
Re: [Ganglia-developers] override_ip causing gmond to crash
I believe this was a problem caused by using the wrong APR pool in the apr_pstrcat() call. https://github.com/ganglia/monitor-core/pull/62 --Nick.
Re: [Ganglia-developers] dynamic discovery of hosts in EC2
Hi Paul,

Thanks for your feedback. That was the best solution I came up with too, so I've added this in and it seems to work well. An added side-effect is that the file can also be used for troubleshooting if you need to know exactly where the gmond is sending its metrics to without having to run the agent in debug mode.

Regards, Nick

On Wed, Oct 10, 2012 at 1:38 PM, Paul Hewlett paul.hewl...@arm.com wrote:

Hi Nick

Modify gmond to write a special file /etc/ganglia/ec2.conf with the discovered instances and then modify gmetric to read that file, using a cmdline option perhaps. This change should be lightweight enough for gmetric.

Regards

-- Paul Hewlett, ARM Ltd

From: Nicholas Satterly [mailto:nfsatte...@gmail.com]
Sent: 10 October 2012 13:06
To: ganglia-developers@lists.sourceforge.net
Subject: [Ganglia-developers] dynamic discovery of hosts in EC2

Hi,

I've been hacking on the ganglia gmond code to get the agent to auto-discover other servers in its cluster when running in EC2 [1]. It works a lot like the way elasticsearch does [2].

Does anyone have any suggestions on how I might get gmetric to work in a scalable way if it can't rely on the UDP send destinations being listed in the gmond.conf file? It really is a show-stopper for us at the moment, which is unfortunate because gmond would work brilliantly in EC2 with these changes.

Thanks in advance, Nick

[1] https://github.com/satterly/monitor-core
[2] http://www.elasticsearch.org/guide/reference/modules/discovery/ec2.html and http://www.elasticsearch.org/tutorials/2011/08/22/elasticsearch-on-ec2.html
Re: [Ganglia-developers] dynamic discovery of hosts in EC2
It's currently writing to /var/lib/ganglia/gmond-ec2.conf but I'm flexible... https://github.com/satterly/monitor-core/blob/master/lib/libgmond.c#L614

--Nick.

On Fri, Oct 12, 2012 at 4:02 PM, Paul Hewlett paul.hewl...@arm.com wrote:

Hi Alex

You are correct - it should be /var/lib/ganglia/ec2.conf or maybe even /tmp/ganglia? Also, if the data does not need to persist between reboots then it could be /dev/shm/ganglia/ec2.conf...

Regards

-- Paul Hewlett, ARM Ltd

-----Original Message-----
From: Alex Dean [mailto:a...@crackpot.org]
Sent: 12 October 2012 15:56
To: ganglia-developers@lists.sourceforge.net
Subject: Re: [Ganglia-developers] dynamic discovery of hosts in EC2

On Oct 10, 2012, at 7:38 AM, Paul Hewlett wrote:

Hi Nick

Modify gmond to write a special file /etc/ganglia/ec2.conf with the discovered instances and then modify gmetric to read that file - using a cmdline option perhaps. This change should be lightweight enough for gmetric.

I haven't looked at this code specifically, but just a general suggestion: a process shouldn't typically be able to write to files in /etc. Any data that gmond needs to write out should probably go somewhere in /var.

alex
[Ganglia-developers] dynamic discovery of hosts in EC2
Hi,

I've been hacking on the ganglia gmond code to get the agent to auto-discover other servers in its cluster when running in EC2 [1]. It works a lot like the way elasticsearch does [2].

To get it to work, you add the following stanzas to the gmond.conf...

    /* Dynamic discovery for cloud environments */
    cloud {
      aws_access_key = INSERT_YOUR_ACCESS_KEY
      aws_secret_key = INSERT_YOUR_SECRET_KEY
    }

    discovery {
      type = ec2 /* only ec2 API supported so far */
      # endpoint = https://ec2.amazonaws.com /* only required if in us-east-1 */
      tags = { stage:dev } /* stage:prod */
      groups = { quicklaunch-1 } /* security groups */
      availability_zones = { us-east-1d } /* eg. eu-west-1a */
      discover_every = 90
      host_type = public_dns /* private_ip, public_ip, private_dns, public_dns */
      port = 8649
    }

Then at start-up, gmond uses the filter defined by combining the tags, groups and availability zones that you define in the discovery section to find the list of matching EC2 instances using the EC2 API. Whenever a new instance comes up (as part of a scaling group, or whatever) and sends metrics to existing instances, it triggers those gmonds to do another discovery, which should find the new server. It will also do a rediscovery every so often (by default every 90 seconds) so that instances that have been terminated are removed from its list of UDP send destinations.

This all works really well so far. The only thing I can't work out is how to support gmetric. If I understand gmetric correctly, it works out what the UDP send destinations should be by reading in the gmond.conf file. However, if gmond is using EC2 discovery there are no static destinations listed. One solution might be for gmetric to query the EC2 API for the list the same way gmond does, but this would add quite an overhead to a lightweight CLI. Also, we use gmetric quite a lot (called 1000's of times a minute) on some servers, which would not scale if each gmetric exec had to query the EC2 API first.

Does anyone have any suggestions on how I might get gmetric to work in a scalable way if it can't rely on the UDP send destinations being listed in the gmond.conf file? It really is a show-stopper for us at the moment, which is unfortunate because gmond would work brilliantly in EC2 with these changes.

Thanks in advance, Nick

[1] https://github.com/satterly/monitor-core
[2] http://www.elasticsearch.org/guide/reference/modules/discovery/ec2.html and http://www.elasticsearch.org/tutorials/2011/08/22/elasticsearch-on-ec2.html
Re: [Ganglia-developers] override_ip causing gmond to crash
Hi,

The version in APR instead of homegrown #49 is still causing corruption of the host name field on the server that I was having problems with before [1].

The current version in github is ...

    cb->msg.Ganglia_value_msg_u.gstr.metric_id.host = apr_pstrcat(gm_pool, (char *)( override_ip != NULL ? override_ip : override_hostname ), ":", (char *) override_hostname, NULL);

I've slightly modified the above version to the following and it seems to work ok...

    override_ip = override_ip != NULL ? override_ip : override_hostname;
    cb->msg.Ganglia_value_msg_u.gstr.metric_id.host = apr_pstrcat(gm_pool, override_ip, ":", override_hostname, NULL);

I assume there is some subtle difference between the two that someone on the developer list could explain to me. Do people think this would be robust enough to work in all cases?

Regards, Nick

[1] The HOST NAME tag was corrupted as follows...

    <HOST NAME="U\xc2\xa69" IP="" REPORTED="1348821943" TN="20" TMAX="20" DMAX="86400" LOCATION="unspecified" GMOND_STARTED="0" TAGS="os:Linux datacentre:dev virtual:physical"></HOST>

On Thu, Sep 27, 2012 at 10:23 AM, Nicholas Satterly nfsatte...@gmail.com wrote:

Paul, thanks for that. However, I'd be more inclined to get the APR version working as it should.

Vladimir, were there specific bug reports for gmond crashing? Or any more information to help us narrow down what the root cause may have been?

--Nick.

On Wed, Sep 26, 2012 at 9:20 AM, Paul Hewlett paul.hewl...@arm.com wrote:

Hi Nicholas

The +1 should be +2 in the malloc() call: one for the terminating null and one for the ':' character.

Regards

-- Paul Hewlett, ARM Ltd

From: Nicholas Satterly [mailto:nfsatte...@gmail.com]
Sent: 26 September 2012 00:49
To: ganglia-developers@lists.sourceforge.net
Subject: [Ganglia-developers] override_ip causing gmond to crash

Hi,

I've discovered that on some of our systems (perhaps only half a dozen out of 500 or so) gmond crashes if the override_ip configuration option is set. I've worked out that the problem is something to do with this block of code...

    #if 1
      char* tmpstr = malloc( strlen(( override_ip != NULL ? override_ip : override_hostname )) + strlen( override_hostname ) + 1 );
      strcpy (tmpstr, (char *)( override_ip != NULL ? override_ip : override_hostname ) );
      strcat (tmpstr, ":");
      strcat (tmpstr, (char *) override_hostname);
      cb->msg.Ganglia_value_msg_u.gstr.metric_id.host = tmpstr;
    #endif
    #if 0
      cb->msg.Ganglia_value_msg_u.gstr.metric_id.host = apr_pstrcat(gm_pool, (char *)( override_ip != NULL ? override_ip : override_hostname ), ":", (char *) override_hostname, NULL);
    #endif

What I'm trying to understand at the moment is why the apr_pstrcat version is #if 0 commented out when it seems to work OK during my testing.

Thanks, Nick
Re: [Ganglia-developers] override_ip causing gmond to crash
Paul, thanks for that. However, I'd be more inclined to get the APR version working as it should. Vladimir, were there specific bug reports for gmond crashing? Or any more information to help us narrow down what the root cause may have been?

--Nick.

On Wed, Sep 26, 2012 at 9:20 AM, Paul Hewlett paul.hewl...@arm.com wrote:

Hi Nicholas

The +1 should be +2 in the malloc() call: one for the terminating null and one for the ':' character.

Regards

--
Paul Hewlett
ARM Ltd

From: Nicholas Satterly [mailto:nfsatte...@gmail.com]
Sent: 26 September 2012 00:49
To: ganglia-developers@lists.sourceforge.net
Subject: [Ganglia-developers] override_ip causing gmond to crash

Hi,

I've discovered that on some of our systems (perhaps only half a dozen out of 500 or so) gmond crashes if the override_ip configuration option is set. I've worked out that the problem is something to do with this block of code...

    #if 1
          char* tmpstr = malloc( strlen(( override_ip != NULL ? override_ip : override_hostname )) + strlen( override_hostname ) + 1 );
          strcpy (tmpstr, (char *)( override_ip != NULL ? override_ip : override_hostname ) );
          strcat (tmpstr, ":");
          strcat (tmpstr, (char *) override_hostname);
          cb->msg.Ganglia_value_msg_u.gstr.metric_id.host = tmpstr;
    #endif
    #if 0
          cb->msg.Ganglia_value_msg_u.gstr.metric_id.host = apr_pstrcat(gm_pool, (char *)( override_ip != NULL ? override_ip : override_hostname ), ":", (char *) override_hostname, NULL);
    #endif

What I'm trying to understand at the moment is why the apr_pstrcat version is #if 0 commented out when it seems to work OK during my testing.

Thanks,
Nick

Ganglia-developers mailing list
Ganglia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-developers
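Paul's suggested fix can be sketched as a small standalone function. This is only an illustration of the sizing arithmetic, not the actual gmond patch: the helper name build_host_id is hypothetical, and it assumes override_hostname is never NULL at this point (the quoted code already calls strlen() on it unconditionally).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of gmond): builds "<ip-or-hostname>:<hostname>".
 * Returns a malloc'd string that the caller must free, or NULL if malloc fails. */
char *build_host_id(const char *override_ip, const char *override_hostname)
{
    /* Assumption: the hostname is always set when this code path runs. */
    assert(override_hostname != NULL);

    const char *first = (override_ip != NULL) ? override_ip : override_hostname;

    /* +2 rather than +1: one byte for the ':' separator and one for the
     * terminating NUL. With +1 the final strcat writes one byte past the
     * end of the allocation. */
    size_t len = strlen(first) + strlen(override_hostname) + 2;

    char *tmpstr = malloc(len);
    if (tmpstr == NULL)
        return NULL;

    strcpy(tmpstr, first);
    strcat(tmpstr, ":");
    strcat(tmpstr, override_hostname);
    return tmpstr;
}
```

A one-byte heap overrun like the original is undefined behaviour, so whether it crashes can depend on the allocator and on the lengths of the strings involved; that may be consistent with only a handful of hosts being affected, though the thread does not confirm the root cause.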
[Ganglia-developers] override_ip causing gmond to crash
Hi,

I've discovered that on some of our systems (perhaps only half a dozen out of 500 or so) gmond crashes if the override_ip configuration option is set. I've worked out that the problem is something to do with this block of code...

    #if 1
          char* tmpstr = malloc( strlen(( override_ip != NULL ? override_ip : override_hostname )) + strlen( override_hostname ) + 1 );
          strcpy (tmpstr, (char *)( override_ip != NULL ? override_ip : override_hostname ) );
          strcat (tmpstr, ":");
          strcat (tmpstr, (char *) override_hostname);
          cb->msg.Ganglia_value_msg_u.gstr.metric_id.host = tmpstr;
    #endif
    #if 0
          cb->msg.Ganglia_value_msg_u.gstr.metric_id.host = apr_pstrcat(gm_pool, (char *)( override_ip != NULL ? override_ip : override_hostname ), ":", (char *) override_hostname, NULL);
    #endif

What I'm trying to understand at the moment is why the apr_pstrcat version is #if 0 commented out when it seems to work OK during my testing.

Thanks,
Nick
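One reason the apr_pstrcat variant is attractive is that it works out the total length itself, so there is no hand-counted size to get wrong. The same idea can be sketched in standard C by letting snprintf measure the result first. This is a swapped-in technique for illustration, not what gmond does; the helper name join_with_colon is hypothetical, and unlike apr_pstrcat the result comes from malloc rather than an APR pool, so the caller has to free it.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns a malloc'd "left:right" string, or NULL on
 * failure. The buffer size is derived from snprintf instead of being
 * computed by hand. */
char *join_with_colon(const char *left, const char *right)
{
    assert(left != NULL && right != NULL);

    /* With a NULL buffer and size 0, C99 snprintf writes nothing and returns
     * the number of characters the output needs, excluding the NUL. */
    int needed = snprintf(NULL, 0, "%s:%s", left, right);
    if (needed < 0)
        return NULL;

    size_t size = (size_t)needed + 1; /* room for the terminating NUL */
    char *out = malloc(size);
    if (out == NULL)
        return NULL;

    snprintf(out, size, "%s:%s", left, right);
    return out;
}
```

In the quoted code, the first argument would be the override_ip-or-hostname choice and the second would be override_hostname.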