Re: [Ganglia-developers] [Ganglia-general] [ANNOUNCEMENT] Ganglia meetup Tue Oct 21 in San Francisco (Quantcast HQ)
Also forgot to mention we have a code for FREE parking thanks to our friends at zirx [1], so if you are driving to SF for this meeting (like I am planning to do) all you really need to do is install their app, enter the right address for Quantcast on the map and tap the price to enter your code: GANGLIA. Someone will then be waiting for you at the door to take your car to a safe place.

see you all on the other side, and let's have fun

Carlo

[1] http://zirx.com/

PS. you need an iPhone or Android phone to use their app though

___ Ganglia-developers mailing list Ganglia-developers@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-developers
Re: [Ganglia-developers] moving mod_multicpu out of ganglia to ganglia-modules-linux
On Mon, May 14, 2012 at 01:17:19PM +0200, Daniel Pocock wrote:
> The mod_multicpu code in the main ganglia repo is Linux-only, while most of the other modules are cross-platform

I think it might also work for cygwin but haven't really tried lately; if that is the case, though, moving it will remove this functionality from cygwin for no big gain IMHO.

Most of the python modules are linux specific though, so I would guess your comment was about native modules instead.

> The version in ganglia-modules-linux is based on the same code, with some small enhancements (using arrays instead of string comparisons)

instead of having a forked version, why not make multi-cpu portable? and if you think your linux version is better, why not import it instead?

having a mechanism to identify which OS is supported by each module was something that was missing in the modular architecture from the start (since it was modeled after apache, which doesn't have that requirement), and adding this functionality instead of hacking around the lack of it would be IMHO a better option, even though that would most likely require a binary incompatible change and therefore a different (at least minor) version of ganglia, which seems to be something we are fond of now anyway, considering I've seen some code released as 3.4 already.

Carlo
Re: [Ganglia-developers] new script for automating releases
On Fri, Apr 20, 2012 at 01:54:27AM +0200, Daniel Pocock wrote:
> I've generalised it for building just about anything that works with git/autotools and published it here:
> https://sourceforge.net/p/git2dist/code/?branch=ref%2Fmaster
> It's had two test runs today, flactag-2.0.1 and ganglia-3.3.7

there is either a bug in it, or a misunderstanding of ganglia's agreed workflow [1], if this is what happens after a remote update:

    $ git branch
    * master
    $ git describe --tags
    3.3.5-20-gb98de1e

if you are going to suggest that I should be looking at the release/3.3 branch for the current HEAD of development, or that this would be fixed by merging that back to master sometime later (which implies 3.3 will be abruptly EOL sometime in the future), then that should be spelled out and documented clearly before we find ourselves with a ganglia fork on our hands, or an even bigger repository mess with duplicated commits or, even worse, lost bug fixes.

[1] http://sourceforge.net/apps/trac/ganglia/wiki/how_project_works
Re: [Ganglia-developers] 3.3.5 released today
On Thu, Apr 05, 2012 at 12:52:21AM +0200, Daniel Pocock wrote:
> A number of bugs were found during the testing of 3.3.5 and discussed on the mailing lists.

could a list of these bugs be published somewhere with the release, so that everyone knows what to expect when upgrading (most people are probably still using a patched 3.1.7, as that is what is provided by most distributions)?

off the top of my head there are:

* 2 memory leaks (one probably only in deaf mode)
* gmetad hierarchical mode is broken

> In other words, anyone who is using 3.3.1 or 3.3.0 should not get any new bugs from upgrading to 3.3.5

considering that 3.3.5 doubled the in-memory size of each metric, it is likely to make the memory leak problems worse though

Carlo
Re: [Ganglia-developers] bootstrap / autotools (renamed)
On Wed, Apr 04, 2012 at 06:51:27PM +0200, Daniel Pocock wrote:
> On 04/04/12 18:35, Dave Rawks wrote:
>> be built only with specific versions from Debian current stable (squeeze) especially when the goal of the maintainers appears to be inclusion into testing/new-stable. Personally I would like for the entire tarball to have a nice sensible autotools based build install process
>>
>>     ./configure
>>     make
>>     make install
>>
>> that can be quickly packaged up with a super minimal amount of distro specific munging.
>
> It does do that: someone who downloads the tarball should NOT run the bootstrap script (and should not have to run it).

and that is why, until recently, it wasn't included in the tarball, so no one would get confused.

Carlo
Re: [Ganglia-developers] autotools / git release branches / web version.php
On Mon, Mar 26, 2012 at 04:57:31PM +0100, Daniel Pocock wrote:
> Having the option to work both ways may just continue to create traps for people who know one half of the project and not so much about the other.

ironically, the main driver for the 3.3.x series was to import the new web frontend, and it used to (mostly) work for the first 2 releases of that series while keeping the possibility of building independently.

it would seem IMHO that all the extra hacking that was done to the build and release process, including the (mostly ignored) documentation, hasn't improved its reliability or clarity, as shown by the fact that the last package release just masquerades the obsolete version 3.3.1 as 3.3.5 for web.

Carlo
Re: [Ganglia-developers] release/3.3 branch created
On Tue, Mar 27, 2012 at 10:46:51AM -0400, Vladimir Vuksan wrote:
> Therefore I'd like to dump branches for now and just stay on mainline.

+1, keep it simple; and unless my view of git log is incorrect, no features (only some bugfixes) were added since 3.3.2 anyway, which might be the reason why the last release notes available (which already have a due date that is 1 month old) haven't been updated:

https://github.com/ganglia/monitor-core/wiki/Release-Notes

Carlo
Re: [Ganglia-developers] 3.3.5 tagged
On Mon, Mar 26, 2012 at 04:50:18PM +0100, Daniel Pocock wrote:
> Release 3.3.5
> The release has now been tagged in git
> commit = 9db9beea062c7ce5e5b4d10ed553c9b7cea7642e

wrong bundle:

    carenas@dell ~/src/git/ganglia $ git describe --tags
    3.3.5
    carenas@dell ~/src/git/ganglia $ cd web/
    carenas@dell ~/src/git/ganglia/web $ git describe --tags
    3.3.2-3

while web has since had a lot more fixes added, as shown by:

    carenas@dell ~/src/git/ganglia-web $ git describe --tags
    3.3.4-14-g7383ed8
    carenas@dell ~/src/git/ganglia-web $ git diff --stat 3.3.2-3.. | cat
     Makefile                         |  2 +-
     api/host.php                     |  9 ++---
     cluster_view.php                 |  4 ++--
     functions.php                    | 15 +++
     graph.php                        |  5 +++--
     header.php                       |  1 +
     inspect_graph.php                |  4 ++--
     templates/default/views_view.tpl | 16
     8 files changed, 42 insertions(+), 14 deletions(-)

Carlo
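The `git describe --tags` strings quoted in this thread follow the pattern `<tag>[-<commits>-g<abbreviated hash>]`, and a tag such as `3.3.2-3` is easy to misread as "3 commits after 3.3.2". A minimal sketch of how such a string could be split apart, assuming only that documented output shape; `parse_describe` and `struct describe_info` are hypothetical names for illustration and are not part of ganglia or git:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the pieces of a `git describe` string. */
struct describe_info {
    char tag[64];   /* nearest tag name */
    long commits;   /* commits on top of the tag, 0 when exactly on it */
    char hash[41];  /* abbreviated commit hash, empty when exactly on the tag */
};

/* Splits "<tag>[-<n>-g<hash>]". Returns 0 on success, -1 if the tag does not fit. */
int parse_describe(const char *s, struct describe_info *out)
{
    const char *g = NULL, *p, *count;
    size_t taglen = strlen(s);

    memset(out, 0, sizeof *out);

    /* Locate the last "-g"; only that one can start the hash suffix. */
    for (p = s; (p = strstr(p, "-g")) != NULL; p += 2)
        g = p;

    if (g != NULL) {
        size_t hashlen = strlen(g + 2);

        /* Walk back from "-g" to the first character of the commit count. */
        count = g;
        while (count > s && count[-1] != '-')
            count--;

        /* Accept the suffix only if it is "-<digits>-g<hex>" after a non-empty tag. */
        if (count > s + 1 && count < g
            && strspn(count, "0123456789") == (size_t)(g - count)
            && hashlen > 0 && hashlen < sizeof out->hash
            && strspn(g + 2, "0123456789abcdef") == hashlen) {
            out->commits = strtol(count, NULL, 10);
            memcpy(out->hash, g + 2, hashlen);
            taglen = (size_t)(count - 1 - s);
        }
    }

    if (taglen == 0 || taglen >= sizeof out->tag)
        return -1;
    memcpy(out->tag, s, taglen);  /* out was zeroed, so the tag stays NUL-terminated */
    return 0;
}
```

With this sketch, `3.3.4-14-g7383ed8` splits into tag `3.3.4` plus 14 commits, while `3.3.2-3` has no `-g` suffix and is kept whole as a tag name.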
Re: [Ganglia-developers] autotools / git release branches / web version.php
On Mon, Mar 26, 2012 at 04:23:46PM +0100, Daniel Pocock wrote:
> We need to get version into version.php

all you need to do is run make in that directory and it will be done for you, if you follow the documentation.

yes, the 3.3.4 release (and lower) is missing that as well, because the procedure you followed to make the package was incomplete, but the wiki has been updated and the file has been committed with the right version, so hopefully the 3.3.5 package will be fine.

Carlo
Re: [Ganglia-developers] 3.3.3 tagged
On Wed, Mar 21, 2012 at 07:59:16PM +0100, Daniel Pocock wrote:
> On 21/03/12 19:48, Vladimir Vuksan wrote:
>> I agree with Alex. We are churning through too many versions. I would personally be OK with overriding the existing 3.3.2 tag and going with 3.3.2 instead of 3.3.4.
>
> Having been involved in the releases between 3.1.2 and 3.1.7, I accept some of the responsibility if people did find it problematic
>
> That is why I put out a test tarball, only tagged 3.3.3dp1, before tagging 3.3.3 - so people did have 24 hours to evaluate

and that resulted (like in the 3.1.2 to 3.1.7 cycle) in a couple of obvious issues that were found after the release tag was made, and therefore in a couple more releases.

which probably points to the fact (which keeps getting ignored) that the testing community for ganglia is very small (per sourceforge download statistics there were 10 downloads for each one of those prereleases) and not able to respond in the timeline you suggest, especially when:

* no information about what has changed is provided, so no one knows where to look
* there is no standard battery of tests to run, nor enough time for testers to build their own packages and deploy them in some test cluster to see how they behave.
* the target audience for this product are sysadmins, and so providing binaries and making broader announcements (also including ganglia-users) would be recommended so that prerelease testing is exhaustive.
* there has obviously been little testing before making the release tar, and so those few testers eventually get more tired as the releases keep increasing and demanding they start from scratch each time.

the end result of course being that the quality of ganglia at release time is not what I am sure we all would like to see, and far from perfect.

usually package maintainers don't even bother to get involved with prereleases, but they would be IMHO an important part of that testing if we are to aim for quality final releases.

Carlo
Re: [Ganglia-developers] 3.3.3 tagged
On Thu, Mar 22, 2012 at 01:52:04PM +0000, Daniel Pocock wrote:
> On 22/03/2012 13:33, Carlo Marcelo Arenas Belon wrote:
>> On Wed, Mar 21, 2012 at 07:59:16PM +0100, Daniel Pocock wrote:
>>> On 21/03/12 19:48, Vladimir Vuksan wrote:
>>>> I agree with Alex. We are churning through too many versions. I would personally be OK with overriding the existing 3.3.2 tag and going with 3.3.2 instead of 3.3.4.
>>>
>>> Having been involved in the releases between 3.1.2 and 3.1.7, I accept some of the responsibility if people did find it problematic
>>>
>>> That is why I put out a test tarball, only tagged 3.3.3dp1, before tagging 3.3.3 - so people did have 24 hours to evaluate
>>
>> and that resulted (like in the 3.1.2 to 3.1.7 cycle) in a couple of obvious issues that were found after the release tag was made and therefore in a couple releases more.
>
> Let's not have a discussion about `obvious' issues: Ganglia is supported on a large number of platforms, but I'm not sure if everyone here is testing every platform. I've never run it on AIX or a non-Intel based Linux, for example.

obvious here means:

* does it have the right version?
* does it build?
* can you make a package out of it?
* does it require a flag day (compatibility or feature wise)?

all of those SHOULD be resolved without having to bump a version number, because they are things we should be able to test before we make a package that is meant for public consumption.

a version number change in a package is normally associated with changes in features or bugs, and therefore requires more focused testing than the above.

>> specially when :
>> * no information about what has changed is provided, so no one knows where to look
>
> There was previously a change log, I'm not sure what happened to that
>
> Do we just rely on the git logs (maybe a script to extract them to the web page too)? Or should someone be obliged to make a proper report with each release?

https://github.com/ganglia/monitor-core/wiki

>> * the target audience for this product are sysadmins, and so providing binaries and making broader announcements (also including the ganglia-users) would be recommended so that prerelease testing is exhaustive.
>
> I deliberately avoided that, because it should not be seen as an official release yet, and it could be tiresome helping less experienced users evaluate it. I would prefer to suggest that those sysadmins who want to test bleeding edge stuff join the dev list.

you missed the point: it is not that they want to test bleeding edge stuff, as much as that we want them to test our bleeding edge stuff, so that when it gets released to the public all issues have been ironed out.

>> usually package maintainers don't even bother to get involved with prereleases, but would be IMHO and important part of that testing if we are to aim for a quality final releases.
>
> I'm actually testing each of the 3.3.x series on OpenCSW. I hope to have binary packages in experimental very soon for people to try.

and I am sure debian has experimental, fedora has rawhide and opensuse has tumbleweed, and there are plenty of other options we could be using to widen our test base if we would just meet their requirements and ask for their help.

Carlo

PS. I usually test on gentoo amd64, but am not sure if ~amd64 would be the right place for the packages we put out as prereleases
Re: [Ganglia-developers] releasing 3.3.2 today?
On Tue, Mar 20, 2012 at 12:09:29PM +0000, Daniel Pocock wrote:
> Does anyone want to sneak in any last minute changes before I tag 3.3.2 and make the tarball available for testing?

there is already a published tag named 3.3.2; if you are not going to release that, then it will be better if we skip that release number and aim for 3.3.3

I would also recommend that for testing you tag it like 3.3.2pre1 or something like that.

Carlo
Re: [Ganglia-developers] releasing 3.3.2 today?
On Tue, Mar 20, 2012 at 05:36:56PM +0000, Daniel Pocock wrote:
> On 20/03/2012 17:34, Bernard Li wrote:
>> On Tue, Mar 20, 2012 at 10:03 AM, Daniel Pocock <dan...@pocock.com.au> wrote:
>>> I agree with that approach, with a slight variation - I'll tag it as 3.3.3dp1 (after adding the ChangeLog file)
>>
>> Quick question -- does this prevent RPM upgrading? i.e. 3.3.3dp1 -> 3.3.3?
>
> It is just a tag to help us keep track of what we test, it is not intended for versioning a binary package

AFAIK the version of this release will be 3.3.2, since micro numbers no longer exist.

Carlo
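Bernard's upgrade question comes down to how a packaging tool orders `3.3.3dp1` against `3.3.3`. The following is a simplified sketch of a segment-wise comparison in the style of the one RPM is generally described as using (runs of digits and runs of letters compared in turn, with leftover segments counting as newer). It is an assumption-laden illustration, not RPM's actual code: `version_compare` and `cmp_segment` are hypothetical names, and epochs, release fields and special characters such as `~` are deliberately ignored.

```c
#include <ctype.h>
#include <string.h>

/* Compare two segments of the same class (both numeric or both alphabetic). */
static int cmp_segment(const char *a, size_t la, const char *b, size_t lb, int numeric)
{
    size_t n;
    int c;

    if (numeric) {
        /* Ignore leading zeros, then the longer number is the larger one. */
        while (la > 1 && *a == '0') { a++; la--; }
        while (lb > 1 && *b == '0') { b++; lb--; }
        if (la != lb)
            return la > lb ? 1 : -1;
    }
    n = la < lb ? la : lb;
    c = memcmp(a, b, n);
    if (c != 0)
        return c > 0 ? 1 : -1;
    if (la != lb)
        return la > lb ? 1 : -1;
    return 0;
}

/* Returns 1 if a sorts as newer than b, -1 if older, 0 if equal. */
int version_compare(const char *a, const char *b)
{
    for (;;) {
        const char *sa, *sb;
        int numeric, c;

        /* Separators such as '.' or '-' only delimit segments. */
        while (*a && !isalnum((unsigned char)*a)) a++;
        while (*b && !isalnum((unsigned char)*b)) b++;
        if (!*a || !*b)
            break;

        sa = a;
        sb = b;
        numeric = isdigit((unsigned char)*a) != 0;
        if (numeric) {
            while (isdigit((unsigned char)*a)) a++;
            while (isdigit((unsigned char)*b)) b++;
        } else {
            while (isalpha((unsigned char)*a)) a++;
            while (isalpha((unsigned char)*b)) b++;
        }
        if (b == sb)  /* segments of different class: numeric counts as newer */
            return numeric ? 1 : -1;

        c = cmp_segment(sa, (size_t)(a - sa), sb, (size_t)(b - sb), numeric);
        if (c != 0)
            return c;
    }
    if (!*a && !*b)
        return 0;
    return *a ? 1 : -1;  /* whichever string still has segments left is newer */
}
```

Under this ordering `version_compare("3.3.3dp1", "3.3.3")` is positive, so a binary package versioned `3.3.3dp1` would sort as newer than the final `3.3.3`. That is why it matters that the `dp1` or `pre1` suffix stays a git tag and never becomes a package version.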
Re: [Ganglia-developers] Ganglia 3.3.1 configure.in broken, 3.3.2 needed
On Sat, Mar 10, 2012 at 04:25:16PM -0500, Vladimir Vuksan wrote:
> I am not married to package-ganglia-release so anything that helps us long term is a win.

I think the problems are not with the tools but with the process, as was spelled out in the original list of bullet points.

I agree though that simplifying the tools also goes together with making the process simpler and less prone to failures.

> On Sat, 10 Mar 2012, Daniel Pocock wrote:
>> a) the tag doesn't cover the ganglia-web stuff

this is not correct, but it is made complicated by the fact that the tags are not static, not standard (3.3.2-3 might have been better called 3.3.2.3 IMHO), and that they do not match:

    $ git describe --tags
    3.3.2
    $ cd web/
    $ git describe --tags
    3.3.2-1

>> b) the tag is created before testing (which is not necessary when using git, you can tag after you test, because a tag is just a checksum of what you tested)

more importantly, you could end up pushing that tag by mistake and then changing it later (something that wouldn't work, since you can't force every clone to delete and reacquire that tag).

if enforcing having a tag is going to be used in this way, it will be better to stick to names like 3.3.2pre1, or to use the old svn standard of not updating the main version until after it has been tested, and releasing prereleases with versions like 3.1.1.x (where x used to be the svn revision, but now would have to be a monotonically increasing number)

I'd prefer using the 'pre' notation, which is why I submitted 74ddc9e, so hopefully it will be more difficult to have a release like 3.3.1 where the version of the package (and associated libraries) was wrong.

the most important part of this being, of course, that it makes it possible to do more coordinated and exhaustive testing before release.

Carlo
Re: [Ganglia-developers] Ganglia 3.3.1 configure.in broken, 3.3.2 needed
On Thu, Mar 08, 2012 at 04:34:19PM +0100, Daniel Pocock wrote:
> Michael, do you have write access on the wiki? I think we need to get this distribution-specific stuff captured there along with the general notes I provided below.

having these instructions added to the codebase, just like README.WIN is, could help too, especially considering there is a fair amount of confusion now, with information (not all of it consistent) spread between the multiple wikis and the website.

Carlo
Re: [Ganglia-developers] Investigating feasibility of moving repo to Github
On Sun, Jul 10, 2011 at 04:28:18PM -0400, Vladimir Vuksan wrote:
> Any thoughts on why we shouldn't make the Github our primary repository ?

as was explained a long time ago, when I proposed the same and was rejected, we will also need to change our scripts so that they are able to make a package without subversion:

* the svn release is used as the MICRO release number
* the changelog is created dynamically from svn log
* the development tracks svn changelog numbers to keep track of merges from trunk, since svn was (until recently) too limited to do so automatically; this last one is something that git would do automatically, and it will maybe only be reflected in the way we do integrations and how they get approved

using git allows for a much more dynamic development, especially since there are already significant codebases that are being maintained in parallel, and using svn limits how easily they could be integrated back.

along that same line, it would most likely be a good idea to allow some time for the people who already have diverged trees using svn to either get those changes integrated there, or move them to their own trees using git, and so during that period keeping both trees in sync somehow would be needed.

Carlo
Re: [Ganglia-developers] Investigating feasibility of moving repo to Github
On Sun, Jul 10, 2011 at 09:27:28PM -0400, Jesse Becker wrote:
> My only concern is with the import process itself.

any import process that I know of from svn to git should at least preserve the history; what is your concern specifically here?

> There is a lot of important metadata in the existing SVN repository.

are you referring to which files are executable or ASCII and stuff like that? tools should most likely be able to translate them into their corresponding git flags

if you are talking about the external dependency on web/dwoo that was added in trunk, and therefore now also in 3.2, that would need to be translated as well, but git submodules allow for that.

> I believe that this should be completely preserved, either directly within the git repository, or as a separate standalone (and frozen) SVN repository. The commit logs, test branches, and history is too important to lose.

the test branches that are no longer open (because they were already merged back) wouldn't need to be migrated IMHO; as for the other branches that were opened but never merged back, they should probably be migrated over as topic branches, but later weeded out after their good parts have been merged back, to avoid confusion.

git allows you to have an unlimited number of local branches in your repository anyway, for all the topics you feel like.

Carlo
Re: [Ganglia-developers] Investigating feasibility of moving repo to Github
On Mon, Jul 11, 2011 at 01:05:47PM -0400, Jesse Becker wrote:
> On Mon, Jul 11, 2011 at 12:42, Carlo Marcelo Arenas Belon <care...@sajinet.com.pe> wrote:
>> On Sun, Jul 10, 2011 at 09:27:28PM -0400, Jesse Becker wrote:
>>> I believe that this should be completely preserved, either directly within the git repository, or as a separate standalone (and frozen) SVN repository. The commit logs, test branches, and history is too important to lose.
>>
>> the test branches that are no longer open (because they were already merged back) wouldn't need to be migrated IMHO, as for the other branches that were open but never merged back, the should be probably migrated over as well as topic branches but later weeded out after their good parts had been merged back, to avoid confusion.
>
> Closed branches can remain closed, but I still think they should be kept as a record, if nothing else.

For keeping a record of them, it would be easier to keep svn around in a read-only way, as it is (nearly) impossible to reconstruct the merges and the full history in svn anyway, since it was only recently that metadata was added for keeping track of the merges.

The equivalent in git of an svn merge operation without metadata (as was the default until very recently) is to do `git merge --squash`, which doesn't keep track of the development history, and so migrating those branches into git isn't very expressive and is instead just a waste.

Carlo
Re: [Ganglia-developers] Writing a Makefile for manpages in mans/
On Mon, Feb 28, 2011 at 12:11:16PM -0800, Bernard Li wrote:
> On Sat, Feb 26, 2011 at 8:56 AM, Carlo Marcelo Arenas Belon <care...@sajinet.com.pe> wrote:
>> and also requires some post processing for the right formatting :
>> http://www.mail-archive.com/ganglia-developers@lists.sourceforge.net/msg05612.html
>
> In the email, you said:
>
>   after the binaries are build then pipe them to help2man, then some sed to replace the ` if I recall correctly.
>
> So do we want ` or not, because it looks like the code in trunk right now has `:

obviously I didn't recall correctly ;), especially considering that the source of the backticks is actually gengetopt.

> I guess it all depends on the locale of your system. On my RHEL6b2 system, everything is single quote and there are no `.

which version of help2man? which locale? and which version of the binaries are you calling that don't have backticks in --help?

> Are you aware of any other post-processing that needs to be done?

* removing the version (--version-string= could help as a workaround)
* a descriptive name (-n"manual page for Ganglia Status Tool" for gstat)
* removing the misalignment that is added in the DESCRIPTION with the package name, version and the '.SS Purpose\n.IP' string, which was removed with commit 1132

> I think the only thing left is slightly better formatting for AUTHORS and COPYING then we can use those as template for the manpages (in my Makefile I generate a help2man include template that has AUTHORS, COPYING and BUGS).

I presume by first creating suitable include files for -i

> What do you think?

since I don't have such a Makefile, there is not much I can comment on, but my attempt at generating a test updated man file for gstat showed the formatting was really off compared with the original, so I hope you had better luck:

    $ help2man --version-string= -N -n"manual page for Ganglia Status Tool" -i AUTHORS -i COPYING ./gstat > ../mans/gstat.1

Carlo

PS. using help2man 1.38.2
Re: [Ganglia-developers] Dynamically resizable buffer for slurpfile()
On Thu, Feb 24, 2011 at 10:44:58PM +0000, Kostas Georgiou wrote:
> On Thu, Feb 24, 2011 at 12:25:16PM +0000, Carlo Marcelo Arenas Belon wrote:
>> On Wed, Feb 23, 2011 at 09:42:56AM -0800, Bernard Li wrote:
>>> read:
>>>   read_len = read(fd, db, buflen);
>>>   if (read_len <= 0)
>>>     {
>>>       if (errno == EINTR)
>>>         goto read;
>>>       err_ret("slurpfile() read() error on file %s", filename);
>>>       close(fd);
>>>       return SYNAPSE_FAILURE;
>>>     }
>>
>> this code is not relevant as it is only called when EINTR is received because a signal interrupts the read call (very unlikely)
>
> Shouldn't this be if (read_len < 0), a return of zero from read is possible (EOF for example). If slurpfile is called with buffer=NULL and buffsize equal or a multiple of the file size then we get SYNAPSE_FAILURE. The errno check will be against an old value of errno in this which makes it more likely to hit (still very unlikely though :) and then we have an infinite loop...

good point, even though I would like to think that errno will be reset by the read call and avoid the infinite loop anyway.

Committed revision 2494

Carlo
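Kostas's point is that a return of 0 from read() means end of file, not an error, and that errno is only meaningful after a negative return, since a successful call is not required to clear it. A minimal sketch of a retry wrapper that keeps those cases apart; `read_retry` is a hypothetical name used for illustration and is not the change committed in revision 2494:

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical wrapper around read(): retries only when the call was
 * interrupted by a signal. Returns the number of bytes read, 0 at end of
 * file, or -1 on a real error (errno is then set by read()).
 */
ssize_t read_retry(int fd, void *buf, size_t len)
{
    ssize_t n;

    do {
        n = read(fd, buf, len);
        /* errno is consulted only after a failed call, never after EOF. */
    } while (n < 0 && errno == EINTR);

    return n;
}
```

Because the loop condition checks `n < 0` first, a stale EINTR left in errno by some earlier call cannot turn an end-of-file result into an endless retry.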
Re: [Ganglia-developers] Dynamically resizable buffer for slurpfile()
On Wed, Feb 23, 2011 at 05:12:03PM -0800, Bernard Li wrote:
>> I tested under EL5 and EL6 and it wasn't able to get past the initial buffer size. I believe what I did was:
>
> Correction. It works on EL6, but not on EL5:

most likely the test is just giving inconsistent results, and that is why it now works in EL5, while it didn't before.

> [CentOS 5.5 x86_64 with kernel 2.6.18-194.32.1.el5]
>
>     read(3, "2.6.18-194.32.1.", 16) = 16
>     read(3, "", 16) = 0
>
> [RHEL6b2 x86_64 with kernel 2.6.32-37.el6.x86_64]
>
>     read(3, "2.6.32-37.el6.x8", 16) = 16
>     read(3, "6_64\n", 16) = 5
>
> The issue may be specific to files in /proc/sys, because I tried reading /proc/stat on CentOS 5.5 and it worked fine.

very unlikely, and considering that this is a code modification you made and that only you have, the problem is most likely in your code anyway (maybe it was even miscompiled)

Carlo
Re: [Ganglia-developers] Dynamically resizable buffer for slurpfile()
On Wed, Feb 23, 2011 at 09:42:56AM -0800, Bernard Li wrote:
>> what second pass?
>>
>>     dummy = proc_sys_kernel_osrelease;
>>     rval.int32 = slurpfile("/proc/sys/kernel/osrelease", &dummy,
>>                            MAX_G_STRING_SIZE);
>>
>> why would anyone call slurpfile in a loop anyway? slurpfile doesn't call itself recursively; it just reads as much data as it can into the buffer provided (second parameter).
>
> Sorry I wasn't clear, I meant the goto read loop:
>
>     read:
>       read_len = read(fd, db, buflen);
>       if (read_len <= 0)
>         {
>           if (errno == EINTR)
>             goto read;
>           err_ret("slurpfile() read() error on file %s", filename);
>           close(fd);
>           return SYNAPSE_FAILURE;
>         }

this code is not relevant, as the goto there is only taken when read() fails with EINTR because a signal interrupted the call (very unlikely). the second conditional after that code is used to continue reading into the buffer after it is resized, if that is possible, and that works fine as shown by your tests.

>       if (read_len == buflen)
>         {
>           if (dynamic) {
>             dynamic += buflen;
>             db = realloc(*buffer, dynamic);
>             *buffer = db;
>             db = *buffer + dynamic - buflen;
>             goto read;
>           } else {
>             --read_len;
>             err_msg("slurpfile() read() buffer overflow on file %s", filename);
>           }
>         }
>
> When I straced the process, the first read() was able to read up to MAX_G_STRING, however, the second read() returns 0. However, if I read a regular file (not in /proc filesystem), it was able to read the rest of the string in the second pass just fine.

this just sounds too strange, but I was able to replicate it after a lot of guessing in a CentOS 5 VM (both 32bit and 64bit), as shown by:

    # strace -e read dd if=/proc/sys/kernel/osrelease bs=16 >/dev/null
    read(0, "2.6.18-164.9.1.e", 16) = 16
    read(0, "", 16) = 0

so this is not a ganglia problem, just a problem with the way you were trying to use slurpfile and the way that specific sysctl handler is implemented in that version of the kernel. it makes sense not to worry about partial reads of a value that is meant to be used whole anyway, but interestingly enough (and as you reported later) it no longer works that way with newer kernels.

> Regarding this particular bug -- how should we fix this? There are currently two issues:
>
> 1) The OS release is truncated in the web frontend

and that is to protect the gmond process against crashes.

> 2) The warning "slurpfile() read() buffer overflow on file /proc/sys/kernel/osrelease" is displayed multiple times during RPM installation (possibly because gmond was called to generate conf files etc.)

that was meant to be mostly informative, but the message might need to be reworked to be more effective.

> Can we potentially increase MAX_G_STRING or have proc_sys_kernel_osrelease buffer size resize dynamically?

no

Carlo
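The two read patterns discussed in this message (one read into a fixed buffer versus reading until read() returns 0) can be illustrated outside of gmond. This is a minimal Python sketch, not the slurpfile() implementation: `slurp_all` and `slurp_fixed` are hypothetical helper names, and the 16-byte size only mirrors the bs=16 used in the strace run above.

```python
import os


def slurp_all(path, chunk=16):
    """Read a whole file by calling read() until it returns 0 bytes (EOF)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        parts = []
        while True:
            # Python retries EINTR on its own (PEP 475), so no goto is needed.
            data = os.read(fd, chunk)
            if not data:          # read() == 0 means end of file
                break
            parts.append(data)
        return b"".join(parts)
    finally:
        os.close(fd)


def slurp_fixed(path, buflen):
    """Do a single read into a fixed-size buffer.

    Returns the data and a flag that is True when the buffer was filled
    completely, which is the condition the overflow warning reports.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.read(fd, buflen)
    finally:
        os.close(fd)
    return data, len(data) == buflen
```

On a regular file the loop in `slurp_all` picks up the remainder on the second pass; whether a given /proc entry does the same depends on how its kernel handler treats a second read, as the strace output shows.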
Re: [Ganglia-developers] rpc/xdr.h missing #include <rpc/types.h> on Mac OS X 10.6.4
On Sat, Jul 31, 2010 at 04:37:37PM -0700, Bernard Li wrote:
> if it looks good I'll check it into trunk:

looks good, but I haven't tested it (don't have Mac OS X anyway). only the following comments:

* probably better (as it will make for a faster configure) to eventually have all AC_CHECK_HEADERS checks together in one single macro call instead of spread around.
* eventually (assuming they are made to work again) the same changes should be applied to xdr{client,server} in tests.

also (since it is related to the rpc support anyway), would anyone object to committing a fixed, pre-generated gm_protocol.h (and friends) instead of relying on the local rpcgen to generate them at build time?

the advantage is that all the hacks around (at least) cygwin's implementation of rpcgen could be removed and all dependencies on rpc/rpc.h would be truly dropped. the disadvantage (mostly philosophical) is that an intermediate file would already have been propagated to each supported platform, so it would need to be tested as valid and working on all of them, without being able to rely on local platform knowledge.

anyone patching gm_protocol.x could just regenerate it anyway, as long as the makefile rule is left there (which could then be a problem for some as well, if they have a clock skew that confuses make and forces recreating those files for no reason), so customizations shouldn't be a concern AFAIK.

Carlo
Re: [Ganglia-developers] Bugzilla #264
On Mon, Jul 12, 2010 at 05:55:26PM -0700, Bernard Li wrote:
> Hi Carlo:
>
> On Fri, Jul 9, 2010 at 1:04 AM, Carlo Marcelo Arenas Belon care...@sajinet.com.pe wrote:
>> bug is invalid, as it is the result of not indicating the right paths to use for the dependencies, if installing them through ports. the proposed fix will only remove the WARNING from configure, which is just a red herring since the code has all dependent headers defined correctly regardless of the results from configure, but if you really want to commit it, it most likely wouldn't do harm either.
>
> Initially I thought configure actually bailed with the warnings, however, checking with the user again this does not appear to be the case.

there are only warnings, and they are mostly just informative with the current implementation.

> So are you saying that if you did ./configure --prefix=/usr/local then the WARNING would not show up?

no, the WARNINGs are not related to which flags are used in configure at all. in order to get a working build, though, ./configure must be told where to find the dependencies (unless they are in /usr, as usually happens in linux), which is why the report that it was failing to build is invalid.

> AFAIK that's what the user did too (even though it was not specified in the bug report), so I just wanted to confirm. If the warning still shows up it might be a good idea to check in the code if it doesn't break anything since less warning is good IMHO.

as I said before, it most likely wouldn't do harm either, so feel free to commit it so that a new bootstrapped snapshot can be tested on all supported platforms.

Carlo
Re: [Ganglia-developers] Install gmond module config files, python modules and config files by default
On Mon, Jun 28, 2010 at 12:17:03PM -0700, Bernard Li wrote:
> On Sat, Jun 26, 2010 at 4:56 AM, Carlo Marcelo Arenas Belon care...@sajinet.com.pe wrote:
>> this would trigger gmond to segfault unless it was linked against libconfuse 2.7 or it also has a default gmond.conf file created.
>
> Actually, gmond would segfault with or without this patch

Right, I forgot the problem was introduced and released with 3.1.7, when the configuration for modpython started being pushed by default (unless built with --disable-python).

> this is because the default configuration has the line:
>
>     include ("$sysconfdir/conf.d/*.conf")
>
> which causes libconfuse to segfault.

not really; it is because the include references that directory AND there are files in that directory that then get imported and inserted into the configuration, making libconfuse segfault. if the directory did not exist or was empty (as it was before), gmond wouldn't crash.

> I was thinking that it does not make sense to include this in the default configuration file, which is used when no configuration file is found. The reason is if no configuration file is found, you would not expect to have other configuration files lying around to be included.

make install is pushing configuration files, and therefore there are other files lying around even if no configuration was created.

> One way to fix this, is to only include this line when we are trying to output this to standard out via `gmond -t`. This way gmond can still function without a configuration file, and the default configuration outputting still works as expected and users won't need libconfuse 2.7 to get gmond working the way it's supposed to be -- what do you think?

interesting, but more of a hack around the problem than a solution. it also has the side effect of changing the way `gmond -t` works and of making the internal configuration invisible, as there would no longer be any way to print it, so it would most likely also need to be documented clearly to avoid surprises.

I agree, though, that it at least removes the segfault by default and is therefore worth considering, even if a similar result could probably be accomplished by not pushing configurations by default (which, as I said, your patch was encouraging instead).

>> installing the example modules by default might not be a good idea, as they are just generating bogus metrics anyway.
>
> Agreed, but they are installed, but not turned on (note the pyconf.off extension).

but they are example modules (AKA meant for reading and learning from, not for running in a production setup), and I would rather see them pushed by a packager into /usr/share/doc/ganglia/examples or something similar than into the place where all the other real modules are deployed.

>> also, as you pointed out since these modules are linux specific and only needed on some setups they were intentionally not included in the default install as they are generally pulled as needed by the packager/sysadmins that are interested on them anyway.
>
> How about a new target like `make install_gmond_modules`?

probably more of a `make install_extra_linux_modules`, which would then also pull the needed configurations and ensure that gmond can include and enable them all without crashing.

> The reason why I decided to do this is because users who build from source may not know about these modules, or know where they are supposed to go. So I thought it would be nice to make it easier for them to discover this.

this seems like something that would be better corrected through documentation. don't forget that packagers (and sysadmins) are already pulling whatever they find useful through their packages, so this change will conflict downstream with their setups, while not providing the information that interested sysadmins would otherwise need to make their own educated decisions.

Carlo
Re: [Ganglia-developers] Manpages in mans/ directory
On Wed, Jun 23, 2010 at 03:23:52PM -0700, Bernard Li wrote:
> I'm trying to get some manpage related bug fixes in and was wondering if someone could tell me how the manpages in the mans/ directory of our source tree are generated.

after the binaries are built, pipe them to help2man, then some sed to replace the ` if I recall correctly.

> It looks like it's a combination of generation via help2man and manually adding some sections (like AUTHOR, BUGS, COPYRIGHT).

help2man -i can be used for adding those extra sections; I never bothered creating a Makefile, though, because the AUTHORS and COPYING files which would have been the sources for it are not really well maintained (BUGS was added and used as a source, for example).

Carlo
Re: [Ganglia-developers] Install gmond module config files, python modules and config files by default
On Fri, Jun 25, 2010 at 01:26:37PM -0700, Bernard Li wrote:
> The following patch will install the gmond module (including python) config files to the sysconfdir

this would trigger gmond to segfault unless it was linked against libconfuse 2.7 or it also has a default gmond.conf file created.

> as well as the python modules to moduledir when `make install` is executed:

installing the example modules by default might not be a good idea, as they are just generating bogus metrics anyway.

also, as you pointed out, since these modules are linux specific and only needed on some setups, they were intentionally not included in the default install, as they are generally pulled as needed by the packagers/sysadmins that are interested in them anyway.

Carlo
Re: [Ganglia-developers] bootstrapping for 3.1.X series and 3.2.X
On Tue, Jan 05, 2010 at 02:42:28PM +0000, Daniel Pocock wrote:
> Carlo Marcelo Arenas Belon wrote:
>> On Mon, Dec 28, 2009 at 10:51:51PM +0000, Daniel Pocock wrote:
>>> Carlo Marcelo Arenas Belon wrote:
>>>> On Sun, Dec 06, 2009 at 09:28:04AM +0000, Daniel Pocock wrote:
>>>>> Ok, but if it is not locked down, let's consider some of the following:
>>>>>
>>>>> - document the version we expect
>>>>
>>>> agree, and that is what README.SVN is for, but first we have to decide which version to expect to begin with.

I guess that if we are going to use lenny, then the defaults for that distribution should be documented as a prerequisite here, since anything older won't be tested anyway?

    automake: 1.10 (1.10.1)
    autoconf: 2.61
    libtool: 1.5 (1.5.26)

while the last official release (3.1.2) used instead:

    automake: 1.9 (1.9.6)
    autoconf: 2.59
    libtool: 1.5 (1.5.22)

and the de-facto standard (CentOS 4) used instead:

    automake: 1.9 (1.9.2)
    autoconf: 2.59
    libtool: 1.5 (1.5.6)

>>>>> - maybe add some check to configure that warns if a different version of autotools is detected?
>>>>
>>>> configure doesn't depend on autotools, so that would be the wrong place to put any checks, but configure.in does, and that is where bootstrapping should be aborted using AC_PREREQ and friends if the wrong versions are used.
>>>
>>> Ok, should we use AC_PREREQ for 3.1.6, are there any disadvantages?
>>
>> only if the macros will definitely break with an older version of autoconf, as otherwise all we are doing is enforcing a recommendation and denying anyone who might not have access to the newest version of autotools the possibility of getting their own bootstrap (not much of an issue if we also provide regular snapshots, though).
>
> I've had a quick search for information on this - it appears that adding AC_PREREQ(2.61) would cause bootstrapping to fail on any older or newer version - only 2.61 would be supported.

it will fail if any older version is used, but it also works for newer versions.

> I think this is the right way to go, as it will prevent us from running in to the same issues again, and it will hopefully prevent people building trunk with a different version of autotools and creating bugs that no one else can re-produce.

not really, all we are doing is preventing a developer from generating their own bootstrap of ganglia if they only have access to an autoconf older than 2.61, even if it is very likely that 2.53 or older is all that is really needed.

if we are not providing periodic snapshots, that developer won't be able to do any of the work he wanted to do unless he upgrades his tools locally or gets hold of another system where he can bootstrap (most likely by installing debian 5 somewhere), so we just made his life more difficult and probably even discouraged him from scratching that itch.

bootstrapping doesn't mean releasing, so I would expect release managers to use the versions or environment we know works, but that is something that can be done better through process and documentation than through code, which is why I suggest r2174 gets reverted.

>>> I think Debian 5.0 (lenny) is the final decision then
>>
>> Debian 5.0 (lenny) x86 (32-bit) right?
>
> I'm using lenny amd64 (64 bit) most of the time now, especially since the various browser plugins (e.g. Java) now support 64 bit Linux.

the problems with the bootstrap of 3.1.2 might have been caused by using a 64 bit bootstrap (as that was never seen when using CentOS 4 x86), but if Debian 5 doesn't have that problem (we would have to confirm that the packages generated on x86 and amd64 are identical) then saying Debian 5.0 (lenny) should be enough to describe the suggested bootstrap environment.

>> any final objections/comments?
>>
>> the only one I can think of is that we sometimes used to provide RPMs with the releases, but that would be IMHO not that important considering that fedora/EPEL might be the package most people would use anyway, and at least for fedora that used to be released fairly quickly after the source package was posted on our site, as the fedora packagers are also actively involved in the list.
>
> Providing RPMs is probably much less important than having a stable bootstrap environment

agree

> However, it would be good for packaging activities to continue, and I can't see why we can't script the release process so that it invokes the rpmbuild commands on a Fedora box over ssh.

then you are going to need either 2 public resources for all release managers to use consistently, or a coordinated release process where the package is generated and the binary packages are then added to it independently before the announcement (which also means we have to agree on what is going to be used for building those RPM packages).

>>> Should we
>>>
>>> a) after fixing the other showstopper (fork issue), do we tag 3.1.6 and let people test a tarball from Debian 5 autotools?, or
>>>
>>> b) make another 3.1.5 tarball
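The AC_PREREQ point made in this message (an older autoconf fails, the same or a newer one is accepted) comes down to a "minimum version" comparison. A small Python sketch of that comparison follows; `parse_version` and `prereq_ok` are hypothetical names used only for illustration, and the real check is done by autoconf itself at bootstrap time.

```python
def parse_version(text):
    """Turn '2.61' into (2, 61) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def prereq_ok(found, minimum):
    """Minimum-version check: same or newer passes, older fails."""
    return parse_version(found) >= parse_version(minimum)


# With a minimum of 2.61 (the lenny default quoted in the thread):
for candidate in ("2.59", "2.61", "2.63"):
    verdict = "ok" if prereq_ok(candidate, "2.61") else "too old"
    print(candidate, verdict)
```

Comparing the parts as integers rather than as strings matters here, since a plain string comparison would rank "2.9" above "2.61".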
Re: [Ganglia-developers] PATCH : Adding trends to Ganglia
On Tue, Jan 05, 2010 at 10:46:34AM +0100, Sebastien Termeau wrote:
> On Mon, Jan 4, 2010 at 10:03 AM, Carlo Marcelo Arenas Belon care...@sajinet.com.pe wrote:
>> On Tue, Dec 29, 2009 at 02:49:28PM +0100, Sebastien Termeau wrote:
>>> OK, I will provide you with two new patches that include those remarks.
>>
>> BUG249 (the one about using tables for formatting of the host view) is IMHO already closed, unless you really meant it (as I expected and asked before, but got no confirmation) to be an enhancement that would be released with some 3.1 version (most likely 3.1.7). if that is the case, please update the target on the bugs, or if you can't do that let me know and I will do so and track the corresponding backport for the release.
>
> I agree. How do I change the target version? It is the version number in the bug description?

no, it is the Version field in the details section, which now says Trunk and should say 3.1.x instead; if you have problems changing that let me know and I'll do the honors.

I assume 3.1.7 would be OK since we are almost in code freeze for 3.1.6; I will then prepare a backport patch that can be pulled manually, if you don't file one yourself (which you are probably already using anyway for your local package).

>> BUG250 will need an updated patch that can be applied cleanly to trunk so that it can be tested/enhanced further.
>
> I just submitted a new version of the patch. This one can be cleanly applied to trunk. I slightly modified the order in which things are done in graph.php in order to calculate the 'start' and 'end' values before calling the metric.php script.

cool, I will check it and commit it to trunk if it is working, but I suspect someone with a better clue about UI design might have more to say about it than I have before it can get into 3.1.

>>> I was also thinking of adding a third one with minimum, maximum and average. Do you think it might be interesting to have this graph also?
>>
>> AFAIK, those values are already in the metric graphs as numeric values, and the MAX is also graphed with a red line, is that what you were looking to add?
>
> Yes it is. I was thinking that maybe the normal graphs should not come with this max line.

OK, I think this was done before, where the red line was actually a per cluster max or the real max (like 100% in a percentage metric). there were also some patches flying around to add those values to the Y axis, which I am not sure got committed but which would be complementary to your idea.

> And instead, we can provide a new 'trend graph' with MIN, MAX and AVG drawn as lines.

in that case it would probably make more sense to have a checkbox in the bar to toggle trending ON/OFF, so that all graphs in the host view show either the normal or the trending graph, instead of having a link for each graph.

Carlo
Re: [Ganglia-developers] template-based metric definition with PCRE
On Mon, Dec 28, 2009 at 08:47:35PM +0000, Daniel Pocock wrote:
> Jesse Becker wrote:
>> On Sat, Nov 28, 2009 at 08:42, Daniel Pocock dan...@pocock.com.au wrote:
>>> For those following trunk, you may need to bootstrap again, and make sure you have pcre available.
>>>
>>> I've linked gmond with libpcre so that it can dynamically match the metric names
>>>
>>> E.g., for the multicpu module, this is the only metric definition that needs to be given to enable all metrics on all cores:
>>>
>>>     metric {
>>>       name_match = "multicpu_([a-z]+)([0-9]+)"
>>>       value_threshold = 1.0
>>>       title = "CPU-\\2 \\1"
>>>     }
>>
>> Oh, that's cool. +1 for me.
>
> I've backported to 3.1,

that was a bad idea IMHO, not because the implementation is bad, but because 3.1.3^H4^H5^H6 has been delayed long enough that adding anything else to it this late, and therefore resetting the testing cycle, would be unwise; especially considering there are other fairly significant fixes/features also waiting for backport.

there is also the fact that there was a valid comment (sorta, even if no code was ever produced) on how this functionality should be made optional (just like python is), and that was neither discussed further (except in this email, after it was committed) nor corrected.

lastly, this code makes using multicpu so easy that it will be fairly obvious the module never worked fine to begin with, so it would make more sense to also backport the needed fixes in r2116 (still incomplete), and maybe even the configuration cleanup patches in r2118, which are also somewhat related. we should also consider better ways to protect users of platforms other than Linux and Cygwin from shooting themselves in the foot by trying to get that module loaded, which is an even bigger issue.

> $ svn log -r2160
> r2160 | d_pocock | 2009-12-28 20:43:54 +0000 (Mon, 28 Dec 2009) | 1 line
>
> Patch for PCRE support (backport r2112 and r2119)

you are also missing r2150 and r2156, plus some not yet existent patches so that the dependency is also in the RPM SPEC and documented in the configuration man page and other needed places. I would suggest instead reverting this backport for now.

>>> I'd be interested in any feedback on the PCRE dependency. If necessary, the feature can be made into a compile time option so that gmond can build without it.
>>
>> Yes, an optional compile time option is the way to do this. Use it if present, but continue on without it if not present.
>
> Is PCRE not available on any platform that we want to support for 3.1?

most likely available everywhere (just like python), but since not having it would most likely only mean that the corresponding configuration can't be used, it really makes sense for it to be considered optional.

> If not, then I'll leave the patch as it is, too many #ifdefs can make the code look messy. The current implementation tries default locations for pcre, or lets you specify your own version:
>
>     ./configure --with-libpcre=/opt/pcre

ideally all that should be needed is to also have an --enable-pcre or equivalent flag to control whether support for this is disabled at compile time, just like it is possible for python (and that has proven to be really useful for Solaris users AFAIK).

being able to then use autoconf-like #defines to enable a dummy implementation of the missing functionality should be all that is needed, and shouldn't make the code that ugly (unless it needs refactoring anyway), but I understand if you are instead looking to get the feature initially released without having this as a possibility.

Carlo
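How the template-based definition quoted in this thread maps one regular expression onto many metric titles can be sketched with Python's `re` module. This is only an illustration, not gmond's code: the metric names are made-up examples, `expand_title` is a hypothetical helper, and matching against the whole name (rather than a substring) is an assumption.

```python
import re

# Pattern and title template taken from the quoted metric { } block.
# "\\2 \\1" in the config file is written here as a raw string.
NAME_MATCH = re.compile(r"multicpu_([a-z]+)([0-9]+)")
TITLE_TEMPLATE = r"CPU-\2 \1"


def expand_title(metric_name):
    """Return the expanded title for a matching metric name, else None."""
    match = NAME_MATCH.fullmatch(metric_name)   # assumption: whole-name match
    if match is None:
        return None
    # \1 is the counter name, \2 is the CPU index.
    return match.expand(TITLE_TEMPLATE)


for name in ("multicpu_user0", "multicpu_idle15", "load_one"):
    print(name, "->", expand_title(name))
```

A single definition therefore covers every counter on every core, which is also why a build without PCRE would only lose this style of configuration rather than the module itself.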
Re: [Ganglia-developers] bootstrapping for 3.1.X series and 3.2.X
On Mon, Dec 28, 2009 at 10:51:51PM +0000, Daniel Pocock wrote:
> Carlo Marcelo Arenas Belon wrote:
>> On Sun, Dec 06, 2009 at 09:28:04AM +0000, Daniel Pocock wrote:
>>> Carlo Marcelo Arenas Belon wrote:
>>>> On Wed, Nov 25, 2009 at 11:00:21AM +0000, Daniel Pocock wrote:
>>>>> b) should the choice of bootstrap environment be locked for all 3.1.X, and only changed when increasing the minor version number (e.g. when we go from 3.1 to 3.2)?
>>>>
>>>> no, but since our build system is full of hacks and not completely reliable it might be a good idea to test that no issues are introduced when looking at a new version.
>>>
>>> Ok, but if it is not locked down, let's consider some of the following:
>>>
>>> - document the version we expect
>>
>> agree, and that is what README.SVN is for, but first we have to decide which version to expect to begin with.
>>
>>> - maybe add some check to configure that warns if a different version of autotools is detected?
>>
>> configure doesn't depend on autotools, so that would be the wrong place to put any checks, but configure.in does, and that is where bootstrapping should be aborted using AC_PREREQ and friends if the wrong versions are used.
>
> Ok, should we use AC_PREREQ for 3.1.6, are there any disadvantages?

only if the macros will definitely break with an older version of autoconf, as otherwise all we are doing is enforcing a recommendation and denying anyone who might not have access to the newest version of autotools the possibility of getting their own bootstrap (not much of an issue if we also provide regular snapshots, though).

>>>>> d) Can anyone volunteer to provide a stable bootstrap environment (e.g. a virtual server) just for Ganglia? Two such environments may be needed, one for trunk and one for the current release branch.
>>>>
>>>> Matt did offer an EC2 instance if we could agree on an OS version:
>>>>
>>>> http://www.mail-archive.com/ganglia-developers@lists.sourceforge.net/msg05271.html
>>>>
>>>> I suggested Debian 5.0 (more conservative) or Fedora 12 (to be updated more frequently), but as long as it is agreed, documented and reproducible anything should work.
>>>
>>> I prefer Debian 5.0 (lenny), that is what I have on my laptop, home PC and various other infrastructure that I use. Elsewhere I am using RHEL3/4/5.
>>
>> Debian 5.0 is also what is being used for bugzilla AFAIK, so that might be a good option for consolidation.
>
> Who controls access to the Bugzilla server? I wouldn't mind having use of that as a bootstrap environment.

Matt would know, but I suspect that shell access will probably be difficult to get, so unless we are talking about some continuous build system like cruisecontrol or hudson making snapshots, it might be problematic.

to ease using one of those systems r2144 (still incomplete) was committed, but it would be nice to know which direction we are going anyway, and for now it seems there is not much dialogue going on about the alternatives.

>>> We also have access to the OpenCSW build farm, and they are willing to consider applications for access by Ganglia developers, so we could look at that as a bootstrap environment.
>>
>> Bootstrapping is done only once per package, so it wouldn't make sense to also do bootstrapping in Solaris.
>
> No, I wasn't suggesting we bootstrap separately for Solaris. I was just suggesting that we use the OpenCSW machine to bootstrap for all platforms. However, we would be stuck with whatever version of autotools is current in the OpenCSW environment, and any decision to change the version there would be out of our control.
>
> I think Debian 5.0 (lenny) is the final decision then

Debian 5.0 (lenny) x86 (32-bit) right?

any final objections/comments?

the only one I can think of is that we sometimes used to provide RPMs with the releases, but that would be IMHO not that important considering that fedora/EPEL might be the package most people would use anyway, and at least for fedora that used to be released fairly quickly after the source package was posted on our site, as the fedora packagers are also actively involved in the list.

debian/ubuntu is usually also well represented, and that shouldn't be an issue for releases in debian 5 anyway.

> Should we
>
> a) after fixing the other showstopper (fork issue), do we tag 3.1.6 and let people test a tarball from Debian 5 autotools?, or
>
> b) make another 3.1.5 tarball using Debian 5 autotools, and put it in a separate location for people to test before we tag?

Using debian for this release will break Solaris (I have a fix ready but not yet backported) and also AIX (which Michael is maintaining outside our tree, with patches generated based on the bootstrapping used for 3.1.2):

http://www.perzl.org/ganglia/

As I said in the STATUS file for 3.1, it would be better IMHO to delay this decision until 3.1.7 (which hopefully would also include support for AIX
Re: [Ganglia-developers] [RFC] two step gmond initialization
On Mon, Dec 28, 2009 at 11:05:36PM +0000, Daniel Pocock wrote:
> Carlo Marcelo Arenas Belon wrote:
>> On Fri, Dec 18, 2009 at 04:18:16PM +0000, Daniel Pocock wrote:
>>> Carlo Marcelo Arenas Belon wrote:
>>>> On Sun, Dec 13, 2009 at 10:49:00AM +0000, Daniel Pocock wrote:
>>>>> I could accept Brooks' solution, because it means gmond would only fail for something like out-of-memory, while any configuration failure, port in use, etc would cause it to fail before detaching.
>>>>
>>>> If gmond still fails silently in some cases, you have not accomplished the objective that you were trying to obtain with r2025 anyway.
>>>
>>> I agree - it doesn't completely meet my goal, but it does at least result in an error code for most types of bad configuration (or port in use)
>>
>> that part is OK, but you still have the added side effects of r2025, which would affect gmond in other interesting ways:
>>
>> * the metric (and module) initialization is now done by the parent and expected to be inherited by the child; this means for example that the parent will send (and receive) metric information (even before forking)
>> * the suid is done by the parent and therefore the child isn't privileged (while the metric initialization was done as root); this would at least prevent anyone from binding gmond to privileged ports, but could also result in complicated permission issues for metric collection scripts.
>>
>> as I said before, I think the apr_poll issue with BSD should be taken as a warning of how the changes we were planning to do could have unintended side effects, and since moving the daemonization was only one way to solve the original problem, it makes more sense to instead revert this change and evaluate alternatives.
>
> It is this line of argument, rather than the concerns about APR, that makes me think reverting the change completely might be the way to go for now, although the reason for the change is still a legitimate issue and can be tracked in bugzilla.

agree, and I have to admit I am surprised that this (which was my main argument) somehow wasn't made clear until now. indeed, the proposed alternative implementation of a fix was published precisely because I agree that this issue is a legitimate bug (even if there might not be a bugzilla for it) which needed to be corrected anyway.

> Maybe this type of disruptive change will have to come in 3.2, there we can look at the various phases of initialisation more closely, prompt people to review their modules, etc.

I was looking forward to 3.2 being the windows native version, so if the problem with the initialization is solved in a windows incompatible way then we are going to be left with no other option than to do this disruptive change there anyway.

Carlo
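The goal discussed in this thread (report "port in use" and similar failures with an exit code before the daemon detaches) can be sketched as a preflight step. This is a minimal Python illustration under stated assumptions, not gmond's startup code: `preflight_bind`, `start` and the `detach` callback are hypothetical names, and the sketch deliberately leaves privilege dropping and metric initialization, the side effects debated above, out of scope.

```python
import socket


def preflight_bind(port, host="127.0.0.1"):
    """Try to bind the UDP port while still attached to the caller."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        sock.close()
        return None, "cannot bind %s:%d: %s" % (host, port, exc)
    return sock, None


def start(port, detach):
    """Run the checks first; only call detach() when they succeeded.

    detach is a hypothetical callback standing in for the fork/setsid
    step and the main loop; it receives the already bound socket.
    """
    sock, err = preflight_bind(port)
    if err is not None:
        return 1      # non-zero status is visible to the init script
    detach(sock)
    return 0
```

Anything that can only fail after `detach` has run (out of memory, for example) would still fail silently, which is the limitation acknowledged in the quoted exchange.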
Re: [Ganglia-developers] PATCH : Adding trends to Ganglia
On Mon, Dec 28, 2009 at 01:22:47PM +0100, Sebastien Termeau wrote:
> On Wed, Dec 16, 2009 at 1:18 PM, Carlo Marcelo Arenas Belon care...@sajinet.com.pe wrote:
>> On Tue, Dec 15, 2009 at 02:32:07PM +0100, Sebastien Termeau wrote:
>>> Dear Ganglia Developers,
>>>
>>> Please find below a patch that brings trends to Ganglia.
>>
>> Really interesting, would you mind filing an enhancement bug on www.ganglia.info? that would also be a great place for attaching those images you said were also needed.
>
> Just to inform you that I have submitted 2 enhancement requests:

I would assume you wanted them eventually included as part of some 3.1 release (most likely 3.1.7)? if so, it would be better to adjust the target.

> http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=249

committed in r2161, but as explained in bugzilla the use of <TR></TR> is invalid HTML4, so unless we rely on browsers doing the right thing (which seems to happen at least on Firefox 3.5) it will need to be patched to do something a little more complicated, or something as ugly as the following hack (which again relies on browsers doing the Right Thing (tm) while rendering):

    --- web/templates/default/host_view.tpl 2009-12-29 02:30:30.0 -0800
    +++ web/templates/default/host_view.tpl 2009-12-29 02:56:05.0 -0800
    @@ -129,7 +129,7 @@
     </A></TD>
     {new_row}
     <!-- END BLOCK : vol_metric_info -->
    -</TR>
    +<TD></TD></TR>
     </TABLE>
     </DIV>
     <!-- END BLOCK : vol_group_info -->

> http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=250

the patch provided is broken, as shown by:

    patch:  malformed patch at line 104: diff -ur ganglia_trunk_tables/host_view.php ganglia/host_view.php

but after massaging it slightly I still have the following comments:

* with_trends should be a bool instead (true/false)
* probably, to avoid surprises, it should be false in 3.1 when backported
* the same TR abuse issue from BUG249 applies here
* the use of magic constants for the prediction trends should be explained to allow for customization or not made configurable at all.
* the hardcoded "double the currently selected range" should be made flexible somehow, or at least explained in the graph to avoid surprises.
* I am using rrdtool 1.3.8 so I only got projection, but the use of the icons for it seemed strange and not particularly good looking

looking forward to a newer version that can be more broadly tested, though, as the feature is definitely interesting.

Carlo
Re: [Ganglia-developers] gmetad and rrdtool scalability
On Thu, Dec 24, 2009 at 12:10:51PM +, Daniel Pocock wrote: Vladimir Vuksan wrote: The issue is value of this data. If these were financial transactions than no loss would be acceptable however these are not. They are performance, trending data which get averaged down as time goes by so loss of couple hours or even days of data is not tragic. I agree - it doesn't have to be perfect. still the current implementation has ways to go and should be most likely expanded for more data reliability as far as it doesn't cost to much. To come back to my own requirement though, it is about horizontal scalability. Let's say you have a hypothetical big enterprise that has just decided to adopt Ganglia as a universal solution on every node in every data center globally, including subsidiary companies, etc. No one really wants to manually map individual servers to clusters and gmetad servers. They want plug-and-play. the currently federated model of gmetad helps slightly in that respect as you would expect each one of the independent offices/units/datacenters would have 1 gmetad locally (as far as it is big enough to handle the load) to collect and aggregate data and 1 central gmetad that connects to all the leaves for the centralized view. of course you can also have more than 1 gmetad (even 1 per cluster per location) and make the gmetad hierarchy tree a little larger. They just want to allocate some storage and gmetad hardware in each main data center, plug them in, and watch the graphs appear. If the CPU or IO load gets too high on some of the gmetad servers in a particular location, they want to re-distribute the load over the others in that location. When the IO load gets too high on all of the gmetads, they want to be able to scale horizontally - add an extra 1 or 2 gmetad servers and see the load distributed between them. horizontal scalability like these would be ideal, but again, the added complexity cost might be difficult to assimilate. 
Maybe this sounds a little bit like a Christmas wish-list, but does anyone else feel that this is a valid requirement? Imagine something even bigger - if a state or national government decided to deploy the gmond agent throughout all their departments in an effort to gather utilization data - would it scale? Would it be easy enough for a diverse range of IT departments to just plug it in? with enough planning and assuming the cluster tree is somehow balanced it should work fine IMHO, but for very large clusters or ones that span multiple locations and can't be split logically (clouds) you would soon run into scalability issues, including as well memory pressure in the gmond collectors. Carlo also made some comments about RDBMS instead of RRD. This raises a few discussion points: I meant RDBMSs alongside RRDs, as RRDs were specially designed to allow for an efficient storage and summarization of metrics which is what is needed most of the time. For special cases where you need to have all data without any distortion for a long time, an ETL process with a RDBMS and some data warehouse is a better fit. The ETL could be as simple as scanning the RRDs periodically and importing the records into a database, but it would be nice if this could be done directly from gmetad by allowing for hooks at RRD write time. This was indeed one of the reasons why the python gmetad in trunk had a modular design, so that a module for doing that could be written if someone had interest in doing so. Carlo
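To make the point about summarization concrete, here is a minimal sketch in C of AVERAGE-style consolidation, the kind of averaging-down that happens to RRD data as it ages. It is only an illustration under stated assumptions, not rrdtool's implementation: the function name `consolidate_average` is hypothetical and the handling of unknown values is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Average every `per_row` consecutive samples of `in` into one value of `out`.
 * This sketch only emits complete rows: a trailing group with fewer than
 * `per_row` samples is dropped.
 * Returns the number of rows written to `out`, which must have room for
 * n / per_row values.
 */
size_t consolidate_average(const double *in, size_t n, size_t per_row, double *out)
{
    size_t rows = 0;
    size_t i, j;

    assert(in != NULL && out != NULL && per_row > 0);
    for (i = 0; i + per_row <= n; i += per_row) {
        double sum = 0.0;
        for (j = 0; j < per_row; j++)
            sum += in[i + j];
        out[rows++] = sum / (double)per_row;
    }
    return rows;
}
```

Once samples have been averaged this way the individual values cannot be recovered, which is the distortion that exporting raw records to a database would avoid.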
Re: [Ganglia-developers] gmetad and rrdtool scalability
On Sun, Dec 20, 2009 at 04:02:36PM +0000, Spike Spiegel wrote:
> On Mon, Dec 14, 2009 at 10:28 AM, Carlo Marcelo Arenas Belon care...@sajinet.com.pe wrote:
> > > b) you can afford to have duplicate storage - if your storage requirements are huge (retaining a lot of historic data or lot's of data at short polling intervals), you may not want to duplicate everything
> >
> > if you are planning to store a lot of historic data then you should be using instead some sort of database, not RRDs and so I think this shouldn't be an issue unless you explode the RRAs and try to abuse the RRDs as a RDBMs
>
> I think there's a middle ground here that'd be interesting to explore, altho that's a different thread, but for kicks this is the gist: the common pattern for rrd storage is hour/day/month/year and I've always found it bogus.

I am sure the defaults provided were completely arbitrary (I think you missed week) but make sense based on the fact that they were the smallest time unit of their kind and also that they fit the standard gmond polling rates which wouldn't accommodate for 1 min or 1 sec.

> In many cases I've needed higher resolution (down to the second) for the last 5-20 minutes, then intervals of an hr to a couple hrs, then a day to three days and then a week to 3 weeks etc etc, which increases your storage requirements, but is imho not an abuse of rrd and still retains the many advantages of rrd over having to maintain a RDBMs.

agree, and the fact that it is not easy enough to do or requires a somehow intrusive maintenance is a bug, but still possible for the reasons you explain.

> > PS. I like the ideas on this thread, don't get me wrong, just that I agree with Vladimir that gmetad and RRDtool are probably not the sweet spot (cost wise) for scalability work even if I also agree that the vertical scalability of gmetad is suboptimal to say the least.
>
> sort of. If you're looking at where your resources go to compute and deal with large amount of data, I agree. If you look at what it costs you or if it's even possible to create a fully scalable and resilient ganglia based monitoring infrastructure, I disagree.

not sure what part are you quoting here, but I have the feeling we probably agree ;)

putting on my ganglia developer hat, I dislike the fact that gmetad can't scale horizontally like all well designed applications should, but the fact that there is no solution for it to do so yet means that the complexity involved in making that change is probably not worth it in most (if not all) cases considering that hardware (to the levels needed most of the time) is cheap anyway, as I really hope there is no one out there running gmetad in some big iron solution, when some decent PC box with enough memory would do mostly fine.

there are problems as well with the way federation currently works which require more network bandwidth and CPU than should really be needed and that I would guess we should tackle first, especially considering the increase of the XML sizes with 3.1 (which also has been worked around too) but for that (putting on my ganglia user hat) would assume most big installations will stick with 3.0 anyway for now.

Carlo
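The storage cost of a finer-grained archive layout, as discussed above, is simple arithmetic. The sketch below works it out in C under stated assumptions: the function names are hypothetical, each archive row is taken to hold one 8-byte double per data source, and file headers and other bookkeeping are ignored, so the result is only a lower bound.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Rows an archive needs in order to cover `span_s` seconds when every row
 * consolidates `steps_per_row` samples taken `step_s` seconds apart.
 */
unsigned long rra_rows(unsigned long span_s, unsigned long step_s,
                       unsigned long steps_per_row)
{
    unsigned long row_s = step_s * steps_per_row;

    assert(row_s > 0);
    return (span_s + row_s - 1) / row_s;    /* round up to whole rows */
}

/* Approximate payload: one double per row and per data source. */
unsigned long rra_payload_bytes(unsigned long rows, unsigned long data_sources)
{
    return rows * data_sources * (unsigned long)sizeof(double);
}
```

For example, keeping 20 minutes at one-second resolution needs `rra_rows(1200, 1, 1)` rows, that is 1200 rows or roughly 9.4 KB per data source, while 3660 seconds of 15-second samples needs only 244 rows.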
Re: [Ganglia-developers] [RFC] two step gmond initialization
On Fri, Dec 18, 2009 at 04:18:16PM +, Daniel Pocock wrote: Carlo Marcelo Arenas Belon wrote: On Sun, Dec 13, 2009 at 10:49:00AM +, Daniel Pocock wrote: I could accept Brooks' solution, because it means gmond would only fail for something like out-of-memory, while any configuration failure, port in use, etc would cause it to fail before detaching. If gmond still fails silently in some cases, you have not accomplished the objective that you were trying to obtain with r2025 anyway. I agree - it doesn't completely meet my goal, but it does at least result in an error code for most types of bad configuration (or port in use) that part is OK, but you still have the added sideeffects of r2025 which would affect gmond in other interesting ways : * the metric (and module) initialization is now done by the parent and expected to be inherited by the child, this means for example that the parent will send (and receive) metric information (even before forking) * the suid is done by the parent and therefore the child isn't privileged (while the metric initialization was done as root), this would at least prevent anyone to bind gmond to privileged ports but also could result in complicated permission issues by metric collection scripts. as I said before I think the apr_poll issue with BSD should be taken as a warning of how the changes we were planning to do could have unintended sideeffects, and since moving the daemonization was only one way to solve the original problem, makes more sense to instead revert this change and evaluate alternatives. and it allows us to continue using apr (which some people have indicated a preference for). 
the solution I proposed doesn't remove the apr dependency, just doesn't use it for this specific case, because it is obvious it doesn't fit for what we need to, and we gain otherwise nothing from it (unless we would have a windows native version of gmond) it was also meant to be a temporary solution and the minimum change needed so that we can have : * 3.1.6 released quickly * the bug you were trying to solve still fixed for 3.1.6 ideally we should be able to make this work through apr in the long run (even if that means fixing apr), or if that is not possible rely on posix itself for getting windows compatibility for this part whenever the time comes to do that. The solution I proposed addresses the problem of reporting to the OS any failure while initialization (which was the original bug to fix anyway) in a straight forward way and is therefore the right way to correct this IMHO, without introducing any regressions by changing long relied upon semantics. Does anyone else have any feelings about this? I think we can choose from: - Carlo's solution (implement apr_proc_detach ourselves, calling process hangs around and uses socket to discover if daemon started successfully) not a socket but a pipe. - Brooks' solution (prepare sockets before detaching, prepare pollsets after detaching) - this allows us to continue using apr_proc_detach and not have native UNIX code this should work fine too (after all was the proposed option 3), but is really a fix for the bug introduced with r2025, instead of a fix to the original bug, hence why I don't really see how we can compare them both side by side. - Revert my change completely this was my suggestion for 3.1.6, so at least we will have a working gmond faster and be able to stabilize (both trunk and 3.1) further. since we haven't done this yet, testing any other changes in both trunk and 3.1 is impossible in BSD, and we had therefore implicitally dropped support for those platforms. 
I would like to make some kind of decision about what goes in 3.1.6 before Christmas, and maybe aim to tag 3.1.6 by 11 January, there is also the possibility that we can try to push it out more quickly, maybe tagging it 24 December and go GA in mid January? timeline will of course depend on the amount of changes involved, I am afraid also there has been almost no dialogue about the other showstoppers for 3.1.6 (like the bootstrapping issue) so there might be additional complications for this (I was indeed preparing some more build fixes to prevent more regressions if the original plan shown of using Fedora 9 with 3.1.5 are still in effect) Carlo
Re: [Ganglia-developers] PATCH : Adding trends to Ganglia
On Tue, Dec 15, 2009 at 02:32:07PM +0100, Sebastien Termeau wrote:
> Dear Ganglia Developers,
> Please find below a patch that brings trends to Ganglia.

Really interesting, would you mind filing an enhancement bug on www.ganglia.info?, that would be also a great place for attaching those images you said were also needed.

> It uses RRD's LSLSLOPE, LSLINT and PREDICT (requires RRD >= 1.4.0 ) to provide two kinds of trends. Trends can be disabled by modifying conf.php ($with_trends).

considering that RRD >= 1.4.0 isn't that common yet would assume is probably better to have this turned off by default for now

> I also modified the host_view template to use tables instead of sending a <BR> after n metrics.

would you mind doing this change in an independent patch that would be applied before the one that adds your feature?, it would be also better if the patch is done against svn trunk as it will be easier to integrate that way but shouldn't be that difficult to correct that either if you don't feel comfortable working with subversion or git (using git svn).

more comments inlined with the code.

> ---
> diff -ur ganglia.ori/conf.php ganglia/conf.php
> --- ganglia.ori/conf.php 2009-12-14 15:19:23.0 +0100
> +++ ganglia/conf.php 2009-12-15 12:05:49.0 +0100
> @@ -64,6 +64,18 @@
>  #
>  $show_meta_snapshot = "yes";
> +#
> +# Show trends icons next to each single metric graph
> +#
> +$with_trends = "yes";

have you tested this as false as well?, what about a version of RRD that doesn't support this?, would recommend having this off by default and adding a comment saying only to turn on if using the right version of rrdtool.

> @@ -140,6 +142,12 @@
>  # Get_context makes start negative.
>  $start = $sourcetime + $start;
>  }
> +
> +# For trends, we double the time range
> +if ($trend_type != ''){
> + $rrdtool_graph['end']=$end + ($end-$start);
> +}
> +
>  # Fix from Phil Radden, but step is not always 15 anymore.
>  if ($range=="month") $rrdtool_graph['end'] = floor($rrdtool_graph['end'] / 672) * 672;

could you elaborate on why this is needed?, other than of course allow for a trend to be visible, should we allow the trend end target to be configurable instead?

> diff -ur ganglia.ori/host_view.php ganglia/host_view.php
> --- ganglia.ori/host_view.php 2009-12-14 15:19:23.0 +0100
> +++ ganglia/host_view.php 2009-12-15 14:16:46.0 +0100
> @@ -161,10 +161,26 @@
>  $tpl->newBlock("vol_metric_info");
>  $tpl->assign("graphargs", $v['graph']);
>  $tpl->assign("alt", "$hostname $name");
> +

huh?

>  if (isset($v['description']))
>  $tpl->assign("desc", $v['description']);
> - if ( !(++$i % $metriccols) )
> -$tpl->assign("br", "<BR>");
> + # PREDICT supported in 1.4.0
> + if ($with_trends == 'yes'){
> + if( version_compare($version[rrdtool], '1.4.5') >= 0) {

1.4.5?

> + $tpl->newBlock("trend_predict");
> + $tpl->assign("graphargs", $v['graph']);
> + $tpl->assign("images","./templates/$template_name/images");
> + }
> + else {
> + $tpl->newBlock("trend");
> + $tpl->assign("graphargs", $v['graph']);
> + $tpl->assign("images","./templates/$template_name/images");
> + }
> + }
> + if ( !(++$i % $metriccols) ){
> + $tpl->gotoBlock ("vol_metric_info");
> +$tpl->assign("new_row", "</TR><TR>");
> + }

who gets the last </TR> added to close the table?

Carlo
Re: [Ganglia-developers] gmetad and rrdtool scalability
On Mon, Dec 14, 2009 at 09:26:01AM +, Daniel Pocock wrote: Vladimir Vuksan wrote: I think you guys are complicating much :-). Can't you simply have multiple gmetads in different sites poll a single gmond. That way if one gmetad fails data is still available and updated on the other gmetads. That is what we used to do. That is a good solution under two conditions: a) you are only concerned with redundancy and not looking for scalability - when I say scalability, I refer to the idea of maybe 3 or more gmetads running in parallel collecting data from huge numbers of agents what is the bottleneck here?, CPUs for polling or IO?, if IO using memory would be most likely all you really need (specially considering RAM is really cheap and RRDs are very small), if CPUs then there might be somethings we can do to help with that, but vertical scalability is what gmetad has, and for that usually means going to a bigger box if you hit the limit on the current one. b) you can afford to have duplicate storage - if your storage requirements are huge (retaining a lot of historic data or lot's of data at short polling intervals), you may not want to duplicate everything if you are planning to store a lot of historic data then you should be using instead some sort of database, not RRDs and so I think this shouldn't be an issue unless you explode the RRAs and try to abuse the RRDs as a RDBMs of course that means you have to add a process to gather your metric data out of the RRDs to begin with and into your RDBMs but there shouldn't be a need to be concerned with RRDs storage size, when you are most likely going to be spending a lot more in that RDBMs storage (including snapshots and mirrors and all those things that make DBAs feel warm inside, regardless of budget) Carlo PS. 
I like the ideas on this thread, don't get me wrong, just that I agree with Vladimir that gmetad and RRDtool are probably not the sweet spot (cost wise) for scalability work even if I also agree that the vertical scalability of gmetad is suboptimal to say the least. -- Return on Information: Google Enterprise Search pays you back Get the facts. http://p.sf.net/sfu/google-dev2dev ___ Ganglia-developers mailing list Ganglia-developers@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-developers
Re: [Ganglia-developers] [RFC] two step gmond initialization
On Sun, Dec 13, 2009 at 10:49:00AM +, Daniel Pocock wrote: Carlo Marcelo Arenas Belon wrote: On Fri, Dec 11, 2009 at 01:31:22PM -0600, Brooks Davis wrote: On Fri, Dec 11, 2009 at 04:56:51PM +, Carlo Marcelo Arenas Belon wrote: I presume the reason why you haven't seen this show up in the APR list, is because it makes probably more sense for the apache httpd list instead for help understanding how apache is able to work around the leakiness of apr_poll and that also requires some reading from apache's code (which I am not at least that familiar with, neither really interested) Looking at the prefork mpm, the pollsets are created and used only in child_main() and thus are created after the fork. I suspect that changing the ganglia code to open all the sockets, but defer creation of the pollset until after fork is the right way to go. That is the way we did the initialization before r2025 so I guess that could explain why we weren't affected just like apache is not. Not quite - pre-r2025, we did this: a) detach b) socket init c) pollset init Post r2025: a) socket init b) pollset init c) detach Brooks' solution: a) socket init b) detach c) pollset init I could accept Brooks' solution, because it means gmond would only fail for something like out-of-memory, while any configuration failure, port in use, etc would cause it to fail before detaching. If gmond still fails silently in some cases, you have not accomplished the objective that you were trying to obtain with r2025 anyway. The solution I proposed addresses the problem of reporting to the OS any failure while initialization (which was the original bug to fix anyway) in a straight forward way and is therefore the right way to correct this IMHO, without introducing any regressions by changing long relied upon semantics. Basically, we would have to split the code in setup_listen_channels_pollset() into two functions, one that gets called before detaching, and one that is called after detaching. 
Why make the code more complicated, and are you really expecting to do that in scope for getting it backported into 3.1.6 considering how intrusive that would be? Also be aware there are bugfixes on that code that hadn't yet been backported and so you are going to either have to certify as well all those fixes or cherry pick the changes needed and test all different combinations. Carlo
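The ordering attributed to Brooks above (socket init, then detach, then pollset init) can be sketched in plain POSIX C. This is an illustration under stated assumptions and not gmond code: `open_udp_listener` and `start_listener` are hypothetical names, a bare fork() stands in for the detach step, and poll() stands in for an APR pollset.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * a) socket init: bind a UDP socket on the loopback address.
 * Port 0 lets the kernel choose; the bound port is stored in *bound
 * when bound is not NULL. Returns the descriptor, or -1 on error.
 */
int open_udp_listener(unsigned short port, unsigned short *bound)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) != 0) {
        close(fd);
        return -1;
    }
    if (bound != NULL)
        *bound = ntohs(addr.sin_port);
    return fd;
}

/* Returns 0 if the child was started, 1 if setup failed before the fork. */
int start_listener(unsigned short port)
{
    pid_t pid;
    int fd = open_udp_listener(port, NULL);      /* a) socket init */

    if (fd < 0)
        return 1;            /* e.g. port in use: still visible to the caller */

    pid = fork();                                /* b) detach, simplified */
    if (pid < 0) {
        close(fd);
        return 1;
    }
    if (pid == 0) {
        /* c) pollset init happens in the child, after the fork */
        struct pollfd set[1];

        set[0].fd = fd;
        set[0].events = POLLIN;
        set[0].revents = 0;
        /* one non-blocking poll stands in for the main loop */
        _exit(poll(set, 1, 0) < 0 ? 1 : 0);
    }
    /* Failures in the child from here on are not reported to the caller. */
    close(fd);
    waitpid(pid, NULL, 0);   /* only because the sketch child exits at once */
    return 0;
}
```

With this ordering a bind failure is reported before the fork, but anything that fails afterwards in the child stays silent, which is the limitation raised in the reply.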
Re: [Ganglia-developers] [RFC] two step gmond initialization
On Fri, Dec 11, 2009 at 01:31:22PM -0600, Brooks Davis wrote:
> On Fri, Dec 11, 2009 at 04:56:51PM +0000, Carlo Marcelo Arenas Belon wrote:
> > I presume the reason why you haven't seen this show up in the APR list, is because it makes probably more sense for the apache httpd list instead for help understanding how apache is able to work around the leakiness of apr_poll and that also requires some reading from apache's code (which I am not at least that familiar with, neither really interested)
>
> Looking at the prefork mpm, the pollsets are created and used only in child_main() and thus are created after the fork. I suspect that changing the ganglia code to open all the sockets, but defer creation of the pollset until after fork is the right way to go.

That is the way we did the initialization before r2025 so I guess that could explain why we weren't affected just like apache is not.

On the other hand though that change was introduced to force gmond to report to its parent in case there were problems creating those resources and that would be silently ignored otherwise, and I guess apache either has that bug as well or simply has a better way to code that notification just like the one that was proposed originally in this thread.

Carlo
Re: [Ganglia-developers] [RFC] two step gmond initialization
On Fri, Dec 11, 2009 at 09:40:53AM -0800, Bernard Li wrote:
> Wow... what a long thread...

Sorry about that boss, but also sent an executive summary in :

http://www.mail-archive.com/ganglia-developers@lists.sourceforge.net/msg05398.html

Hope you don't mind reading that one instead, or my small enough for a tweet comment on the STATUS page for 3.1

> IMHO, the best solution here is to look at apache's main loop implementation and adapt our code. This way, (hopefully) we will get what we want (late initialization) without modifying any apr code. Carlo, since you seem to be on a roll here, could you please kindly:

I think I'd made my point of view clear, including alternatives and a proof of concept implementation of my preferred resolution which still keeps a fix for the problem this regression was meant to help with.

If you are interested in implementing alternative 3 (if that is even possible) or 4 feel free to do so, but considering this is a showstopper for 3.1.6 and that we (would assume) want to get that released without regressions and ASAP then would recommend instead we focus for now on a fix for this regression which will mean either :

1) revert the code and delay a solution for the child notification issues
2) implement the child notification using something we know works like the code proposed (at least for now) instead of reordering the initialization and breaking all the BSD (and who knows what else) with that.

Carlo
Re: [Ganglia-developers] [RFC] two step gmond initialization
On Fri, Dec 11, 2009 at 08:59:56AM -0700, Brad Nicholes wrote: APR is designed to solve these problems in a cross platform way and we are proposing that we abandon the cross platform solution in favor of a platform specific solution. Just want to clarify here that it is not a platform specific solution as much as one based on fairly common UNIX standards and therefore supports most likely every single of the platforms we run on including cygwin. In any case to simplify testing and probing me wrong a snapshot from trunk with the patch included is available from : http://sajino.sajinet.com.pe/ganglia/ganglia-3.2.0.0.tar.gz There is yet no native windows ganglia (even if some work has been done already to have a native metric version of libmetrics at least on trunk), neither a novell network version (and eventhough I would be interested on at least adding it to libmetrics lack the access needed to a development environment which will allow one to exist most likely as noticed by the lack of interest on it, otherwise) and so most of the portability of APR (which in this case is just a small wrapper around fork) is not helping much yet. I know that httpd doesn't have these issues and they detach and run just fine across a wide variety of platforms including windows, BSD, solaris, etc. right, and that is why alternative 3 in : http://www.mail-archive.com/ganglia-developers@lists.sourceforge.net/msg05398.html says, look at the apache httpd implementation of their main loop and make ganglia's similar so it will work with APR AS-IS. Why are we having these problems when httpd doesn't? Is the real solution as simple as going to the APR mailing list and asking why this issue exists in APR and if there is a workaround? I haven't really seen this issue show up on the APR mailing list so far or did I miss it? 
it is obvious, as you explained before, that apache uses APR in a different way than ganglia and that is why there is no bug to fix here in APR (except the fact that the implementation of apr_poll is leaky as it is inconsistent between platforms), if there is a bug I would say it is in the BSD implementation of kqueue with its non inheritable file handles. I presume the reason why you haven't seen this show up in the APR list, is because it makes probably more sense for the apache httpd list instead for help understanding how apache is able to work around the leakiness of apr_poll and that also requires some reading from apache's code (which I am not at least that familiar with, neither really interested) One of the problems that we already have with gmond is that there is already too much platform specific code in it which is why we have to rely on cygwin in order to run on windows. ganglia is a monitoring application, and therefore it is very likely to have to work with very platform specific stuff anyway (unlike apache), I agree though that using cygwin for windows is not ideal and I hope it will be deprecated sometime when a native windows version would be available, but APR so far hasn't help much in that direction AFAIK. It is also the reason why gmetad doesn't really run on windows because it wasn't built on top of a cross platform solution. My gut feel is that we should be moving ganglia more towards APR rather than away from it. gmetad could be made to run on windows, and from time to time some pure soul succeeds and then realizes why it was still reported as dont even try. AFAIK the original reason behind having and alternative python implementation of gmetad was to have to avoid having to go through the pain of cleaning that code for portability, noting that it mostly works almost reliably in linux at best, still I agree APR would be most likely part of a portable gmetad if needed. 
Carlo
Re: [Ganglia-developers] [RFC] two step gmond initialization
Greetings,

in case it wasn't obvious, and to celebrate the 1 week anniversary for this email, RFC means Request for Comments, and so if you have any about the code (which I even sent with an obvious bug to encourage the usual bikeshedding) or design, a reply on it (better if to the original email so it can be referenced as part of the discussion for context) would be appreciated; especially considering this is a showstopper for 3.1.3^H4^H5^H6. (made a little more obvious with r2149)

in a nutshell the solution proposed does :

1) get rid of the apr_proc_detach dependency which is useless anyway when all it does is to daemonize the process and we even have an implementation for that in our code that is now only used by gmetad.
2) implement the forking / IPC using plain standard unix calls instead for portability.
3) create a variable that will be used in all error paths to indicate initialization failure and communicate that from child to parent through a pipe so that the parent can report failure to the OS if needed.
4) patch all error paths to use the new semantics.

the alternatives will be in order of preference :

1) revert the current implementation and delay a solution.
2) drop this feature and maybe hack it with some init script logic
3) reimplement gmond to be more apache like so that APR magically works
4) implement it using APR after fixing APR first if possible
5) ignore the problem and tell BSD users to run gmond in the foreground and deal with it.

Carlo
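Steps 2) and 3) of the proposed solution can be sketched with plain POSIX calls. The code below is a minimal illustration under stated assumptions and is not the patch itself: `start_reporting_child`, `init_ok` and `init_fail` are hypothetical names, and the child exits right after reporting where a real daemon would continue into its main loop.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical initializers standing in for socket and module setup.
 * Each returns 0 on success and non-zero on failure. */
int init_ok(void)   { return 0; }
int init_fail(void) { return 1; }

/*
 * Fork a child, let it run init(), and have it report the result to the
 * parent through a pipe. Returns the status reported by the child, or 1
 * if the child could not be created or ended before reporting anything.
 */
int start_reporting_child(int (*init)(void))
{
    int pipefd[2];
    int status = 1;                  /* assume failure until told otherwise */
    pid_t pid;

    assert(init != NULL);
    if (pipe(pipefd) != 0)
        return 1;

    pid = fork();
    if (pid < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        return 1;
    }

    if (pid == 0) {                  /* child: the future daemon */
        int rc;

        close(pipefd[0]);
        setsid();                    /* leave the parent's session */
        rc = init();
        /* the pipe carries the bytes of rc, hence the address-of */
        if (write(pipefd[1], &rc, sizeof(rc)) != (ssize_t)sizeof(rc))
            _exit(2);
        close(pipefd[1]);
        /* a real daemon would enter its main loop here when rc == 0 */
        _exit(rc);
    }

    /* parent: block until the child reports, then pass the status on */
    close(pipefd[1]);
    if (read(pipefd[0], &status, sizeof(status)) != (ssize_t)sizeof(status))
        status = 1;                  /* EOF or short read: child ended early */
    close(pipefd[0]);
    waitpid(pid, NULL, 0);           /* only because the sketch child exits */
    return status;
}
```

The parent can then use the returned value as its own exit code, so an init script sees a non-zero status whenever initialization fails in the child.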
Re: [Ganglia-developers] bootstrapping for 3.1.X series and 3.2.X
On Sun, Dec 06, 2009 at 09:28:04AM +0000, Daniel Pocock wrote:
> Carlo Marcelo Arenas Belon wrote:
>> On Wed, Nov 25, 2009 at 11:00:21AM +0000, Daniel Pocock wrote:
>>> b) should the choice of bootstrap environment be locked for all 3.1.X,
>>> and only changed when increasing the minor version number (e.g. when
>>> we go from 3.1 to 3.2)?
>>
>> no, but since our build system is full of hacks and not completely
>> reliable it might be a good idea to test that no issues are introduced
>> when looking at a new version.
>
> Ok, but if it is not locked down, let's consider some of the following:
>
> - document the version we expect

agree, and that is what README.SVN is for, but first we have to decide which version to expect to begin with.

> - maybe add some check to configure that warns if a different version
>   of autotools is detected?

configure doesn't depend on autotools, so that would be the wrong place to put any checks, but configure.in does, and that is where bootstrapping should be aborted, using AC_PREREQ and friends, if the wrong versions are used.

>>> c) what environment should be used to bootstrap 3.2.X/trunk?
>>
>> the same as 3.1 so that all improvements in the build system will be
>> tested there and then backported for stability.
>
> Not necessarily - changes can be backported and then tested on the
> release branch before it is frozen/tagged for a release candidate.

this will violate your rule of the same autotools per branch, but frankly I don't care as long as we allow for the extra time that will be needed to certify that the new bootstrap environment works.

> That would allow more aggressive changes to be implemented in trunk
> that are not intended for backport.

trunk already has several changes that are not intended for backport, and they are not intended for release either, or we would have a 3.2 branch already.

>>> d) Can anyone volunteer to provide a stable bootstrap environment
>>> (e.g. a virtual server) just for Ganglia? Two such environments may
>>> be needed, one for trunk and one for the current release branch.
>>
>> Matt did offer an EC2 instance if we could agree on an OS version :
>>
>> http://www.mail-archive.com/ganglia-developers@lists.sourceforge.net/msg05271.html
>>
>> I suggested Debian 5.0 (more conservative) or Fedora 12 (to be updated
>> more frequently) but as long as it is agreed, documented and
>> reproducible anything should work.
>
> I prefer Debian 5.0 (lenny), that is what I have on my laptop, home PC
> and various other infrastructure that I use. Elsewhere I am using
> RHEL3/4/5.

Debian 5.0 is also what is being used for bugzilla AFAIK, so that might be a good option for consolidation.

> We also have access to the OpenCSW build farm, and they are willing to
> consider applications for access by Ganglia developers, so we could
> look at that as a bootstrap environment.

Bootstrapping is done only once per package, so it wouldn't make sense to also do bootstrapping on Solaris. Having the OpenCSW build farm as part of our test builds would be a great way to ensure Solaris users are better supported, though.

Carlo
[Ganglia-developers] [RFC] two step gmond initialization
Greetings,

the following patch (which is never meant to be committed, and is therefore very ugly on purpose) is a proof of concept for an alternative to the recent problematic feature proposal of returning failure status for gmond that is part of 3.1.3^H4^H5.

it has been tested on Linux amd64 and OpenBSD amd64 and applies to trunk (it includes reverting r2025 for simplicity).

it replaces apr_proc_detach with an inline implementation of it in plain POSIX, which should most likely be just as portable (at least for the platforms we care about). it intentionally doesn't include any error checking, to make the functionality obvious, and it has been implemented by brute-force search and replace, so it is definitely missing several other interesting failure paths.

Carlo
---
Index: lib/error_msg.c
===================================================================
--- lib/error_msg.c (revision 2133)
+++ lib/error_msg.c (working copy)
@@ -21,6 +21,7 @@
 int daemon_proc; /* set nonzero by daemon_init() */
 int ganglia_quiet_errors = 0;
+int gmond_status = 0;
 static void err_doit (int, int, const char *, va_list);
@@ -121,7 +122,8 @@
    va_start (ap, fmt);
    err_doit (0, LOG_ERR, fmt, ap);
    va_end (ap);
-   exit (1);
+   gmond_status = 1;
+   exit (gmond_status);
 }
 /* Print a message and return to caller.
Index: gmond/gmond.c
===================================================================
--- gmond/gmond.c (revision 2133)
+++ gmond/gmond.c (working copy)
@@ -84,6 +84,9 @@
 /* The directory where DSO modules are located */
 char *module_dir = NULL;
+static int pipefd[2];
+extern int gmond_status;
+
 /* The array for outgoing UDP message channels */
 Ganglia_udp_send_channels udp_send_channels = NULL;
@@ -214,6 +217,13 @@
 char **gmond_argv;
 extern char **environ;
+void gmond_terminate()
+{
+  if (daemon_proc) {
+    write(pipefd[1], &gmond_status, sizeof(gmond_status));
+  }
+}
+
 /* apr_socket_send can't assure all characters in buf been sent. */
 static apr_status_t
 socket_send(apr_socket_t *sock, const char *buf, apr_size_t *len)
@@ -263,7 +273,8 @@
   exit(0);
 #endif
   err_msg("execve failed to reload %s: %s", gmond_bin, strerror(errno));
-  exit(1);
+  gmond_status = 1;
+  exit(gmond_status);
 }
 /* this is just a temporary function */
@@ -317,9 +328,25 @@
   if(!args_info.foreground_flag && should_daemonize && !debug_level)
     {
       char *cwd;
+      pid_t cpid;
       apr_filepath_get(&cwd, 0, global_context);
-      apr_proc_detach(1);
+      pipe(pipefd);
+      cpid = fork();
+      if (cpid > 0) {
+        close(pipefd[1]);
+        read(pipefd[0], &gmond_status, sizeof(gmond_status));
+        close(pipefd[0]);
+        _exit(gmond_status);
+      }
+      atexit(gmond_terminate);
+      close(pipefd[0]);
+      chdir("/");
+      setsid();
+      setpgid(0, 0);
+      freopen("/dev/null", "r", stdin);
+      freopen("/dev/null", "w", stdout);
+      freopen("/dev/null", "w", stderr);
       apr_filepath_set(cwd, global_context);
       /* enable errmsg logging to syslog */
@@ -359,7 +386,8 @@
   if(deaf && mute)
     {
       err_msg("Configured to run both deaf and mute. Nothing to do. Exiting.\n");
-      exit(1);
+      gmond_status = 1;
+      exit(gmond_status);
     }
 }
@@ -404,7 +432,8 @@
   if(!acl)
     {
       err_msg("Unable to allocate memory for ACL. Exiting.\n");
-      exit(1);
+      gmond_status = 1;
+      exit(gmond_status);
     }
   default_action = cfg_getstr( acl_config, "default");
@@ -419,7 +448,8 @@
   else
     {
       err_msg("Invalid default ACL '%s'. Exiting.\n", default_action);
-      exit(1);
+      gmond_status = 1;
+      exit(gmond_status);
     }
   /* Create an array to hold each of the access instructions */
@@ -427,7 +457,8 @@
   if(!acl->access_array)
     {
       err_msg("Unable to malloc access array. Exiting.\n");
-      exit(1);
+      gmond_status = 1;
+      exit(gmond_status);
     }
   for(i=0; i< num_access; i++)
     {
@@ -440,7 +471,8 @@
           /* This shouldn't happen unless maybe acl is empty and
            * the safest thing to do it exit */
           err_msg("Unable to process ACLs. Exiting.\n");
-          exit(1);
+          gmond_status = 1;
+          exit(gmond_status);
         }
       ip = cfg_getstr( access_config, "ip");
@@ -449,7 +481,8 @@
       if(!ip && !mask && !action)
         {
           err_msg("An access record requires an ip, mask and action. Exiting.\n");
-          exit(1);
+          gmond_status = 1;
+          exit(gmond_status);
         }
       /* Process the action first */
@@ -464,7 +497,8 @@
       else
         {
           err_msg("ACL access entry has action '%s'. Must be deny|allow. Exiting.\n", action);
-          exit(1);
+          gmond_status = 1;
+          exit(gmond_status);
         }
       /* Create the subnet */
@@ -473,7 +507,8 @@
       if(status != APR_SUCCESS)
         {
           err_msg("ACL access entry has invalid ip('%s')/mask('%s'). Exiting.\n", ip, mask);
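The idea behind the proof of concept above (fork first, let the child initialize, and have the parent wait on a pipe for the child's status before exiting) can be sketched in isolation as follows. This is a minimal, hedged sketch in plain POSIX and not the code from the patch: the names two_step_start, init_fn, demo_init_ok and demo_init_fail are hypothetical, the child reports its status explicitly right after initialization rather than through an atexit() handler, and the redirection of stdin/stdout/stderr is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical callback: performs the daemon's setup (bind sockets, load
 * modules, ...) and returns 0 on success or a nonzero exit status. */
typedef int (*init_fn)(void);

/* Fork a child that runs init() and reports the result through a pipe.
 * In the parent: returns the status the child reported, or 1 if the child
 * died before reporting. The child never returns from this function. */
int two_step_start(init_fn init, void (*main_loop)(void))
{
    int pipefd[2];
    int status = 1;   /* assume failure until the child says otherwise */
    pid_t pid;

    assert(init != NULL);
    if (pipe(pipefd) != 0)
        return 1;

    pid = fork();
    if (pid < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        return 1;
    }

    if (pid > 0) {
        /* Parent: block until the child reports, then hand the status to
         * the caller (an init script would use it as the exit code). */
        ssize_t n;
        close(pipefd[1]);
        n = read(pipefd[0], &status, sizeof(status));
        close(pipefd[0]);
        return (n == (ssize_t)sizeof(status)) ? status : 1;
    }

    /* Child: detach from the session, initialize, report, then run. */
    close(pipefd[0]);
    setsid();
    status = init();
    if (write(pipefd[1], &status, sizeof(status)) != (ssize_t)sizeof(status))
        _exit(1);
    close(pipefd[1]);
    if (status == 0 && main_loop != NULL)
        main_loop();
    _exit(status);
}

/* Hypothetical stand-ins for a real initialization routine. */
int demo_init_ok(void)   { return 0; }
int demo_init_fail(void) { return 3; }
```

A caller would typically end with something like `return two_step_start(my_init, my_loop);` from main(), so that whoever started the daemon sees a nonzero exit code when initialization fails. Because all sockets are created in the child after fork(), nothing created before the fork needs to survive it.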
Re: [Ganglia-developers] [Ganglia-general] Ganglia 3.1.5 beta ready for final testing
On Wed, Dec 02, 2009 at 07:41:39PM +0000, Daniel Pocock wrote:
> Therefore, the approach might need to be some combination of the
> solutions. E.g. a configure option that allows people to choose the
> new behaviour or the old behaviour.

-1, this will double our supported paths for almost no gain, knowing that at least 50% of them are broken, and it still underscores the nature of the problem: changing the initialization would also affect (in a platform-specific way) things like threaded gmond modules and the resources they rely on, just as an example.

> As we know the new behaviour works on Solaris and Linux

with the version of APR that it was tested with, which is also a moving target.

> then the package can be built the new way on those platforms by
> default. On BSD, users could choose what they want by setting a
> configure option. If a user had an updated apr (provided such update
> is feasible), they might compile with the new behaviour.

again, this is not a BSD-specific problem (indeed I suspect that Solaris might be affected as well, especially in cases where APR was compiled to use port_getn), because apr_poll_* then has slightly different semantics than poll and could therefore result in platform-specific failures that might not be as obvious as the kqueue one was for the BSDs.

the problem that we were trying to solve was just to correctly propagate the status from the gmond daemon to the caller; for a proof of concept in that direction (as suggested before) refer to :

http://www.mail-archive.com/ganglia-developers@lists.sourceforge.net/msg05390.html

Carlo
[Ganglia-developers] [RFC] status update for removing ganglia release names from the code
Jesse,

There is a backport request for 3.1 labeled "build: remove ganglia release name from the code" which has a veto from you that I would like to see reconsidered.

your objection refers to a thread[1] that includes the explanation of why this backport proposal is consistent with the consensus at that time (which has since changed[2]), as it only removes the name from the web frontend configuration, where it wasn't being used (dead code):

http://www.mail-archive.com/ganglia-developers@lists.sourceforge.net/msg04719.html

It is important to note that since the proposal has been stalled for a long time it can no longer be cleanly backported from trunk, so to simplify the reviewing process a conflict-free version of it is attached to this email.

Carlo

[1] http://www.mail-archive.com/ganglia-developers@lists.sourceforge.net/msg04697.html
[2] http://www.mail-archive.com/ganglia-developers@lists.sourceforge.net/msg05246.html

Property changes on: .
___________________________________________________________________
Modified: svn:mergeinfo
   Merged /trunk/monitor-core:r1703,1731

Index: configure.in
===================================================================
--- configure.in (revision 2135)
+++ configure.in (working copy)
@@ -84,7 +84,6 @@
 AC_SUBST(GANGLIA_MINOR_VERSION)
 AC_SUBST(GANGLIA_MICRO_VERSION)
 AC_SUBST(GANGLIA_VERSION)
-AC_SUBST(GANGLIA_RELEASE_NAME)
 AC_SUBST(REL)
 AC_SUBST(LIBGANGLIA_INTERFACE_AGE)
Index: web/version.php.in
===================================================================
--- web/version.php.in (revision 2135)
+++ web/version.php.in (working copy)
@@ -6,6 +6,5 @@
 $microversion = "@GANGLIA_MICRO_VERSION@";
 $ganglia_version = "@GANGLIA_VERSION@";
-$ganglia_release_name= "@GANGLIA_RELEASE_NAME@";
 ?>
Re: [Ganglia-developers] [Ganglia-general] Ganglia 3.1.5 beta ready for final testing
On Wed, Dec 02, 2009 at 01:57:44AM +0000, Carlo Marcelo Arenas Belon wrote:
> On Tue, Dec 01, 2009 at 10:20:32PM +0000, Daniel Pocock wrote:
>> - Can you easily re-compile APR with a different poll implementation?
>>   I think you can change it from configure.
>
> Which option?, --enable-other-child doesn't make a difference and
> considering how many different versions of APR are installed in all
> affected systems I would be surprised this to be an APR issue.

and surprised I am, as the problem goes away if APR is forced to use poll instead of kqueue.

but that of course requires a patched version of apr (including bootstrapping) and is probably not an option, unless we go back to the dark ages of including all dependencies statically.

if anyone is interested, I am attaching a patch for apr-1.3.9 which could be used to fix this problem on {Free,Net,Open}BSD. it also requires that ganglia be linked with the patched library, by doing something like the following (using /opt/ganglia to avoid clashing with the system-provided packages, and ignoring the fact that you would need to be root with a Bourne shell to execute this incantation, which is very unlikely to be a good idea anyway) :

  # mkdir -p /opt/ganglia
  # tar -xvzf apr-1.3.9.tar.gz
  # cd apr-1.3.9
  # patch -p1 < apr-1.3.9-configure-disablekqueue.patch
  # ./configure --prefix=/opt/ganglia
  # make
  # make install
  # cd ..
  # tar -xvzf ganglia-3.1.5.tar.gz
  # cd ganglia-3.1.5
  # ./configure --prefix=/opt/ganglia --with-libapr=/opt/ganglia/bin/apr-1-config
  # make
  # make install
  # LD_LIBRARY_PATH=/opt/ganglia/lib /opt/ganglia/bin/gmond

Carlo

PS. DragonFlyBSD will still be affected, and Mac OS X was probably, luckily, not.

--- apr-1.3.9/configure Mon Sep 21 14:59:34 2009
+++ apr-1.3.9/configure Wed Dec 2 01:45:45 2009
@@ -5762,6 +5762,10 @@
     ac_cv_o_nonblock_inherited="yes"
   fi
+  if test -z "$ac_cv_func_kqueue"; then
+    test "x$silent" != "xyes" && echo "  setting ac_cv_func_kqueue to \"no\""
+    ac_cv_func_kqueue="no"
+  fi
   ;;
 *-netbsd*)
@@ -5792,6 +5796,10 @@
     ac_cv_o_nonblock_inherited="yes"
   fi
+  if test -z "$ac_cv_func_kqueue"; then
+    test "x$silent" != "xyes" && echo "  setting ac_cv_func_kqueue to \"no\""
+    ac_cv_func_kqueue="no"
+  fi
   ;;
 *-freebsd*)
@@ -5838,15 +5846,12 @@
   fi
 fi
-# prevent use of KQueue before FreeBSD 4.8
-if test $os_version -lt 48; then
-
+# prevent use of KQueue
   if test -z "$ac_cv_func_kqueue"; then
     test "x$silent" != "xyes" && echo "  setting ac_cv_func_kqueue to \"no\""
     ac_cv_func_kqueue="no"
   fi
-fi
   ;;
 *-k*bsd*-gnu)
Re: [Ganglia-developers] [Ganglia-general] Ganglia 3.1.5 beta ready for final testing
On Wed, Dec 02, 2009 at 11:17:26AM +0000, Daniel Pocock wrote:
> Carlo Marcelo Arenas Belon wrote:
>> On Wed, Dec 02, 2009 at 10:36:02AM +0000, Daniel Pocock wrote:
>>> Can you try re-enabling kqueue and patching apr to use rfork()?
>>
>> Doesn't work, and fails now on sending of the metrics, because of
>> course this time the parent process closes that socket and the child
>> can't use it after that.
>>
>> The only viable solution I see is to delay the creation of all the
>> sockets until daemonized, as it was being done originally.
>
> The problem with that is that if another process is already listening
> on one of the ports wanted by gmond, then the listener set up will
> fail, but if the problem is only detected after daemonizing, then the
> caller doesn't know about the failure.

but that is something that could be fixed at the caller level by just checking whether the port is already bound to something before calling gmond.

agreed that it is not elegant, but it is better than the current situation where you can't start gmond at all.

>> If you really need to have the parent report back on issues like that,
>> then you are going to have to keep the parent around and send the
>> status back from the child, until it gets into the main loop, through
>> a unix socket or similar, which as you suggested originally was
>> another option.
>
> That is not as easy to implement in apr as the apr_proc_detach() call.

frankly I don't much like all the abstractions that apr_* provides, because they make simple things like this more complicated (especially because of the unintended side effects), but since apr_proc_detach is just calling fork and reopening the 3 std filehandles it shouldn't be that difficult to work around.

> apr_proc_fork() is described as the only call in apr that is not
> portable. apr_proc_create() could be used to invoke another gmond
> process, but I'm not sure that apr guarantees to preserve the file
> descriptors and memory allocations across that call.

apr_proc_fork() is not called by apr_proc_detach() AFAIK; indeed I was surprised to see that it even existed when I noticed that apr_proc_detach calls fork() directly.

> Maybe the problem has something to do with the way detach recycles
> stdin/stdout/stderr? As a quick test, could you try modifying gmond.c
> so that it calls fork() directly rather than calling
> apr_proc_detach()?

fork() doesn't work because the kqueue filehandle is not inherited; using rfork() instead doesn't work either, because all filehandles are closed by the exit(0) in the parent, so it fails in the same way that apr_proc_detach() does when changed to use rfork().

Carlo
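The caller-level workaround mentioned above (checking whether the port is already taken before starting gmond) could look roughly like the following. This is only a sketch under stated assumptions: tcp_port_is_free is a hypothetical helper, it covers just an IPv4 TCP listener on all interfaces (such as the TCP/8649 channel discussed in this thread), and it is inherently racy, since another process may grab the port between the check and the real bind.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a TCP socket can currently be bound to
 * `port` on all IPv4 interfaces, 0 if the port is taken or on any error. */
int tcp_port_is_free(unsigned short port)
{
    struct sockaddr_in addr;
    int on = 1;
    int ok;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return 0;

    /* Avoid false alarms from old connections lingering in TIME_WAIT;
     * an active listener on the port still makes bind() fail. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    ok = (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0);
    close(fd);   /* releases the port again, we only probed it */
    return ok;
}
```

An init script wrapper could call `tcp_port_is_free(8649)` and refuse to launch the daemon when it returns 0. That only narrows the window for the failure described above rather than closing it, which is why the thread treats it as a stopgap.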
Re: [Ganglia-developers] [Ganglia-general] Ganglia 3.1.5 beta ready for final testing
On Wed, Dec 02, 2009 at 11:48:51AM +0000, Daniel Pocock wrote:
>> fork() doesn't work because the kqueue filehandle is not inherited;
>> using rfork() instead doesn't either because all filehandles are
>> closed by doing exit(0) in the parent and so fails in the same way
>> that changing apr_proc_detach() does when changed to use rfork()
>> instead.
>
> I'm not a BSD expert, do you know if there is any ioctl or something
> that can be used to tell BSD to keep the file descriptors for the
> child process?

not a BSD expert either, but I would think that is very unlikely.

I would suggest reverting r2025 in trunk and starting to look for an alternative solution, but it would probably be easier to just revert r2043 for 3.1 as well to solve the release blocker, with the possibility of adding some logic to the init script to help with the test case you were trying to prevent with the original feature.

Carlo

PS. apache httpd must have a solution, as they don't seem to have kqueue disabled, but that solution is probably just to delay the port binding as was done originally (except that they manage the failures better)
Re: [Ganglia-developers] vc++ build
On Mon, Nov 30, 2009 at 12:19:48PM -0500, Gladish, Jacob wrote:
> I believe this has come up in the past, but does anyone know if there's
> interest or any progress made on the native win32 build/port?

there has been some slow progress on win32 support in general, but using mingw instead (which is easier to work with than vc++ for an autotools-based project). all of this is highly experimental and only available on trunk (3.2).

Carlo
Re: [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing
On Tue, Dec 01, 2009 at 10:20:32PM +0000, Daniel Pocock wrote:
> - Could it be a security issue? Can you try disabling setuid? It
>   appears that listen channels are only set up after setuid, but maybe
>   there is something else.

still reproducible with setuid = off

> - Have you tried different versions of APR? E.g. on RHEL5, I test with
>   the native apr-1.2.7, and on Debian I have 1.2.12-5

OpenBSD 4.5 comes with 1.2.11p2, but it also failed with a manually installed 1.3.8

> - Can you easily re-compile APR with a different poll implementation?
>   I think you can change it from configure.

Which option? --enable-other-child doesn't make a difference, and considering how many different versions of APR are installed on all the affected systems I would be surprised if this were an APR issue.

> - If you take 3.1.2 or another release and apply this patch only, do
>   you see the same bug?

yes

Carlo
Re: [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing
On Mon, Nov 30, 2009 at 08:12:34AM +0000, Daniel Pocock wrote:
> Carlo Marcelo Arenas Belon wrote:
>> On Sun, Nov 29, 2009 at 10:57:01AM +0000, Carlo Marcelo Arenas Belon wrote:
>>> On Tue, Nov 24, 2009 at 06:03:51PM -0800, Bernard Li wrote:
>>>> Please help us test on as many OS/archs as possible, as this would
>>>> go GA quite immediately ;-)
>>>
>>> FreeBSD is not able to return any XML data through TCP/8649 (tested
>>> with FreeBSD 8.0 amd64).
>>
>> the problem wasn't actually the TCP/8649 service but the fact that
>> gmond was going into an infinite loop after sending the first metric
>> update.
>>
>> the issue was tracked down to r2043 and a 3.1.5 development package
>> with that patch reverted is available for testing from :
>>
>> http://sajino.sajinet.com.pe/ganglia/ganglia-3.1.5.2101.tar.gz
>
> Did you see this issue with 3.1.3 or 3.1.4? They both contain the same
> patch.

Both 3.1.3 and 3.1.4 should have the same problem, but I haven't been able to test 3.1.3 since it is no longer available (FreeBSD 8 was just released a couple of days ago anyway).

3.1.4 shows the same behavior, at least there, and the fixed package also seems to work fine with OpenBSD 4.4 amd64, NetBSD 4 i386 and DragonFlyBSD 2.4.1 i386 and amd64 (after also being patched with r2124 to work around BUG245).

>>> DragonFlyBSD fails to build but a 3.2 version of ganglia which
>>> includes fixes for that fails with the same TCP issue than FreeBSD
>>> and so this issue might be affecting other BSD as well.
>>
>> confirmed also to be affecting OpenBSD (tested with OpenBSD 4.5 amd64)
>> but considering the nature of the fix wouldn't be surprised if other
>> configurations were also affected.
>
> Are you proposing a fix or just revert the change?

Your call, even though a fix for this feature would probably be preferred, as there is nothing special about the BSDs for them to be affected, so the problem might be more generic.

At least a revert would be needed for 3.1, as this counts as a regression, but I haven't done so yet; I am waiting for you to first revert it on trunk and then decide how to proceed from there, depending on how critical this feature was for the release.

> The change has been working on Linux, Solaris and Cygwin.

Other than doing a manual bisect (using git instead of svn would have been useful here) to find where the problem was introduced, and validating that reverting it corrects the problem, I haven't done much analysis of it. But the fact that it broke in such a strange way (I was indeed expecting the culprit to be somewhere else, especially considering all the recent changes in the networking code and the fact that it originally seemed to be triggered by a TCP request) probably points to a bigger issue which just happens not to have been visible on the configurations used to test Linux, Solaris and Cygwin, especially considering how pervasive it was (it broke every BSD I had access to test, at least).

Carlo
Re: [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing
On Mon, Nov 30, 2009 at 01:29:34PM +0000, Daniel Pocock wrote:
> Carlo Marcelo Arenas Belon wrote:
>> Your call, eventhough a fix for this feature will be probably
>> preferred as there is nothing special about the BSD for them to be
>> affected and it might be that the problem is therefore more generic.
>
> It may be that this bug is revealing a more serious issue in the way
> initialisation is done, so I would prefer to know the real cause
> rather than just revert the change that forces the problem to show
> itself.

agreed, and as I said before, that is the reason why I didn't just revert it from trunk or 3.1 as a fix, even if that seems to resolve the problem.

>> At least a revert would be needed for 3.1 as this accounts for a
>> regression but haven't done so either waiting for you to first revert
>> it on trunk and then decide on how to proceed from there depending on
>> how critical this feature was for the release.
>
> I agree that it is a regression, but reverting it may cause the real
> culprit to remain hidden. I'd rather hold the release while we look
> more closely.

not sure I understand what you meant here, since it seems obvious to me that 3.1.5 can't be released as is if a fix (even if it is just reverting the change) is committed.

are you saying you want to hold off on deciding whether or not to release 3.1.5, or to see what will be in 3.1.6? if the latter, I would suggest also pulling in some other fixes, and of course that would also require us to agree on a bootstrapping environment, for this release at least.

>>> The change has been working on Linux, Solaris and Cygwin.
>>
>> Other than just doing a manual bisect (using git instead of svn here
>> would had been useful) to find where the problem was introduced and
>> validate that reverting it corrects the problem haven't done much
>> analysis of it, but the fact that it broke in such a strange way (was
>> indeed expecting the culprit to be somewhere else, specially
>> considering all recent changes in the networking and the fact that it
>> seemed originally to be triggered by a TCP request) probably points
>> to a bigger issue which just happens to have not been visible on the
>> configurations used to test Linux, Solaris and Cygwin, specially
>> considering how pervasive it was (broke all BSD I had access to test,
>> at least)
>
> Can you provide output from strace/truss and also a stack trace from
> the point where it is in the infinite loop?

filed BUG246 with the trace information (collected from OpenBSD 4.5 amd64 using ktrace), but you got me there. from the way the problem presents itself it isn't really obvious where the offending code is, and it is difficult to debug as well, since it disappears when in debug mode or when not running as a daemon, which is the reason why I haven't been able to capture a backtrace yet either.

> There is a good reason for moving the daemonize code the way I did -
> an alternative would be to daemonize, but make the original process
> hang around until the daemon process has entered the main loop.

OK, and I assume it is related to the cases where gmond suddenly dies at startup without notification, but some clarification of the problem you were trying to solve would probably be useful too.

Carlo
Re: [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing
On Tue, Nov 24, 2009 at 06:03:51PM -0800, Bernard Li wrote:
> Please help us test on as many OS/archs as possible, as this would go
> GA quite immediately ;-)

FreeBSD is not able to return any XML data through TCP/8649 (tested with FreeBSD 8.0 amd64).

DragonFlyBSD fails to build, but a 3.2 version of ganglia which includes fixes for that fails with the same TCP issue as FreeBSD, so this issue might be affecting other BSDs as well.

Carlo
[Ganglia-developers] bind and bind_hostname parameters in udp_send_channel
Greetings,

As part of 3.1.3 (then 3.1.4 and now 3.1.5) two additional parameters were added to the configuration for udp_send_channel; they were not documented but are otherwise very useful.

after adding some basic documentation to trunk in r2122 and using them, I have found that the interface should be improved before it gets released, by either :

* removing bind_hostname and overloading that functionality onto bind by defining a magic value which means (resolve default hostname) like .

* keeping bind_hostname but converting it into a boolean, so it can be set like all the other flags in gmond.conf, and handling better what to do when both parameters are provided (currently bind_hostname seems to silently override bind)

Carlo

PS. the documentation should also be backported to 3.1 once it is stabilized
[Ganglia-developers] RFC: release history in ganglia-3.1's STATUS file
Greetings,

while looking at a fix for the broken gmond on at least some of the BSD platforms, which was reported here :

http://www.mail-archive.com/ganglia-developers@lists.sourceforge.net/msg05366.html

I also noticed that the STATUS file for the 3.1 branch had some confusing history which, based on our own wiki [1] and a sample file from apache [2] (which was used as a basis for it), I assume should instead be something like :

Index: STATUS
===================================================================
--- STATUS (revision 2122)
+++ STATUS (working copy)
@@ -7,9 +7,10 @@
 Release history:
-3.1.5(hargrave) : Released: In Development
-3.1.4(???)      : Released: Not released
-3.1.3(avenger)  : Released: Not released
+3.1.6(hargrave) : In Development
+3.1.5(hargrave) : Tagged: Nov 24, 2009
+3.1.4(hargrave) : Tagged: Oct 26, 2009 (not released)
+3.1.3(avenger)  : Tagged: Sep 19, 2009 (not released)
 3.1.2(langley)  : Released: Feb 17, 2009
 3.1.1(wien)     : Released: Sep 10, 2008
 3.1.0(amelia)   : Released: Jul 30, 2008

the main differences and points for discussion being :

* Until we get rid of the release names, and unless the release name changes, it should be assumed that the release name stays the same (as reported during configure).

* 3.1.3, 3.1.4 and 3.1.5 were released at least as betas, and the last version of that file misses that information.

* The "not released" status of 3.1.3 and 3.1.4 used to be "no GA", and that might be a better way to identify releases that went through beta cycles but never went to GA.

* 3.1.5 is either Released (which included tagging and a beta package) or in development, and since there was an announcement for testing I assume it is at least in the same state that 3.1.3 and 3.1.4 were. The fact that the GANGLIA_NANO_VERSION and GANGLIA_SNAPSHOT settings weren't updated to reflect that was probably just an oversight, which has been corrected in r2123.

* 3.1.6 has no commits yet but should be open for development, at least for bugfixes for 3.1.5 (if that gets scrapped) or to include other features/bugfixes which have otherwise been on hold since the feature freeze for avenger was called.

in any case, I am looking forward to comments on this so that the fixes (if needed) can be committed, but especially so that it is clear how to proceed until the GA status of 3.1.5 is decided.

Carlo

[1] http://sourceforge.net/apps/trac/ganglia/wiki/how_project_works
[2] http://svn.apache.org/viewvc/httpd/httpd/branches/2.0.x/STATUS?revision=882861&view=markup
Re: [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing
On Sun, Nov 29, 2009 at 10:57:01AM +0000, Carlo Marcelo Arenas Belon wrote:
> On Tue, Nov 24, 2009 at 06:03:51PM -0800, Bernard Li wrote:
>> Please help us test on as many OS/archs as possible, as this would go
>> GA quite immediately ;-)
>
> FreeBSD is not able to return any XML data through TCP/8649 (tested
> with FreeBSD 8.0 amd64).

the problem wasn't actually the TCP/8649 service but the fact that gmond was going into an infinite loop after sending the first metric update.

the issue was tracked down to r2043, and a 3.1.5 development package with that patch reverted is available for testing from :

http://sajino.sajinet.com.pe/ganglia/ganglia-3.1.5.2101.tar.gz

> DragonFlyBSD fails to build but a 3.2 version of ganglia which includes
> fixes for that fails with the same TCP issue than FreeBSD and so this
> issue might be affecting other BSD as well.

confirmed to also be affecting OpenBSD (tested with OpenBSD 4.5 amd64), but considering the nature of the fix I wouldn't be surprised if other configurations were also affected.

Carlo

CC Daniel as the release manager for 3.1.5 and author of the problematic feature.
Re: [Ganglia-developers] bootstrapping for 3.1.X series and 3.2.X
On Wed, Nov 25, 2009 at 11:00:21AM +0000, Daniel Pocock wrote:
> a) is it preferred that we release 3.1.4 or that we release 3.1.5, or a third option, roll a 3.1.6 tarball using the same environment where 3.1.2 was bootstrapped?

3.1.2 had a bootstrapping problem which resulted in it failing to build by default on multilib amd64/i386 systems if both the 32-bit and 64-bit versions of the dependencies (libapr, confuse) were installed.

3.1.4 used the same bootstrapping as 3.1.1 and so was IMHO better, but because there were multiple 3.1.4 packages it is probably difficult to know which one was validated, and that was AFAIK one of the reasons why it wasn't eventually released.

> b) should the choice of bootstrap environment be locked for all 3.1.X, and only changed when increasing the minor version number (e.g. when we go from 3.1 to 3.2)?

No, but since our build system is full of hacks and not completely reliable, it might be a good idea to test that no issues are introduced when looking at a new version.

> c) what environment should be used to bootstrap 3.2.X/trunk?

The same as 3.1, so that all improvements in the build system will be tested there and then backported for stability.

> d) Can anyone volunteer to provide a stable bootstrap environment (e.g. a virtual server) just for Ganglia? Two such environments may be needed, one for trunk and one for the current release branch.

Matt did offer an EC2 instance if we could agree on an OS version:

http://www.mail-archive.com/ganglia-developers@lists.sourceforge.net/msg05271.html

I suggested Debian 5.0 (more conservative) or Fedora 12 (to be updated more frequently), but as long as it is agreed, documented and reproducible, anything should work.

Carlo
[Ganglia-developers] Using git for ganglia source code management (was Re: [Ganglia-general] Ganglia 3.1.4 beta ready for testing)
Changing subject and list to better focus this thread.

On Mon, Nov 02, 2009 at 10:57:57PM +0000, Daniel Pocock wrote:
> The discussions about bootstrapping and versioning brings me to another issue - does anyone have any interest in using git instead of SVN?

+1, but beware that the automatic ChangeLog generation, as well as the release flows, will need to be adjusted, as they were designed around subversion.

> I notice it can do some handy tricks, like generating version numbers that reflect the tag you are building in

It can also do some more interesting tricks, like pushing/pulling from a subversion server, so there is really no need to force anyone to migrate either.

> as well as all the benefits of distributed version control.

Which is IMHO the biggest selling point, as it allows for more participation: it is easier to contribute patches and maintain them even without access to the main repository.

Carlo
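As an illustration of the "version numbers that reflect the tag" trick mentioned above: `git describe` prints either a bare tag name or `<tag>-<commits since tag>-g<abbreviated hash>`. A minimal Python sketch of turning that into a snapshot-style version follows. The function name and the "tag.count" numbering scheme are assumptions made for this example and are not anything the Ganglia build system is known to use.

```python
import re

# Matches the "<tag>-<count>-g<sha>" form printed by `git describe`.
_DESCRIBE_RE = re.compile(r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<sha>[0-9a-f]+)$")


def version_from_describe(describe):
    """Map `git describe` output to a dotted version (hypothetical scheme)."""
    text = describe.strip()
    match = _DESCRIBE_RE.match(text)
    if match is None:
        # Building exactly on a tag: git describe prints only the tag name.
        return text
    # Append the number of commits since the tag as an extra component.
    return "%s.%s" % (match.group("tag"), match.group("count"))


if __name__ == "__main__":
    print(version_from_describe("3.1.4-12-g1a2b3c4"))
    print(version_from_describe("3.1.4"))
```

The input strings here are made-up samples; in a real build the text would come from running `git describe` in the checkout.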
Re: [Ganglia-developers] [Ganglia-general] Ganglia 3.1.4 beta ready for testing
On Mon, Nov 02, 2009 at 03:05:32PM -0800, Bernard Li wrote:
> Can you please test this tarball bootstrapped on Fedora 9.

It works, but would invalidate all testing that was done for 3.1.3 and the original 3.1.4.

> If it works I will replace the original tarball with this:
>
> http://ganglia.info/testing/bootstrapped_on_fedora9/ganglia-3.1.4.tar.gz

-1

Changing the release package in the middle of a release is a bad idea; indeed, changing it without bumping the release version goes against our release procedures, as it could result in different binary packages. That was the reason why the unofficial package I provided was published far from the ganglia servers: to hopefully avoid any confusion and frustration if someone later finds a bug which happens to be reproducible only in the other version.

There is also the risk of introducing a bug (like the one in 3.1.2 from bootstrapping in SuSE with automake 1.9.6, which prevented users that had the 32-bit libraries for apr installed on 64-bit systems from getting a working build). So as much as I am excited about finally moving to more modern versions of autotools, this only makes sense as part of 3.1.5, which will hopefully also allow enough time to remove all the needed hacks and finally clean up the bootstrapping code.

Carlo
Re: [Ganglia-developers] [Ganglia-general] Ganglia 3.1.4 beta ready for testing
On Mon, Nov 02, 2009 at 11:09:40PM +0000, Daniel Pocock wrote:
> I note Paul is using gcc, whereas I'm building and testing with Sun Studio on the OpenCSW build farm - Sun's compiler is now a free download, and it is used to build all the CSW libraries (including those used by Ganglia), so this is now the easiest solution to support - that, and Solaris 8 support, led me to tweak the configure.in stuff for Solaris - maybe it needs more tweaking to support gcc - would anyone like to comment on the preferred gcc build environment to be supported?

IMHO any gcc should work, and indeed gcc was the originally supported compiler for ganglia on Solaris (Sun Studio was added later in 3.1.1 when it was made freely available with OpenSolaris).

While working on libmetrics (as can be seen in the corresponding metrics.c file), the following versions of gcc were used (most of them using SUNWtoo and other SUNW-provided tools as part of the toolchain when possible):

* Solaris 7 x86 (32-bit) with gcc-2.8.1 (this one used GNU binutils AFAIK)
* Solaris 8 (64-bit) with gcc-3.3.1
* Solaris 9 (64-bit) with gcc-3.4.4
* Solaris 10 SPARC (64-bit) and x86 (32-bit and 64-bit) with SUNWgcc

> On the issue of the gcc environment, we basically need a second version of scripts/build-solaris.sh for gcc - this raises questions like should the libraries (apr, confuse) be built with gcc too? Which ld, ar, etc?

This is IMHO a packager's call. After all, we don't provide binaries (well, we do, but almost no one uses them) because, as you pointed out, the decision on which toolchain to use needs to be made at the distribution or system engineering level, and so we are left to support them all the best we can.

In cases where there is some overlap (like in the case of the CSW packages, where the package maintainers are also upstream contributors) or when it helps to simplify maintenance on a specific platform (like the CentOS 4 RPMs or the Makefile.WiX recipes for Cygwin), it makes sense to have some additional code to help with it, and also some more testing or confidence about the resulting binaries working as expected, but that should never be considered the only supported solution IMHO.

Carlo
Re: [Ganglia-developers] bootstrapping ganglia with modern autotool versions for release (was Re: Ganglia 3.1.4 beta ready for testing)
On Fri, Oct 30, 2009 at 12:28:03PM -0700, Bernard Li wrote:
> I have a Fedora 9 VM that I can use to bootstrap in the future -- would the autotools that come with that version work?

Something with libtool 2.2 would probably be better, as well as something that is still getting updates (in case there are bugs that need to be fixed).

Fedora 12 is going to be released in a couple of weeks, and therefore Fedora 10 will go out of support a month after that; Fedora 9 has already been EOL for more than 3 months:

https://www.redhat.com/archives/fedora-announce-list/2009-July/msg4.html

Carlo
Re: [Ganglia-developers] [Ganglia-general] Ganglia 3.1.4 beta ready for testing
On Tue, Oct 27, 2009 at 09:52:52AM +0000, Paul Sobey wrote:
> /usr/include/sys/feature_tests.h:336:2: error: #error Compiler or options invalid; UNIX 03 and POSIX.1-2001 applications require the use of c99
> make[2]: *** [getopt1.o] Error 1
>
> Googling leads me to try compiling with CFLAGS=-std=gnu99 per:
>
> http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=215

This is a bug in the autoconf from CentOS 4, which is used to build the release packages; therefore you can also work around the issue by rebootstrapping the package or making your own with a better version of the autotools.

For simplicity I'd uploaded an unofficial release package for 3.1.4, bootstrapped on Fedora rawhide, at:

http://sajino.sajinet.com.pe/ganglia/ganglia-3.1.4.tar.gz

> If I do that, compilation fails building against Python 2.6.2 (built with same toolchain):

Once you use -std=gnu99 it is no longer the same toolchain, and therefore building python with the same standard support should solve your problem.

Carlo
Re: [Ganglia-developers] [Ganglia-general] Ganglia 3.1.4 beta ready for testing
On Thu, Oct 29, 2009 at 04:44:59PM +0000, Paul Sobey wrote:
> I note from the Makefile Daniel posted:
>
> # Depends: some issues exist getting the Python support working on Solaris,
> # Ganglia's configure.in needs to be further enhanced for this to work

I think this is a CSW-specific problem, as I had no problem getting python support compiled on Solaris 10u7 x86 using SUNWPython-devel, SUNWgcc, SUNWlexpt and compiled versions of confuse and apr.

$ PATH=$PATH:/usr/sfw/bin:/usr/ccs/bin
$ ./configure CC="gcc -std=gnu99" --prefix=/usr/local --with-libapr=/usr/local/apr/bin/apr-1-config --with-libconfuse=/usr/local
$ make

Daniel, could you elaborate?

Carlo
Re: [Ganglia-developers] [Ganglia-general] Ganglia 3.1.4 beta ready for testing
On Thu, Oct 29, 2009 at 08:42:05PM +0000, Daniel Pocock wrote:
> Carlo Marcelo Arenas Belon wrote:
> > On Thu, Oct 29, 2009 at 04:44:59PM +0000, Paul Sobey wrote:
> > > I note from the Makefile Daniel posted:
> > >
> > > # Depends: some issues exist getting the Python support working on Solaris,
> > > # Ganglia's configure.in needs to be further enhanced for this to work
> >
> > Daniel, could you elaborate?
>
> Although I have described the Python module in the CSW Makefile, it is not something I have properly tested.

OK, and I haven't done any testing either, other than making sure it builds and that a mod_example-like module can be loaded, but my question was more about the need to change configure.in to support python modules, which you were referring to in the Makefile as Paul noted.

> I am still working through some core agent problems (e.g. see the discussion on csw-maintainers about building a 64 bit version of everything: I've noticed that when running a 32 bit binary on some 64 bit machines with lots of RAM, some kstat calls lead to a seg fault)

Care to provide a link to the thread or any bug reports? Earlier releases for 3.0 required 64-bit binaries as they were reading kernel memory directly to gather the statistics, but after those metrics were migrated to kstat that shouldn't be an issue anymore, and I am running some 32-bit 3.0 agents on Solaris SPARC with a significant amount of memory as well, so there might be a regression to track here.

Carlo
[Ganglia-developers] Solaris support for 3.1.4 (Was RE: Ganglia 3.1.4 beta ready for testing)
Trimming CC and changing Subject to better focus this thread.

On Thu, Oct 29, 2009 at 10:09:32PM +0000, Daniel Pocock wrote:
> Carlo Marcelo Arenas Belon wrote:
> > my question was more about the need to change configure.in to support python modules which you were referring about in the Makefile as Paul noted.
>
> I definitely remember playing with configure.in to try and get the Python support working with CSW, although I'm not certain what state I left it in.

Do you mean you have a modified 3.1 package that had extensions for CSW python support? Are those modifications available somewhere?

> I've done a diff from 2017:HEAD on trunk/configure.in, it appears that none of my changes for Python support on Solaris are in there.

Should I then assume that neither trunk nor ganglia-3.1 had any python support related patches committed from you?

> In one branch I started working on I tried setting up my own LDFLAGS for Python, e.g. in configure.in:
>
> LDFLAGS_PYTHON=-lpython${PY_VERSION}
>
> or for static:
>
> LDFLAGS_PYTHON=/opt/csw/lib/python2.3/config/libpython${PY_VERSION}.a -lm
>
> and using LDFLAGS_PYTHON when linking the Python module.

The following should be all that is needed IMHO (not tested, and assuming the location/name of the python binary from your previous comments):

./configure --with-python=/opt/csw/bin/python2.3

> However, I don't think this is best practice for configure.in. I can have a go at making it work, but it would be useful to agree on the compatibility requirements first: e.g. should compatibility with CSWpython be the main goal, or do we want to set some other criteria?

Not sure what you mean here, but AFAIK the objective was to be able to use python2.3 or higher (just because CentOS 4 uses that). Solaris 10 comes with python2.3 and python2.4 (through SUNWPython), but in theory any version of python should work if configure is pointed to it.

In Gentoo Linux 10.0 amd64, Fedora 12 or Ubuntu Karmic, which come with python 2.6, all python modules should work, even if the tcpconn.py module might warn about deprecated use of popen (which means that as soon as someone moves to Python 3, that module at least will break).

I think the core modpython should build with any python 2.x and maybe 1.x as well, but I don't think anyone ever tested/needed that.

> > > I am still working through some core agent problems (e.g. see the discussion on csw-maintainers about building a 64 bit version of everything: I've noticed that when running a 32 bit binary on some 64 bit machines with lots of RAM, some kstat calls lead to a seg fault)
> >
> > care to provide a link to the thread or any bug reports?, earlier releases for 3.0 required 64bit binaries as they were reading kernel memory directly to gather the statistics, but after those metrics were migrated to kstat that shouldn't be an issue anymore, and I am running some 32-bit 3.0 agent with solaris sparc with significant amount of memory as well, so there might be a regression to track here.
>
> I've been discussing the issue privately with Dago, it is easily reproducible on the host called build8st in the CSW build farm. All my latest packages are on the box already so if you request an account, you can try it. I'll forward you the email.

OK, the problem might be Solaris 8 specific then, since my Solaris 9 and 10 binaries didn't have that problem. Hopefully I will be able to figure out how to get a CSW account, but if you could get a core dump (better if from an unstripped binary) or some backtraces, that could help in debugging this issue.

> The more general discussion on building packages containing both 32 and 64 bit libraries started here:
>
> http://lists.opencsw.org/pipermail/maintainers/2009-October/004687.html

OK, do you have any references or documentation for the kstat requirement on 64-bit kernels? At least on my Solaris 10u7 system, vmstat is 32-bit and linked against a 32-bit version of libkstat (even if a 64-bit version is also available).

Carlo
Re: [Ganglia-developers] [Ganglia-general] Ganglia 3.1.4 beta ready for testing
On Thu, Oct 29, 2009 at 01:10:14PM -0700, Bernard Li wrote:
> On Thu, Oct 29, 2009 at 12:01 PM, Carlo Marcelo Arenas Belon care...@sajinet.com.pe wrote:
> > this is a bug on the autoconf from CentOS 4 which is used to build the release packages, therefore you can also workaround the issue by rebootstrapping the package or making your own with a better version of the autotools.
> >
> > for simplicity I'd uploaded an unofficial release package for 3.1.4 bootstrapped on fedora rawhide in :
> >
> > http://sajino.sajinet.com.pe/ganglia/ganglia-3.1.4.tar.gz
>
> Do you have a link for the bug, and are you aware whether there are updates for CentOS 4 to fix the issue?

I am not aware of a CentOS or RHEL bug report, but considering that EL4 is in maintenance mode there won't be a fix anyway (2.59 was released in 2003 and the last update to the package was in 2004).

> I guess I could start building on CentOS 5, provided that the autoconf does not have this bug.

CentOS 5 also uses autoconf 2.59, so it wouldn't help with this problem, but it might hopefully allow us to remove all the kludges that were added to work around the libtool 1.5.6 bugs which were preventing DragonFlyBSD support.

Ideally, which platform is used to bootstrap shouldn't be relevant though, and IMHO we should instead be aiming for the latest versions of the autotools (either installed by hand or provided as part of the distribution, if it is more development focused), and on Linux that usually means Fedora, Gentoo or Debian IMHO.

Carlo
[Ganglia-developers] bootstrapping ganglia with modern autotool versions for release (was Re: Ganglia 3.1.4 beta ready for testing)
Trimming CC and changing Subject to reflect thread better.

On Thu, Oct 29, 2009 at 04:54:17PM -0700, Bernard Li wrote:
> On Thu, Oct 29, 2009 at 4:47 PM, Carlo Marcelo Arenas Belon care...@sajinet.com.pe wrote:
> > Ideally, which platform is used to bootstrap shouldn't be relevant though and IMHO we should be instead aiming to the latest versions of the autotools (either installed by hand or provided as part of the distribution if more development focused) and for that when on Linux usually means Fedora, Gentoo or Debian IMHO.
>
> I have no problem with this in theory, but would new version of autotools create a tarball that is backward compatible?

I'm curious what a backward compatible tarball would look like, but if by that you meant that you can use the resulting configure script and supporting files on systems that are much older and that never had a package for autotools of the same version that was used, then the answer is yes. That is the whole point of autotools anyway: to support systems other than the one originally used to build the code on, without requiring any of the autotools themselves to be installed.

As for testing, is there any problem you have found with the unofficial package that I posted, which was built on Fedora Rawhide x86?

http://sajino.sajinet.com.pe/ganglia/ganglia-3.1.4.tar.gz

Other than being able to work on Solaris and not asking for an unnecessary C++ compiler, I wouldn't expect it to be that different when used to build ganglia binaries.

Carlo
Re: [Ganglia-developers] [Ganglia-general] Ganglia 3.1.4 beta ready for testing
On Mon, Oct 26, 2009 at 04:51:33PM -0700, Bernard Li wrote:
> Ganglia 3.1.4 is ready for testing at:
>
> http://ganglia.info/testing/

DragonFlyBSD fails to build (tested with 2.4.0 32-bit).

This is not a regression (it is a system header problem which also affects 3.1.2), and there are some trivial unrelated changes in trunk which could help with that.

Carlo
Re: [Ganglia-developers] [Ganglia-general] Ganglia 3.1.4 beta ready for testing
On Tue, Oct 27, 2009 at 10:15:15AM +0000, Daniel Pocock wrote:
> Carlo Marcelo Arenas Belon wrote:
> > DragonFlyBSD fails to build (tested with 2.4.0 32bit).
>
> how do you propose we avoid missing stuff like this in future?

Not sure if this can be avoided, but there might be some things we could do to mitigate it, like:

* periodic snapshot releases
* automated buildfarm
* census of developers/users which could help with testing per platform
* some sort of unit tests or other tests which could be automated

> 3.1.3 was floating around for a while, and I tested it on RHEL3, RHEL4, RHEL5, Solaris 8, Solaris 10, Debian lenny and Cygwin.

And that was very useful, as it uncovered a problem in RHEL3 AFAIK.

> Do you think we need some checklist of supported platforms that must be verified before we tag in future?

The release notes include a list of supported platforms, which usually reflects the ones that we got reports were working fine during the testing cycle.

My objective with this report was to let you know that DragonFlyBSD wasn't one of them. I wasn't implying that the report should be considered a showstopper for the release either.

> We may need to have some buy-in from people willing to run the tests at short notice as a release date approaches, as not many people are going to have easy access to every supported platform.

And not all people will be available either.

> Maybe commits on the release branch need to be blocked while such testing is done?

This already happens AFAIK, even if there is no formal provision against it, just to simplify the job of the release manager, who would otherwise need to cherry-pick a release branch.

Setting up a list of objectives before the release is tagged might help, though, to clarify what is expected from the release (including which areas will need testing or which features are to be evaluated) and also which platforms are expected to be supported, probably based on feedback from pre-release packages.

Carlo
Re: [Ganglia-developers] changes to trunk, backports
On Mon, Aug 10, 2009 at 09:24:18PM +0100, Daniel Pocock wrote:
> I'm making some more changes to trunk over the next few days, some of them impact the build system (configure.in, Makefile.am).

You mean more than the ones that were already committed from r2017 to r2021? What are the changes you are planning to commit?

> I've been able to test on Linux (RHEL 4 and 5), Solaris 8 and 10 and Cygwin - I realise people using other platforms may encounter different results, so I will stagger most things by one or two weeks before backporting to the 3.1 branch.

I think that r2021 is either a fix or a good part of a fix for BUG16; feel free to reassign that bug to yourself and finish it ;)

> The last round of changes I made have been backported to the 3.1 branch today, hopefully this will allow more people to evaluate it before the next release.

It will probably be a good idea to document them in the STATUS file so they don't get lost for the future release notes.

Carlo
Re: [Ganglia-developers] Stack smashing in Linux gmond while reading long lines from /proc/mounts.
On Thu, Jul 30, 2009 at 10:33:03AM -0400, Jason A. Smith wrote:
> On Thu, 2009-07-30 at 10:29 +0000, Carlo Marcelo Arenas Belon wrote:
> > On Wed, Jul 29, 2009 at 03:42:05PM -0400, Jason A. Smith wrote:
> > > In gmond, the monitor-core/libmetrics/linux/metrics.c:find_disk_space() function, was not only using small character arrays, but the arrays for the sscanf after the fgets were smaller than the array for the line it just read in, which can lead to buffer overflows and the stack smashing problem that we were having.
> >
> > using fixed size arrays in the stack is never a good idea. in this case it could be theoretically possible to exploit this overflow with the help of a malicious NFS server (very unlikely though).
> >
> > > To fix our problem and prevent the overflows, I made a patch to increase the size of the arrays and also make each of the arrays used in the sscanf the same size as the line buffer used in fgets, so there is no chance of another overflow.
> >
> > committed for trunk in r2007, but the new implementation might also generate segfaults on its own due to stack overflows when running with very small stacks as it requires a bigger stack.
> >
> > IMHO it would be better to migrate this function to use getmntent and friends as it was done already for Cygwin, Solaris and the BSD and that way avoid the use of local buffers and parsing of the /proc/mounts file directly.
>
> This is a good idea, I didn't think to check the libmetric code from the other OSes. Besides a few minor differences for the remote_mounts and valid_mount_type functions and the seen_before part, the solaris and linux code look almost identical.

If I recall correctly, when I added disk metrics for Solaris it was indeed modeled after the linux code, except for the following differences (there was a thread started at the time that got nowhere):

* originally using GiB instead of GB (changed since)
* whitelist of filesystems instead of a blacklist
* using getmntent and friends
* avoid use of hashes, pointers or static buffers.

> Should the linux code be updated to look more like the solaris code?

It would probably be easier to change the implementation to use getmntent instead of parsing the /proc/mounts file by hand, which might also help to simplify the code further.

> Too bad it isn't possible to merge similar code instead copying it.

There is no reason why we can't (if needed) write the code in one single place and link it against the OS-specific metric code later (as shown by libmetrics/interface.c), but it is usually just easier to copy the code, as it will need to be adjusted to work properly for each supported OS (and different versions of it as well, with different ABI/API), and that way avoid having to obscure the logic with the portability constructs which would otherwise be required.

Considering, though, that getmntent is available in all the OSes that are implementing disk metrics AFAIK, this should probably be (as suggested originally) a secure, portable alternative IMHO.

Carlo
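To make the differences listed above a bit more concrete, here is a rough Python sketch of the approach: split each mount entry without fixed-size buffers, keep only whitelisted filesystem types, count each device once, and report sizes in GB. This is an illustration only and not the libmetrics code (which is C and would use getmntent); the whitelist contents and all function names are assumptions made for the example.

```python
import os

# Hypothetical whitelist; the real per-OS lists in libmetrics may differ.
LOCAL_FS_TYPES = {"ext2", "ext3", "ext4", "xfs", "ufs", "zfs"}


def parse_mounts(text):
    """Yield (device, mount_point, fs_type) from /proc/mounts-style text."""
    for line in text.splitlines():
        fields = line.split()  # no fixed-size buffers, any line length works
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        yield fields[0], fields[1], fields[2]


def local_mounts(text, allowed=LOCAL_FS_TYPES):
    """Return mount points on whitelisted filesystems, one per device."""
    seen = set()
    result = []
    for device, mount_point, fs_type in parse_mounts(text):
        if fs_type not in allowed or device in seen:
            continue  # remote/virtual filesystem, or device already counted
        seen.add(device)
        result.append(mount_point)
    return result


def total_disk_gb(mount_points, statvfs=os.statvfs):
    """Sum the size of the given mount points in GB (10**9 bytes)."""
    total = 0
    for mount_point in mount_points:
        st = statvfs(mount_point)
        total += st.f_blocks * st.f_frsize
    return total / 1e9


if __name__ == "__main__":
    with open("/proc/mounts") as handle:
        mounts = local_mounts(handle.read())
    print(mounts, total_disk_gb(mounts))
```

The `statvfs` parameter exists only so the size calculation can be exercised without touching real filesystems.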
Re: [Ganglia-developers] Replacing core metrics with Python metric modules
On Wed, Jul 22, 2009 at 04:02:01PM -0500, Martin Hicks wrote:
> I was wondering if is possible to write a Python metric module that could replace the core set of metrics that gmond usually collects on the compute node, and instead grab the data from PCP that is running on the head node.

This looks like a perfect place to use gmetric with spoofing instead (assuming that you are not loading the core metrics in gmond, or not even running gmond at all, if all you need is already coming from PCP).

> Are there any real differences between the metrics that are normally collected by gmond, and those user-defined metrics collected by a Python module?

Starting with 3.1, all metrics are the same.

Carlo
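For reference, the general shape of a gmond Python metric module is sketched below: gmond calls metric_init once to obtain a list of metric descriptors, calls each descriptor's call_back to collect a value, and calls metric_cleanup on shutdown. The metric name, the group, and the fetch_from_pcp helper are hypothetical placeholders (the helper just returns canned data, since how to query PCP is not covered in this thread).

```python
def fetch_from_pcp(metric_name):
    """Hypothetical stand-in for a query against PCP on the head node."""
    sample = {"pcp_load_one": 0.25}  # canned data, for illustration only
    return sample[metric_name]


def load_one_handler(name):
    """Callback gmond invokes to collect the metric called `name`."""
    return float(fetch_from_pcp(name))


def metric_init(params):
    """Called once by gmond; returns the list of metric descriptors."""
    descriptor = {
        "name": "pcp_load_one",        # hypothetical metric name
        "call_back": load_one_handler,
        "time_max": 90,
        "value_type": "float",
        "units": " ",
        "slope": "both",
        "format": "%f",
        "description": "One minute load average as reported by PCP",
        "groups": "pcp",               # hypothetical group name
    }
    return [descriptor]


def metric_cleanup():
    """Called once by gmond on shutdown; nothing to release here."""
    pass


if __name__ == "__main__":
    # Stand-alone check of the module outside gmond.
    for desc in metric_init({}):
        print(desc["name"], desc["call_back"](desc["name"]))
```

A module like this still runs inside the local gmond, which is one reason gmetric with spoofing may be the simpler fit when the data already lives on another host.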
Re: [Ganglia-developers] boottime and uptime (the saga continues)
On Fri, Jul 17, 2009 at 07:10:35AM -0700, Ken Teague wrote:
> I'm using Ganglia v3.0.3 on openSUSE 10.3 which came pre-configured on a Microway cluster. It's slightly modified to add their Microway Control stuff integrated which is basically a button from the Ganglia homepage which leads to their TriCom/NodeWatch thermal monitoring web page. As such, I don't think that the issue I'm having has anything to do with their customization, but I wanted this to be known beforehand in case the possibility exists.

hope that customization is only in the web frontend because a solution to your problem might require a patched gmond.

> The issue I'm facing is with incorrect boottime and uptime for my master and all slave nodes.

seems like BUG169 that was fixed in ganglia 3.1.2 and will be part of 3.0.8

http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=169

the problem, most likely, is that your /proc/stat is too big in the affected servers.

Carlo
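As an illustration of how an oversized /proc/stat could lead to a wrong boottime, the sketch below assumes that the file is read into a buffer that is too small, so the `btime` line near the end is never seen. That is a guess at the failure mode, not a description of the actual gmond code or of the BUG169 fix, and both function names are hypothetical.

```python
def find_btime(lines):
    """Return the boot time (seconds since the epoch) from /proc/stat-style lines, or None."""
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "btime":
            return int(fields[1])
    return None


def find_btime_truncated(text, limit):
    """Same lookup over only the first `limit` characters (simulates a too-small buffer)."""
    return find_btime(text[:limit].splitlines())


if __name__ == "__main__":
    # Synthetic /proc/stat: many per-CPU lines push "btime" far into the file.
    cpu_lines = ["cpu%d 1 2 3 4 5 6 7" % i for i in range(64)]
    sample = "\n".join(["cpu 64 128 192 256 320 384 448"] + cpu_lines + ["btime 1247839835"])
    print(find_btime(sample.splitlines()))
    print(find_btime_truncated(sample, 512))
```

Scanning line by line, without a fixed upper bound on the file size, finds the value regardless of how many per-CPU or interrupt entries precede it.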
Re: [Ganglia-developers] Building a ganglia interface into collectl
On Wed, Apr 01, 2009 at 11:01:41AM +0000, Seger, Mark wrote:
> I hear what you're saying about using the API, but collectl is a perl script

there are perl bindings for libganglia (at least the metric generation) in:

http://search.cpan.org/~hirose/Ganglia-Gmetric-XS-1.00/lib/Ganglia/Gmetric/XS.pm

beware though that you won't be able to use it with ganglia 3.1.x, and that it will essentially link a static version of ganglia's metric implementation into your code, but at least it will avoid having to generate binary packets and reverse engineer the protocol as it changes.

Carlo
Re: [Ganglia-developers] CVE
On Fri, Jan 23, 2009 at 08:52:45AM -0700, Brad Nicholes wrote:
> Are we finished hashing this whole patch out yet?

haven't seen many comments from other testers of the simplified patch, but considering that it has been included already in the 3.1.1 stable package from Gentoo x86, I'd assume it is hashed out already. Fedora and Debian are also testing patches for their packages AFAIK.

> Are we ready to apply the current patch to 3.1.2 and release or is there still more discussion going on?

guess it depends on how you define "current patch", as the backported patch still has one hunk that was originally meant for gmetad's proposed multi request feature, which is still under discussion and hasn't been committed yet (a second hunk was reverted already as it showed a regression in the web frontend while testing the proposed Fedora package update that was using it).

in any case, to avoid further delays (even if IMHO not ideal, but better than the current situation), committed the backported patch in r1959 for ganglia-3.1.

also committed r1960 to make the newly introduced feature (returning an empty response instead of the full tree if the request to the interactive port is invalid) consistent.

Carlo
Re: [Ganglia-developers] patches for: [Sec] Gmetadserver BoFandnetwork overload + [Feature] multiple requestsper connoninteractive port
On Sun, Jan 18, 2009 at 11:22:27AM +0800, Spike Spiegel wrote:
> the comment should be removed since the +1 is there:
>
> + /* +1 not needed as q-p is already accounting for that */
> + element = malloc(len + 1);

Committed revision 1950

> other than that looks good to me.

could you check the simplified one?, this problem was introduced in 2003 and therefore affects all versions of ganglia since then (including 2.5.7 which is not supported anymore and that will need to be patched by the users of it which include Debian/Ubuntu, Novell/OpenSuSE and probably others).

> Two things:
> 1) How has this been tested? I did some myself and got to wonder how you guys did it, do you have any standardized approach?

sadly there is no test suite associated with ganglia code and therefore there is no standardized approach other than applying the patch and banging the resulting binary to see if it works reliably.

> 2) you mention backports to 3.1 and then move on to 3.1.2, what about 3.0? Some of us (quite a few?) are still running 3.0 and afaik kostas already applied the patch to that branch and ran some tests (and so did I - and server.c hasn't changed for a long time so it should be indeed a safe operation)

it will be included in 3.0.8 as well.

Carlo
Re: [Ganglia-developers] patches for: [Sec] Gmetad server BoF and network overload + [Feature] multiple requests per conn on interactive port
On Tue, Jan 13, 2009 at 11:41:19PM +0800, Spike Spiegel wrote:
> === DoS attacks
> 1) Given REQUESTLEN=2048, and 3 characters to be the minimum to craft a valid and nonexistent path /x, with the above feature implemented it would be possible to trigger 2048/3 calls to process_path which would possibly lead to CPU overload.

this is not handled by any of the provided patches, but since processing is aborted as soon as the path is considered invalid, the depth of the path is not relevant for CPU or bandwidth utilization

> 2) extension to 1) - as it is ganglia returns the entire tree if an element is not found. with large trees 2048/3 requests could easily result in several GBs of data being transferred. Related to this if you look at gmetad/server.c lines 601:606 you'll see this:
>
> err_msg("Got a malformed path request from %s", remote_ip);
> /* Send them the entire tree to discourage attacks. */
> strcpy(request, "/");
>
> which leads to the same scenario as above.

the amount of data returned is not dependent on the depth of the path because it will always be the full XML tree (once).

> What I propose is that for both cases, malformed request and non existent items, we log an error and bail out. This would solve 2) and most of 1) making the call possibly exit far quicker.

the proposed solution will result in a truncated XML which will then fail to be parsed in the client and in an obscure error like "unable to write XML tree info".

agree that returning the whole tree isn't the best way to signal a syntax error, but returning a truncated XML will be more difficult to handle in the client side, as depending on the implementation used it will fail to even load with an exception.

because the connection to the client is getting severed when the request is malformed, it will also show strange errors like "unable to write root preamble (DTD, etc)" or "Connection reset by peer" in the client.

Carlo
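The client-side objection above (a truncated reply fails to load at all, while a well-formed one can at least be inspected) can be sketched as follows. The XML strings are a made-up miniature and not gmetad's real output, which carries a DTD and many more attributes; `count_clusters` is a hypothetical helper standing in for whatever the client does with the tree.

```python
import xml.etree.ElementTree as ET

# Made-up miniature replies, for illustration only.
FULL = ('<GANGLIA_XML VERSION="3.1.1" SOURCE="gmetad">'
        '<GRID NAME="g"><CLUSTER NAME="c"/></GRID></GANGLIA_XML>')
EMPTY = '<GANGLIA_XML VERSION="3.1.1" SOURCE="gmetad"></GANGLIA_XML>'
TRUNCATED = FULL[:40]  # what a client sees if the server bails out mid-document


def count_clusters(xml_text):
    """Parse a reply and count CLUSTER elements; return None if the XML cannot be parsed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    return len(root.findall(".//CLUSTER"))


if __name__ == "__main__":
    for label, reply in (("full", FULL), ("empty", EMPTY), ("truncated", TRUNCATED)):
        print(label, count_clusters(reply))
```

An empty but well-formed document parses and simply contains nothing, whereas the truncated one raises a parse error that every client has to handle explicitly.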
Re: [Ganglia-developers] patches for: [Sec] Gmetadserver BoFandnetwork overload + [Feature] multiple requestsper connoninteractive port
On Sun, Jan 18, 2009 at 09:53:32PM +0800, Spike Spiegel wrote:
> On Sun, Jan 18, 2009 at 7:35 PM, Carlo Marcelo Arenas Belon care...@sajinet.com.pe wrote:
>>> other than that looks good to me.
>> could you check the simplified one?, this problem was introduced in 2003 and therefore affects all versions of ganglia since then (including 2.5.7 which is not supported anymore and that will need to be patched by the users of it which include Debian/Ubuntu, Novell/OpenSuSE and probably others).
> apologies but I lost you there, what do you mean with the simplified one?

http://bugzilla.ganglia.info/cgi-bin/bugzilla/attachment.cgi?id=188&action=view

that should apply cleanly to 3.1.1, 3.0.7 and 2.5.7

> Is that what you meant when you said banging the resulting binary?

partially; scripts would be able to help only after the testing parameters had been defined, and at least for this test might be limited by the fact that the interactive port is mainly used by the web frontend.

Carlo
Re: [Ganglia-developers] cygwin build stuck on libintl
On Fri, Jan 09, 2009 at 01:41:09PM -0000, daniel.poc...@barclayscapital.com wrote:
> One thing I notice about README.WIN is that it doesn't tell me which sections of the Cygwin setup to look in for each dependency, I realise some of them are obvious, but it would save time listing them for those of us who don't like clicking around the Cygwin setup GUI:
>
> apr1
> expat
> diffutils (Utils)
> python (Python)
> sharutils (Archive)
> sunrpc (Libs)
> bison (Devel)
> flex (Devel)
> libtool (Devel)

Committed revision 1943

> If there is a way to run the Cygwin setup.exe and request automatic installation of these packages from the command line, that would be useful to show in README.WIN

not that I am aware of, but using the Full View allows you to get an alphabetically ordered list of packages which helps.

> configure gets stuck on libconfuse (libintl dependency issue)
>
> $ cd confuse-2.6
> $ make clean && ./configure && make && make install
> builds successfully
> $ cd ../ganglia-trunk
> $ ./bootstrap
> $ ./configure --with-libconfuse=/usr/local --enable-static-build
> various messages
> Checking for confuse
> Added -I/usr/local/include to CFLAGS
> Added -L/usr/local/lib to LDFLAGS
> checking for cfg_parse in -lconfuse... no
> Trying harder including gettext
> checking for cfg_parse in -lconfuse... no
> Trying harder including iconv
> checking for cfg_parse in -lconfuse... no
> libconfuse not found

all these extra checks were added to work around problems in libconfuse when using a gettext implementation external to libc, but they are obviously not able to work around cygwin if NLS is enabled. since this is a problem in libconfuse, better disable NLS support in it as it is not needed anyway.

> libiconv, gettext and gettext-devel are definitely installed in Cygwin

that is the problem. updated the documentation to explicitly disable NLS support even if those packages (which are not in the list of what is needed for this reason) were installed and found at configure time.

Carlo
Re: [Ganglia-developers] Ganglia web cluster/grid name patch.
On Fri, Jan 09, 2009 at 04:29:34PM -0500, Jason A. Smith wrote:
> Here is another patch.

Committed revision 1940. Made some slight changes to keep the indentation currently used in the affected files and to convert tabs into spaces.

Carlo
Re: [Ganglia-developers] Web templates subversion access.
On Sat, Jan 10, 2009 at 06:03:42PM -0500, Jason A. Smith wrote:
> On Sat, 2009-01-10 at 11:25 +0000, Carlo Marcelo Arenas Belon wrote:
>> On Thu, Jan 08, 2009 at 04:13:13PM -0500, Jason A. Smith wrote:
>>> Recently I started testing the svn version of the web scripts and found a few bugs.
>> could you elaborate on the bug this patch is fixing?, the code from trunk and the other 2 active branches seems similar enough to consider this a problem also outside of trunk.
> I have not tested nor looked at any other trunks unfortunately. I have a working 3.0.7 installation, and this problem does not exist there.

AFAIK, it doesn't exist in trunk either as I am unable to reproduce it. could it be that some patch you are adding is triggering the behaviour?

>>> Fixed graph zooming and make sure the default summary graph size overrides the size selected for the cluster graphs.
>> graphargs should contain (in the host view) all host specific parameters that are needed to create the graphs like h (hostname), r (range) and st (start time) but doesn't have z (size) and therefore by moving it to the beginning of the generated URL all it changed is the order used, at least for the report graphs which are the ones that are being patched.
> The problem I noticed was mainly in the cluster view, when selecting a specific metric to look at or the size, it would also change the size of the top summary graphs in addition to the lower host graphs. I assume this is an unintended consequence of probably a few patches interacting together since previous versions of ganglia did not do this, but I did not bother to track it down.

if a z variable ever gets into graphargs that would most likely break the code the way you describe, but as I mentioned before it doesn't seem to be doing that, or at least it doesn't seem to be doing that in a clean checkout from trunk that I'd been testing for this bug.

> I think the problem also occurred in the host view, but I don't really remember. The meta view already had the graph variable placed first in the arg list, so patching in this way also makes all three main graph views work the same. The only change, as you say is the order of the graphargs to force the medium size to override what is in the variable.

I agree that the way it is coded is fragile and your patch is just making all the references more consistent by changing the order, but as I argue below, I think that relying on the variable being overridden and on the order of the variables being significant (as your patch suggests) is the wrong approach to solving the problem. In any case, at least for now, Committed revision 1942.

>> for the metrics graph, the order does (sadly) make a difference as the zoom relies on having z redefined to large through the template, but the patch doesn't apply to that section of the code.
> I am not sure what you mean here, the patch does apply to the zoom feature, since it does touch the graph image links also. In addition to the summary graphs at the top being affected, I noticed that zooming was also broken.

I was commenting on the way the template for the host view is constructed and on the fact that, for the metric graphs, the position of graphargs is important, as it carries z=medium as part of graphargs and then relies on the template to override that with z=large for the zoom to work. Your patch changes the way the URLs for the report graphs are generated so that it matches the way the ones for the metric graphs are, but doesn't change the code there as it is already working, even if in a hacky way.

>> could we instead remove the hardcoded values and manage the URLs in a way that makes them not dependent on the order of the parameters so that variables are overridden?
> Possibly, this was just the easiest fix that I thought of though, since it keeps the graph args variable in the list, so they can share things like the time range, so they don't have to be managed separately, just the order was changed, to act like an override.

In host_view.php:54, graphargs is defined for each metric to have a hardcoded size of medium, and that is then overridden in the HREF through the template so that the link used when the graph is clicked has:

c=$cluster&h=$hostname...z=medium...z=large

My argument was that it would IMHO be better if the graph size were configurable, and therefore both sizes for the graph detached from the graphargs variable, so that the code won't have to rely on a variable being overridden or on the order of the arguments as it does now.

Carlo
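The order dependence discussed above can be sketched outside the PHP code. The Python below is an illustration only, not the frontend's implementation: `effective_params` mimics how a duplicated query parameter is resolved on the receiving side (the last occurrence wins, as with PHP's `$_GET`), and `graph_url` is a hypothetical helper showing the proposed alternative of setting the size explicitly so that no parameter ever needs to be overridden.

```python
from urllib.parse import parse_qsl, urlencode


def effective_params(query):
    """Resolve duplicate keys the way PHP's $_GET does: the last occurrence wins."""
    return dict(parse_qsl(query))


def graph_url(base_args, size):
    """Build a graph query string with exactly one z parameter, whatever base_args holds."""
    params = {k: v for k, v in base_args.items() if k != "z"}
    params["z"] = size
    return urlencode(params)


if __name__ == "__main__":
    # Order decides the size when z appears twice.
    print(effective_params("c=x&h=n1&z=medium&z=large")["z"])
    print(effective_params("z=large&c=x&h=n1&z=medium")["z"])
    # No duplicate, so no dependence on order.
    print(graph_url({"c": "x", "h": "n1", "z": "medium"}, "large"))
```

With the size kept out of the shared arguments, the report, metric and summary graphs can each request their own size without caring where graphargs is placed in the URL.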
Re: [Ganglia-developers] Web templates subversion access.
On Thu, Jan 08, 2009 at 04:13:13PM -0500, Jason A. Smith wrote:
> Recently I started testing the svn version of the web scripts and found a few bugs.

could you elaborate on the bug this patch is fixing?, the code from trunk and the other 2 active branches seems similar enough to consider this a problem also outside of trunk.

> Fixed graph zooming and make sure the default summary graph size overrides the size selected for the cluster graphs.

graphargs should contain (in the host view) all host specific parameters that are needed to create the graphs like h (hostname), r (range) and st (start time) but doesn't have z (size) and therefore by moving it to the beginning of the generated URL all it changed is the order used, at least for the report graphs which are the ones that are being patched.

for the metrics graph, the order does (sadly) make a difference as the zoom relies on having z redefined to large through the template, but the patch doesn't apply to that section of the code.

could we instead remove the hardcoded values and manage the URLs in a way that makes them not dependent on the order of the parameters so that variables are overridden?

Carlo
Re: [Ganglia-developers] problems with cygwin build
On Sun, Jan 04, 2009 at 06:17:56PM -0800, Jacob Gladish wrote:
> I'm trying to build the trunk from svn and getting errors generating the configure script.

there is a file with instructions called README.SVN and a script that does the bootstrapping for you. haven't checked for a while in cygwin but it should work fine; if it doesn't you might want to do the bootstrapping in Linux and generate a release package instead.

> It looks like this is some autoconf version issue. Am I missing something?

most likely automake

Carlo
Re: [Ganglia-developers] [Ganglia-svn] SF.net SVN: ganglia:[1925] trunk/monitor-core
On Mon, Dec 01, 2008 at 12:19:25PM -0700, Brad Nicholes wrote:
> I don't understand what you are trying to do with this patch.

as explained in the commit message (for which I apologize if it wasn't clear enough), it is correcting the definition of modules so it is correctly tagged as showing multiple times in the configuration as per:

http://www.nongnu.org/confuse/manual/confuse_8h.html#a3

of course, I also adjusted all in tree code to use the correct syntax to retrieve all module definitions and that way avoid any problems.

> Once libconfuse has finished reading and parsing the entire configuration file along with all of the individual modules sections, it automatically consolidates them into a single section.

since the modules section contains only other subsections (in this case the section module), which are defined as showing multiple times, all those subsections will be linked to the first section created, which will then be accessible with a call to cfg_getsec("modules").

the problem with that, of course, is that we are then relying on an unintended side effect of how the configuration structure is being created, and that will break if another non section configuration is added later.

I'd also argue that if all modules sections are to be collapsed anyway, wouldn't it be better to get rid of the modules configuration and just list all modules as part of the root?

> There is no need to try to scan individual modules sections.

if the configuration is defined to be shown multiple times, then a call to cfg_getsec will only get one of the instances.

> This code was working correctly as it was. Please revert this patch.

seems it was reverted already in r1931, so added some documentation of the latent problems in r1933 until the compatibility issues raised could be resolved. either implementation will work for the current setup but if you are to reconsider don't forget to revert r1933 as well.

fixing any external module that would have problems looking at the module list (most likely useful for script handlers) shouldn't be that difficult IMHO.

Carlo
Re: [Ganglia-developers] cpu_report and Windows/Cygwin nodes
On Thu, Nov 13, 2008 at 08:57:13AM -0000, [EMAIL PROTECTED] wrote:
> We've tweaked some of the reports on an older version of the ganglia-web package so that they work with Windows nodes.

you mean a broken image?, just a warning in the web log because the rrdgraph command is somehow broken by the missing data, or the fact that the report is bogus?

> For example, cpu_report needs to test if the file cpu_nice.rrd exists, and use a modified rrdgraph command if there is no cpu_nice.rrd

cpu_nice should exist and be reported in windows (even if it might be bogus or a 0 value like in load)

> I was about to start updating these patches to trunk, but I'm curious about what else is going on in this area - is any similar work in the pipeline? Is there any imminent plan to overhaul this part of the code, making such work un-necessary?

All work we do is publicly available in trunk, and the only work that has been done AFAIK was released with 3.1.0 and was the modular graph framework.

Reporting always assumes that all metrics are available for all nodes and that they are all relevant and somehow equivalent, and as you mentioned before that wasn't really a safe assumption considering multiple platforms, the introduction of platform specific metrics and custom metrics in 3.0 and now, with 3.1, the introduction of modular pluggable metrics.

As you said there is IMHO the imminent need for an overhaul of this part of the code and your proposal might be at least a start.

Carlo
Re: [Ganglia-developers] spec file from trunk, jscalendar
On Wed, Nov 12, 2008 at 05:07:53PM -0000, [EMAIL PROTECTED] wrote:
> Just a quick note, ganglia.spec.in needs to have calendar.php added to the list of files for web, see below:

Committed revision 1921

> Also, what is the plan for packaging jscalendar? Should it be packaged independently and deployed elsewhere, with a symbolic link created by the ganglia-web RPM, or should this code become part of Ganglia SVN, as proposed for other third party dependencies?

Timothy's proposal made it optional, so it might make more sense to have it detached from the ganglia web frontend RPM. This will also help with the already convoluted licensing mix that the web frontend code has, as jscalendar is LGPL. TemplatePower is GPLv2+ and the rest would seem to be MIT, so adding LGPL to the mix might not be the best way to help some user/distribution figure out what their legal rights are.

Carlo

PS. the glue code with jscalendar could be made a little more robust as well, from what I recall when reviewing this code originally for inclusion.
Re: [Ganglia-developers] ganglia-web: custom start and finish times in trunk
On Wed, Nov 12, 2008 at 02:26:22AM +0000, Jesse Becker wrote:
> Anecdotally, I can say that the web frontend for trunk should work with 3.1.x

considering that the XML interface between the web frontend and gmetad hasn't changed, it should work even with a 2.5 gmetad or older (up to maybe the introduction of the interactive port)

> actually may even work with 3.0.x. This isn't really supported though.

if by supported you mean that it is not being tested, then I agree; but there shouldn't be any reason to recommend against running it in that configuration, especially considering that there should be an upgrade path to 3.1 from 3.0 and 2.5 that will require this interface to be stable.

Carlo
Re: [Ganglia-developers] 3.1.x branch whitespace cleanup for gmond.conf
On Wed, Nov 05, 2008 at 04:21:57PM -0800, Bernard Li wrote:
> While trying to upgrade my current 3.1.0 installation of Ganglia with the 3.1.1.1901 + spoofing RPMs I just built, I noticed that gmond.conf was created as gmond.conf.rpmnew. This doesn't surprise me much, as I know that there were minor configuration changes between 3.1.0 and 3.1.x branch.

there were changes from 3.1.0 to 3.1.1 because of the removal of the module path as described in the release notes, and also changes from 3.1.1 to 3.1.2 because of the removal of support for clusterless gmond and the addition of the option to filter out the extra metric metadata.

> However, when I ran diff, the result was that the *entire* file was completely different.

diff -b (assuming the version of diffutils from CentOS 4 supports that) will let you see the relevant changes that will need porting.

> I double checked this and found out it is the same between 3.1.1 and 3.1.1.1901 (i.e. the entire 3.1.1 gmond.conf file was different from 3.1.1.1901 gmond.conf file). Turns out, it was because of this recent backport in 3.1.x branch:
>
> http://ganglia.svn.sourceforge.net/viewvc/ganglia?view=rev&revision=1882
>
> Is this really necessary? While in general I am in favour of whitespace cleanup, in this case I think since it impacts usability, I think we should punt this until later.

later will just defer the problem, and since every release will most likely have configuration changes which will require updates, you are not solving anything but just deferring this.

if the concern is really about usability then `gmond -r` could be used if extended to aid in configuration migrations, but considering that functionality was broken in 3.1 until recently (fixes will be released with 3.1.2) for migrations from 2.5, and migrations from 3.0 were being handled by some hacky patch and manual steps, all that should be needed in that case will be to update the procedures to use the right flags.

Carlo
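To show why a whitespace-insensitive comparison isolates the configuration changes that actually need porting, here is a rough Python approximation of what `diff -b` does. It is a sketch under the assumption that collapsing runs of whitespace and dropping trailing whitespace is close enough for this purpose; it is not a reimplementation of GNU diff, and `normalize` and `meaningful_diff` are hypothetical names.

```python
import difflib
import re


def normalize(line):
    """Collapse whitespace runs to one space and drop trailing whitespace (roughly `diff -b`)."""
    return re.sub(r"\s+", " ", line.rstrip())


def meaningful_diff(old_text, new_text):
    """Return the unified-diff lines that remain after whitespace normalization."""
    old = [normalize(line) for line in old_text.splitlines()]
    new = [normalize(line) for line in new_text.splitlines()]
    return list(difflib.unified_diff(old, new, "old", "new", lineterm=""))


if __name__ == "__main__":
    # Tabs and trailing blanks in the old file, spaces in the new one.
    old = "globals {\n\tdaemonize = yes   \n\tmute = no\n}\n"
    new = "globals {\n  daemonize = yes\n  mute = yes\n}\n"
    print("\n".join(meaningful_diff(old, new)))
```

Only the `mute` line shows up as changed in the example, even though every indented line differs byte for byte between the two files.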
Re: [Ganglia-developers] Comment on gmetric backport proposal in 3.1.x branch
On Fri, Nov 07, 2008 at 02:52:45PM -0800, Bernard Li wrote:
> On Fri, Nov 7, 2008 at 2:41 PM, Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote:
>>> Care to explain what your comment means?
>> are you serious?, if so will take some time later in the night to reply with
> Yes I am. Otherwise I wouldn't send the email to begin with.

I see, and it seems you keep confusing my email address with the one from the ganglia developers list.

considering how many people are going to have to read through this email or at least delete it, even if it wasn't directed to them, it seems like a waste of our collective energy, so I apologize in advance to everyone for this reply.

> In case you need clarification on my question, I am confused about this:
>
> patched generated files
>
> Did you mean:
>
> The patch generated files

no.

> or
>
> Generated files were patched

yes; so it seems you were not that confused after all.

> Please explain what these generated files are.

a generated file is a file that is generated through some process (like a compiled binary is generated by processing source code with a compiler).

in this case the files you patched (gmetric/cmdline.h, gmetric/cmdline.c, mans/gmetric.1) are not meant to be patched directly because they are generated through a process, as is clearly explained in the header of them all:

$ head gmetric/cmdline.h
/** @file cmdline.h
 * @brief The header file for the command line option parser
 * generated by GNU Gengetopt version 2.22
 * http://www.gnu.org/software/gengetopt.
 * DO NOT modify this file, since it can be overwritten
 * @author GNU Gengetopt by Lorenzo Bettini */

$ head gmetric/cmdline.c
/*
  File autogenerated by gengetopt version 2.22
  generated with the following command:
  /usr/local/bin/gengetopt --input ./cmdline.sh

  The developers of gengetopt consider the fixed text that goes in all
  gengetopt output files to be in the public domain:
  we make no copyright claims on it.
*/

and my favorite:

$ head mans/gmetric.1
.\" DO NOT MODIFY THIS FILE!  It was generated by help2man 1.36.
.TH GMETRIC "1" "March 2008" "gmetric" "User Commands"
.SH NAME
gmetric \- manual page for Ganglia Custom Metric Utility
.SH SYNOPSIS
.B gmetric
[\fIOPTIONS\fR]...
.SH DESCRIPTION
The Ganglia Metric Client (gmetric) announces a metric on the list of defined send channels defined in a configuration file

the cleanup I refer to is to ensure that the changes are done in the original sources so that your changes won't be blown away next time, and then to regenerate the files as it was meant to be.

will see also if some Makefile rules could be added so rebuilding them is as simple as making an RPM, so that we could prevent this kind of issue and waste of energy in the future.

definitely not bad for being your first commit ever to gmetric, so feel free to backport it, so our users will be able to use it instead of getting an answer like:

http://www.mail-archive.com/[EMAIL PROTECTED]/msg04159.html

which for some strange reason never had a reply and got Filippo in the right direction as he had no more problems with ganglia since.

Carlo
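One way to catch this kind of mistake before committing is to look for the warning markers quoted above in the first lines of a file. The helper below is a hypothetical sketch and not part of the Ganglia tree; the marker list only covers the three headers shown in this message, and `looks_generated` is an invented name.

```python
import re

# Markers taken from the three file headers quoted in the message above.
GENERATED_MARKERS = re.compile(
    r"do not modify this file|autogenerated by gengetopt|generated by help2man",
    re.IGNORECASE,
)


def looks_generated(text, head_lines=10):
    """Return True if the first few lines of a file carry a 'generated file' marker."""
    head = "\n".join(text.splitlines()[:head_lines])
    return bool(GENERATED_MARKERS.search(head))


if __name__ == "__main__":
    header = "/** @file cmdline.h\n * DO NOT modify this file, since it can be overwritten\n */"
    print(looks_generated(header))
    print(looks_generated("#include <stdio.h>\n"))
```

A check like this only flags the file; the change itself still has to be made in the source the file is generated from (for gmetric, the gengetopt input named in the header) and the output regenerated.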
Re: [Ganglia-developers] Spoofing patch...
On Tue, Nov 04, 2008 at 06:13:47PM -0700, Brad Nicholes wrote:
> Carlo, In the STATUS file you commented that the spoofing patch needs more work.

my comment was about the spoofing feature needing more work in 3.1, as you pointed out below. the patch just makes the problem bigger by adding several changes (more than 500 lines) on top of an implementation that has open regressions and that will need to be cleaned up first IMHO.

> Can you explain what work needs to be done. Other than supporting the short commandline for spoofing a heartbeat, AFAICT everything is working as it should.

current 3.1, without the additional patch, is missing the heartbeat spoofing support as you pointed out, and is also exporting the spoofed data in the XML, which even if mostly harmless is a change of behaviour and should be cleaned up as well (as it really adds no value and is inconsistent).

as for 3.1 with the patch added, there are the following problems:

* the added calls to toupper() in libgmond could result in undefined behaviour on platforms where char is signed and toupper is implemented using an array lookup (NetBSD, and probably HPUX and AIX)
* the proposed patch has some additional patches that will need to be added on top of it (some of them already proposed and approved, like the modpython fixes, but some others still not even proposed)

> I have committed a spoofing example python module in trunk that spoofs cpu_util, boottime, heartbeat and osname for three imaginary machines. This spoof example module runs under both trunk and the patched 3.1.x.

Sadly I haven't been able to make it work, with `gmond -m` showing:

  # gmond -m
  (module python_module)
  load_one One minute load average (module load_module)
  ...

and python modules in general not working anymore (linked against amd64's python 2.5.2 in Linux).

the C interface seems to work fine, at least in 3.1 (in trunk it messes up GMOND_STARTED as explained before).

> I have also created a patch for sending a heartbeat through the shortened commandline which has been proposed as a follow-on backport.

saw that; not sure about the needed dependency on the modular spoofing support, as IMHO a change to gmetric should be independent of that.

> If there is nothing else missing, can we get this one backported?

with the availability of a proposed backport patch that fixes the conflicts and an example python spoofing module it should be easier to do so, but as I pointed out before I am not yet sure enough to stamp a vote on it. I do agree we should get this patch/feature released with the next release.

Carlo

PS. can the full list of patches needed for the backport be added to the list? I suspect r1615 is missing, as that should be required for r1622, which I added and which is included in your consolidated patch
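The toupper() concern in the first bullet above can be illustrated with a small sketch. The C standard requires the argument of toupper() to be either EOF or a value representable as unsigned char, so passing a plain char that happens to be negative is undefined behaviour. The helper name `upcase_in_place` is hypothetical and is not taken from libgmond; it only shows the usual cast that avoids the problem.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Upper-case a NUL-terminated string in place.
 * Hypothetical helper for illustration, not code from libgmond.
 * The cast to unsigned char keeps every argument passed to toupper()
 * inside the range the C standard allows, even where char is signed. */
char *upcase_in_place(char *s)
{
    assert(s != NULL);
    for (char *p = s; *p != '\0'; p++)
        *p = (char)toupper((unsigned char)*p);
    return s;
}
```

Calling `upcase_in_place()` on a writable buffer such as `char name[] = "spoof";` leaves the buffer holding the upper-cased text; bytes with the high bit set are passed through toupper() safely instead of indexing a lookup table with a negative value.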
Re: [Ganglia-developers] backport proposal: mcast_if support in gmond (BUG140)
On Tue, Oct 28, 2008 at 12:42:28PM -0600, Brad Nicholes wrote:
> * libganglia: mcast_if support in gmond (BUG140)
>   http://www.mail-archive.com/ganglia-developers@lists.sourceforge.net/msg03775.html
>   http://ganglia.svn.sourceforge.net/viewvc/ganglia?view=rev&revision=1734
>   +1: carenas
>   -1: bnicholes
>   bnicholes: The patch appears to be dependant on rev. 1478 in trunk which is not included in the backport proposal.
>   carenas: no, otherwise would had been included; see BUG140 for patch
>
> Carlo, I don't understand your response.

Agreed, it is a little concise, as it was constrained by being in the STATUS file, which is again a reason why I suggested we discuss the patches on the list instead, as email is a far more expressive medium and IMHO better suited to interactive discussions than commits to a file.

Thanks for bringing this to the list, so we will hopefully have a faster way to come to a conclusion on this issue.

> Patch 1734 makes a call to the function join_mcast() which doesn't exist unless patch 1478 is also backported.

no; that is an svn merge problem which results in a conflict because, as you said, in trunk that function has been renamed.

the code that patch 1734 itself adds doesn't call or depend on that function at all, so the resolution for that conflict is to keep the intended change from the patch and preserve the function name (which is actually from a nearby function, even if svn seems to think otherwise).

to simplify testing, and to provide a patch that can be applied directly to a 3.1 branch and that has the conflict already resolved, BUG140 was updated with a patch file. the attachment itself can be retrieved from:

  http://bugzilla.ganglia.info/cgi-bin/bugzilla/attachment.cgi?id=176&action=view

> So if you backport r1734 without also backporting r1478, the gmond code will be left unbuildable with an unresolved function call to a non-existent join_mcast() function. Apparently r1478 renamed this function from mcast_join() to join_mcast().

that shouldn't be an issue if using the patch provided, which has the right resolution for the svn merge conflict and avoids pulling in this other unrelated change, which is actually part of another backport proposal that is not meant for this merge window at least.

Carlo
Re: [Ganglia-developers] spoofing heartbeats with gmetric broken in 3.1
On Tue, Oct 28, 2008 at 12:25:18PM -0600, Brad Nicholes wrote:
> On 10/28/2008 at 3:39 AM, in message [EMAIL PROTECTED], Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote:
> > 2) is spoofing healthchecks really needed?, considering that the last update from the spoofed host will be updated anyway by the metric report?

the use of healthcheck here is incorrect; the issue is about heartbeats, as detailed in the subject.

> The health check needs to be there mainly so that the heartbeat metric shows up for the spoofed box in the XML.

every time a spoofed metric is sent, the REPORTED value for the host that is being spoofed will be updated, which is AFAIK the whole point of the heartbeat message anyway.

the REPORTED value is tied to the host and not to any specific metric, which is why the term "heartbeat metric" doesn't really fit a model where the spoofing is a METRIC attribute instead.

> Once the module spoofing functionality has been accepted for backport, I have an example python module that spoofs the base information such as heartbeat, location, boottime, etc. By just adding this module, you get all of the spoofed base metrics.

interesting; could that be the reason why, while testing gmetric spoofing in trunk, the GMOND_STARTED value was apparently getting updated? haven't yet tracked that bug, as I wanted to focus on the 3.1 code first, but that is also a regression, as it will prevent anyone from identifying which hosts are being spoofed through it (which was one of Yemi's concerns when this was introduced around 3.0.4).

> > 3) even if using some METADATA with the metric code to indicate the SPOOF_HOST between gmetric and gmond is that EXTRA_ELEMENT needed in the gmond XML?
>
> I'm not sure I understand the question. The EXTRA_ELEMENT XML tag is used because spoofing is an extension to the standard metric data just like TITLE and GROUP.

right; before, there was no XML interface because spoofing was being done at the XDR level, but my concern was about why the EXTRA_ELEMENT for SPOOF_HOST is visible in the XML exported from gmond when it has already been processed and is therefore redundant.

it is also strange IMHO that SPOOF_HEARTBEAT doesn't show up, if the intention was to keep those EXTRA_ELEMENT entries in gmond after they were processed.

> The only way to do without the EXTRA_ELEMENT tag would be to rework the standard tags to include some kind of spoofing attribute.

there are already standard XDR tags for spoofing (at least in 3.0) which could most likely be reused for this if needed, but then I am confused about the rationale for using the EXTRA_ELEMENT tag in 3.1 instead, if using XDR was still possible after the XDR refactoring.

Carlo
[Ganglia-developers] spoofing heartbeats with gmetric broken in 3.1
Greetings,

while looking at the spoofing code in gmetric I noticed that the implementation by Yemi to send a spoofed heartbeat by running:

  gmetric -S ${IP}:${NAME} -H

is now considered invalid (since r882), and the implementation in trunk is dropping the metric if -H is used to indicate a healthcheck should be spoofed with:

  gmetric -n ${METRIC} -v ${VALUE} -t ${TYPE} -S ${IP}:${NAME} -H

considering this is a regression (even if I have to admit I am not sure how serious, as spoofing is not something I've used other than for testing its code), it would be great if someone who knows this feature better could answer the following:

1) other than breaking some scripts by no longer supporting the format used in 3.0, is the longer format needed by 3.1 sufficient? (except of course that it is odd to drop the information about the metric used just because -H is also included, as in trunk)

2) is spoofing healthchecks really needed?, considering that the last update from the spoofed host will be updated anyway by the metric report?

3) even if using some METADATA with the metric code to indicate the SPOOF_HOST between gmetric and gmond, is that EXTRA_ELEMENT needed in the gmond XML?

Carlo
Re: [Ganglia-developers] backport proposal for graph statistics, bugzilla 206
On Sat, Oct 18, 2008 at 01:14:32PM -0400, Jesse Becker wrote: Just a note that I've added a backport proposal for bugzilla ID#206 into the 3.1.x branch STATUS file. great, since Timothy did most of the work for it, I am sure he will be interested in commenting about it so inlining the path below for discussion I've consolidated several patches from trunk into a single patch, and posted it. Please review, test and vote. was just looking at that and I have to admit I am not sure why a consolidated patch that diverts from trunk will be needed. couldn't just a merge from all relevant patches in trunk be used for backport?, if there are few minor textual changes why not include them as well to avoid having later conflicts when trying to merge further stuff from trunk? if the changes that were skipped were not good for 3.1, then they are not good for trunk either and that could be fixed with further patches in trunk than then could be added to the list of patches from 3.1 for backport. Carlo --- Index: graph.d/mem_report.php === --- graph.d/mem_report.php (revision 1868) +++ graph.d/mem_report.php (working copy) @@ -30,30 +30,44 @@ $rrdtool_graph['vertical-label'] = 'Bytes'; $rrdtool_graph['extras'] = '--rigid --base 1024'; -$series = DEF:'mem_total'='${rrd_dir}/mem_total.rrd':'sum':AVERAGE -.CDEF:'bmem_total'=mem_total,1024,* -.DEF:'mem_shared'='${rrd_dir}/mem_shared.rrd':'sum':AVERAGE -.CDEF:'bmem_shared'=mem_shared,1024,* -.DEF:'mem_free'='${rrd_dir}/mem_free.rrd':'sum':AVERAGE -.CDEF:'bmem_free'=mem_free,1024,* -.DEF:'mem_cached'='${rrd_dir}/mem_cached.rrd':'sum':AVERAGE -.CDEF:'bmem_cached'=mem_cached,1024,* -.DEF:'mem_buffers'='${rrd_dir}/mem_buffers.rrd':'sum':AVERAGE -.CDEF:'bmem_buffers'=mem_buffers,1024,* - .CDEF:'bmem_used'='bmem_total','bmem_shared',-,'bmem_free',-,'bmem_cached',-,'bmem_buffers',- -.AREA:'bmem_used'#$mem_used_color:'Memory Used' -.STACK:'bmem_shared'#$mem_shared_color:'Memory Shared' -.STACK:'bmem_cached'#$mem_cached_color:'Memory Cached' 
-.STACK:'bmem_buffers'#$mem_buffered_color:'Memory Buffered' ; +$fmt = '%.1lf'; +$series = 'DEF:mem_total=${rrd_dir}/mem_total.rrd:sum:AVERAGE' +. 'CDEF:bmem_total=mem_total,1024,*' +. 'DEF:mem_shared=${rrd_dir}/mem_shared.rrd:sum:AVERAGE' +. 'CDEF:bmem_shared=mem_shared,1024,*' +. 'DEF:mem_free=${rrd_dir}/mem_free.rrd:sum:AVERAGE' +. 'CDEF:bmem_free=mem_free,1024,*' +. 'DEF:mem_cached=${rrd_dir}/mem_cached.rrd:sum:AVERAGE' +. 'CDEF:bmem_cached=mem_cached,1024,*' +. 'DEF:mem_buffers=${rrd_dir}/mem_buffers.rrd:sum:AVERAGE' +. 'CDEF:bmem_buffers=mem_buffers,1024,*' +. 'CDEF:bmem_used=bmem_total,bmem_shared,-,bmem_free,-,bmem_cached,-,bmem_buffers,-' +. 'AREA:bmem_used#$mem_used_color:Used' +. 'GPRINT:bmem_used:AVERAGE:$fmt%S' +. 'STACK:bmem_shared#$mem_shared_color:Shared' +. 'GPRINT:bmem_shared:AVERAGE:$fmt%S' +. 'STACK:bmem_cached#$mem_cached_color:Cached' +. 'GPRINT:bmem_cached:AVERAGE:$fmt%S\\l' +. 'STACK:bmem_buffers#$mem_buffered_color:Buffered' +. 'GPRINT:bmem_buffers:AVERAGE:$fmt%S' ; + if (file_exists($rrd_dir/swap_total.rrd)) { -$series .= DEF:'swap_total'='${rrd_dir}/swap_total.rrd':'sum':AVERAGE -.DEF:'swap_free'='${rrd_dir}/swap_free.rrd':'sum':AVERAGE -.CDEF:'bmem_swapped'='swap_total','swap_free',-,1024,* -.STACK:'bmem_swapped'#$mem_swapped_color:'Memory Swapped' ; +$series .= 'DEF:swap_total=${rrd_dir}/swap_total.rrd:sum:AVERAGE' +. 'DEF:swap_free=${rrd_dir}/swap_free.rrd:sum:AVERAGE' +. 'CDEF:bmem_swapped=swap_total,swap_free,-,1024,*' +. 'STACK:bmem_swapped#$mem_swapped_color:Swapped' +. 'GPRINT:bmem_swapped:AVERAGE:$fmt%S\\g' +. 'CDEF:bswap_total=swap_total,1024,*' +. 'GPRINT:bswap_total:AVERAGE:/$fmt%S\\g' +. 'CDEF:swap_util=swap_total,swap_free,-,swap_total,/,100,*' +. 'GPRINT:swap_util:AVERAGE: ($fmt%%)\\l' ; } -$series .= LINE2:'bmem_total'#$cpu_num_color:'Total In-Core Memory' ; +$series .= 'LINE2:bmem_total#$cpu_num_color:Total In-Core' ; +$series .= 'GPRINT:bmem_total:AVERAGE:$fmt%S' +. 
'CDEF:util=bmem_total,bmem_free,-,bmem_total,/,100,*' +. 'GPRINT:util:AVERAGE:($fmt%% Real Memory Used)\\l' ; $rrdtool_graph['series'] = $series; Index: graph.d/load_report.php === --- graph.d/load_report.php (revision 1868) +++ graph.d/load_report.php (working copy) @@ -19,7 +19,7 @@ $hostname = strip_domainname($hostname); } -
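The CDEF expressions in the patch above are written in RRDtool's reverse Polish notation, so `swap_total,swap_free,-,swap_total,/,100,*` reads as (swap_total - swap_free) / swap_total * 100. As a worked illustration only, the same arithmetic can be written in infix form in C; the function name `swap_util_percent` is hypothetical and the zero check is an assumption of this sketch, not something the patch does.

```c
#include <assert.h>

/* Percentage of swap in use, mirroring the RPN expression
 *   swap_total,swap_free,-,swap_total,/,100,*
 * from the patch. Hypothetical helper for illustration only. */
double swap_util_percent(double swap_total, double swap_free)
{
    /* Assumption of this sketch: the caller never passes a zero total. */
    assert(swap_total > 0.0);
    return (swap_total - swap_free) / swap_total * 100.0;
}
```

For example, with swap_total = 2048 and swap_free = 512 the expression evaluates to 1536 / 2048 * 100 = 75. The `util` CDEF for real memory in the same patch has the same shape, with bmem_total and bmem_free as operands.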
Re: [Ganglia-developers] backport proposal for graph statistics, bugzilla 206
On Sat, Oct 18, 2008 at 01:49:37PM -0400, Jesse Becker wrote:
> On Sat, Oct 18, 2008 at 13:33, Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote:
> > On Sat, Oct 18, 2008 at 01:14:32PM -0400, Jesse Becker wrote:
> >
> > couldn't just a merge from all relevant patches in trunk be used for backport?, if there are few minor textual changes why not include them as well to avoid having later conflicts when trying to merge further stuff from trunk?

in the bug report you wrote the following:

  Updated patch from trunk back to 3.1.x branch. Consolidates most of
  r1844, r1848, r1850, r1856, and r1857. There are a few minor textual
  changes from the listed revisions not included in this patch

> Because I figured it would be easier to review and test applying one patch, instead of the 4-5 that it takes otherwise. The single patch was produced directly from a diff against trunk; no new code is included.

I am not asking why a single patch was provided (which I agree is really useful for testing) but why the STATUS change doesn't instead list all the patches that are needed (at least from my initial merge attempts based on your instructions, I think 1847 is missing).

> > if the changes that were skipped were not good for 3.1, then they are not good for trunk either and that could be fixed with further patches in trunk than then could be added to the list of patches from 3.1 for backport.
>
> Nothing was skipped. In fact, given that new stuff goes into trunk *before* it goes into 3.1, I don't really see how it could have been skipped. As I said, there's no new code here.

as explained before, there was no mention of new code being proposed; on the contrary, the point is that some revisions are probably missing, and that will result in the long run in a divergence between trunk and 3.1 which will affect future merges.

Carlo
Re: [Ganglia-developers] [Ganglia-svn] SF.net SVN: ganglia:[1825]trunk/monitor-core/gmond/gmond.c
On Thu, Sep 25, 2008 at 08:15:05AM -0600, Brad Nicholes wrote:
> [EMAIL PROTECTED] wrote:
> >   if (strcasecmp(cb->name, metric_info[i].name) == 0)
> >     {
> > -     sprintf (modular_desc, "%s (module %s)", metric_info[i].desc, cb->modp->module_name);
> > +     snprintf (modular_desc, sizeof(modular_desc),
> > +               "%s (module %s)",
> > +               metric_info[i].desc,
> > +               cb->modp->module_name);
> > + desc = (char*)modular_desc;
> >       break;
> >     }
>
> When copying into the buffer, shouldn't the length be sizeof(modular_desc)-1 rather than the full length of the buffer?

the length is the maximum number of bytes that will be written to the buffer, so sizeof(modular_desc) is the more natural fit, since otherwise the buffer will be artificially restricted by 1 byte.

> It needs to allow for a NULL terminator.

snprintf is defined by C99 and I don't have a specification handy, but the man page for it in Linux says:

  The functions snprintf() and vsnprintf() write at most size bytes
  (including the trailing null byte ('\0')) to str.

so the NULL terminator is already included in the length requested, and a quick test with gcc 4.1.2 and the following code shows that it is indeed terminating the buffer, truncating the result as needed to do so.

  #include <string.h>
  #include <stdio.h>

  #define BUFFSIZE 4

  int main(int argc, char *argv[])
  {
      char buffer[BUFFSIZE];
      char *source = "test";
      int n;

      /* fill the whole buffer, leaving no terminator */
      for (n = 0; n < BUFFSIZE; n++)
          buffer[n] = 'A';
      /* print exactly BUFFSIZE bytes, since buffer is not terminated yet */
      printf("%.*s\n", BUFFSIZE, buffer);
      snprintf(buffer, sizeof(buffer), "%s", source);
      /* the result is truncated to BUFFSIZE - 1 characters */
      printf("%s\n", buffer);
      /* the last byte of the buffer holds the terminator (prints 0) */
      printf("%d\n", buffer[BUFFSIZE - 1]);
      return (0);
  }

to avoid truncation modular_desc should be large enough, but it is now 1024 bytes long and most likely big enough already (if not too big).

Carlo
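The point made in the message above, that snprintf() can be given the full buffer size and still terminates the result, can also be checked through its return value. The following is a minimal sketch, not the gmond code: the helper name `format_desc` is hypothetical and only reuses the "%s (module %s)" format from the quoted diff. In C99, snprintf() returns the number of characters the full output would have needed, excluding the terminator, which makes truncation detectable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Format "<desc> (module <module>)" into dst, which holds dst_size bytes.
 * Hypothetical helper for illustration, not taken from gmond.
 * Returns 1 if the whole string fit, 0 if it was truncated.
 * snprintf() receives the full buffer size: it writes at most
 * dst_size - 1 characters and then adds the terminating '\0'. */
int format_desc(char *dst, size_t dst_size, const char *desc, const char *module)
{
    assert(dst != NULL && dst_size > 0);
    int needed = snprintf(dst, dst_size, "%s (module %s)", desc, module);
    return needed >= 0 && (size_t)needed < dst_size;
}
```

With an 8-byte buffer and the arguments "load" and "load_module", the call reports truncation and leaves the terminated string "load (m" in the buffer; with a 64-byte buffer the same call reports success.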
Re: [Ganglia-developers] gmond native Windows binary
On Thu, Sep 25, 2008 at 10:14:14PM +0100, [EMAIL PROTECTED] wrote:
> I am aware of mingw32 - however, a Cygwin environment provides autoconf and friends.

there is also a similar environment for mingw called msys, but there is no reason to use it (nor would I recommend it) when you can just as well install gcc-mingw in cygwin and have both.

> Do you propose:
> a) running configure on Linux and then mingw on Windows, or
> b) running mingw32 within a Linux host to cross-compile for Windows, or
> c) is there a complete way to build directly from a fresh SVN checkout on Windows using the mingw32 tools and native autotools executables?

not sure what you meant here, but as you explained before ./configure works in cygwin using mingw:

  # CC="gcc -mno-cygwin" ./configure

and you can also cross-compile from linux using mingw if inclined. last I checked you could also bootstrap trunk in cygwin if you wanted to, and configure/build libmetrics either as a cygwin library or as a native windows library.

> Regarding the C++ issues and Posix thread issues: is there any intention to back port this to 3.1.x, or is there another release that these are targetted for? I'm not too worried either way, I just want to be able to focus my efforts in the right versions.

the C++ issues and some of the fixes needed to build libmetrics as a native windows library have been proposed for backport into 3.1 for some time already but have no votes yet. I don't see any reason to keep them only in trunk, but in any case all development happens in trunk, so that is where you have to focus anyway. I would imagine this might be material for 3.2 once all the pieces are in place, but since fixing DSO support for windows is in the TODO list for 3.1, some of that code will have to be backported anyway.

no fixes have been implemented yet for the Posix thread vs Native thread issue, but it has been talked about several times before, and I remember you mentioning at least once, in one of the last threads, that you had done work with some library that could be used to abstract the differences. I haven't yet read any specifics I could share, and I am too lazy to search the mailing list for links.

> Regarding libConfuse: can you refer me to any previous postings on the issue of libConfuse static build on Windows? Is this a limitation that originates upstream?

the libConfuse limitations come from upstream, which is why the only viable solution for now is to link it statically (as is done in cygwin), unless the upstream version is fixed and we then rely on that not yet existent version.

> I'm not a big fan of the srclib concept myself, but I don't see why there shouldn't be snapshots of essential dependencies in another part of the SVN repository, not directly under the Ganglia sub-tree.

because then you would need to pull all the pieces by hand to build it, and you would never have a standalone package (snapshot or release) that could be used.

in any case, having the build use a srclib provided libconfuse statically if none is available, or when instructed by doing --with-libconfuse=internal or something like that, is far better than any alternative AFAIK.

> Also, I remember seeing something about problems running dynamically linked metric modules on Windows - is that the case? If so, is it something that can be overcome if someone is willing to work on it? The APR docs seem to suggest that it is intended to support Windows DLLs.

if building a native windows version linked to a windows APR, maybe, but not if building inside cygwin, as APR will then try to use the UNIX code and fail. this is IMHO an APR deficiency, which also shows when trying, for example, to build apache using DSO inside cygwin.

there is also the problem that in windows (like in AIX) all objects must be resolved at link time, and sadly our build process is not clean enough to ensure that yet (even if it works in Linux and other platforms like Solaris/BSD, where the dynamic linker does lazy binding to cover for our build process deficiencies).

Carlo
Re: [Ganglia-developers] Small bug fix for ganglia.spec
On Tue, Sep 30, 2008 at 11:20:47PM +0200, Ulf wrote:
> Can you just add the missing %defattr(-,root,root,-) to the ganglia.spec.

Committed revision 1842.

Carlo