Re: [Ganglia-general] Ganglia Web top-level project + versioning
On 11/4/2010 at 6:21 PM, in message aanlkti=oxs0t1fbsf9no5og6phqxcbjxuscm1w9kt...@mail.gmail.com, Bernard Li bern...@vanhpc.org wrote:

Hi Brad: [I've changed the subject line to be more reflective of the current discussions]

On Thu, Nov 4, 2010 at 8:50 AM, Brad Nicholes bnicho...@novell.com wrote:

I'm not sure that we need to physically split the web frontend from the backend as far as the Ganglia project goes. IMO, why not just follow the pattern that we already have in SVN under trunk? Right now we have trunk/monitor-core which includes everything. Could we just create a new directory under trunk called web-frontend and move everything that has to do with the web frontend out of monitor-core and into web-frontend? From that point on, they could both be treated as separate projects with their own release cycles without physically splitting the code into different repositories. Tagging and branches would also work the same way.

That's fine. How about versioning? Or am I thinking too much? One potential issue is that ganglia-core would be at 4.0 and ganglia-web will be at 3.5 -- this might cause confusion as to what combination is supported, or vice versa.

As far as versioning goes, I think that ganglia-web would just follow its own version scheme. The frontend might have to include some kind of check on the version of the backend to make sure that it is compatible. I'm not sure how flexible the frontend could be, but since all it is doing is consuming XML, I am guessing that it could be fairly flexible when it comes to backward compatibility. I am guessing that the most likely scenario is that a user would upgrade the frontend a lot more frequently than the backend. So there probably wouldn't be much need to worry about an older frontend having to support a newer backend. I think it would be a natural thing for a Ganglia user to automatically upgrade the frontend whenever the backend is upgraded. But they would probably upgrade the frontend routinely without a backend upgrade.

Anyway, yes I think you are thinking too much :-) Documenting compatibility would probably be sufficient. Of course we, as the Ganglia developers, wouldn't be able to test every new release of the frontend with every previous release of the backend. But like I said, since the frontend is just consuming XML, it should be flexible enough to handle backwards compatibility. Also, since the XML schema isn't expected to change, at least not drastically, within a major version of the backend, backward compatibility should be simple.

Brad

___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
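The backend version check suggested above could look roughly like the sketch below. It is only an illustration in Python (the real frontend is not written in Python), and it assumes the backend XML carries its version in the VERSION attribute of the GANGLIA_XML root element. The function names and the "same major version" rule are hypothetical, not something the project has agreed on.

```python
import xml.etree.ElementTree as ET


def backend_version(xml_text):
    """Return the backend version from the GANGLIA_XML root as a tuple of ints."""
    root = ET.fromstring(xml_text)
    if root.tag != "GANGLIA_XML":
        raise ValueError("not a Ganglia XML document")
    return tuple(int(part) for part in root.get("VERSION").split("."))


def is_compatible(xml_text, supported_major=3):
    """Hypothetical rule: the frontend supports any backend with the same major version."""
    return backend_version(xml_text)[0] == supported_major


# Minimal stand-in for what a backend might send.
SAMPLE = '<GANGLIA_XML VERSION="3.1.4" SOURCE="gmond"></GANGLIA_XML>'

print(backend_version(SAMPLE))  # (3, 1, 4)
print(is_compatible(SAMPLE))    # True
```

Comparing only the major version matches the expectation that the XML schema does not change drastically within a major backend release.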
Re: [Ganglia-general] Welcoming newest members to the Ganglia Team!
As part of the Ganglia development team, I just wanted to add my welcome to all of the new committers as well. It is always great to see so many community members wanting to pitch in and help move the project forward. Brad On 11/3/2010 at 11:48 PM, in message aanlktimpmvlvbl0cykepoay0uhwvkeuc5v-zldqgv...@mail.gmail.com, Bernard Li bern...@vanhpc.org wrote: Dear all: Hope that you are having a great week so far! I have some exciting news to announce. We have recently added a few active community members to the Ganglia Team to help with upcoming releases and development. Project administrators routinely scout active participants of the Ganglia community and invite them to join our ranks to further the project. Without further ado, please welcome the newest members of our team! Ben Kero (bkero @IRC) joining us from Mozilla, will be helping out with Wiki documentation. Kostas Georgiou (londo @IRC) is the current Fedora packager and should have been added to the team long ago :) Nicolas Brousse (orieg @IRC) has written the gmond PHP DSO module interface and contributed to the gmond Perl DSO module interface. He will act as Release Manager for the upcoming 3.0.8 release. Erik Kastner (kastner @IRC) from Etsy, and Vladimir Vuksan (vvuksan @IRC) from Broadcom have been spearheading the Ganglia Web Frontend re-write, will hopefully be giving our frontend some much overdue facelift. There is always more work that needs to be done, so if you like Ganglia and would like to help out, please do not hesitate to contact us! You can always help by testing new releases, filing bugs, submitting patches, forking our GitHub plugin repositories and answering questions on mailing-lists and web forums. 
For more information on how the project works, please have a look at our Wiki: https://sourceforge.net/apps/trac/ganglia/wiki/how_project_works A friendly reminder that Matt and I will be holding a Ganglia BoF at LISA '10, details here: http://www.usenix.org/events/lisa10/bofs.html#ganglia Thanks for your attention and continued support for the project! Bernard -- on behalf of the Ganglia Development Team --
Re: [Ganglia-general] [Ganglia-developers] IRC chat on Ganglia Web Frontend re-write 10/13/2010 (Wed) 9-10am PDT
I'm not sure that we need to physically split the web frontend from the backend as far as the Ganglia project goes. IMO, why not just follow the pattern that we already have in SVN under trunk. Right now we have trunk/monitor-core which includes everything. Could we just create a new directory under trunk called web-frontend and move everything that has to do with the web frontend out of monitor-core and into web-frontend. From that point on, they could both be treated as separate projects with their own release cycles without physically splitting the code into different repositories. Tagging and branches would also work the same way. The only purpose I see for splitting them into two different projects is to try to grow two different communities (ie. developers with rights to the web project who don't necessarily have rights to the monitor-core project and vice-versa). Given the fact that we don't really have a large developer community, I'm not sure that it would be a good idea to split the community that we have. Brad On 11/4/2010 at 1:15 AM, in message aanlktimset-nck4h0wrktf6dyszsdf1uv_mxxslup...@mail.gmail.com, Bernard Li bern...@vanhpc.org wrote: Hi all: The other day we were talking on IRC regarding how to proceed with this re-write effort for the frontend. In the beginning, I was gung-ho on this re-write from scratch, however, recently Vladimir has been hacking away adding new features to the existing code in trunk. You can get a taste of it here: http://ec2-184-72-167-114.compute-1.amazonaws.com/ganglia-new/ Which got me to thinking... is a re-write from scratch the best approach, or should we just try to keep extending what we have? Another administrative issue that cropped up, is whether to split out Ganglia-Web as a separate project such that it doesn't need to follow the main Ganglia release cycle (since the frontend code is usually backward/forward compatible with Ganglia releases anyway). 
My idea is to create a new project for the frontend, give it a new name and start with a new version. With that, we can tell users that after Ganglia X, we will no longer be shipping the web component, use Y for that. Another approach is to retain the Ganglia name, but say that after Ganglia 3.2, there are 2 separate projects, ganglia and ganglia-web, in which case ganglia-web will be on a different release cycle than ganglia. Sounds confusing? Yes it is! :) I don't really care either way, as long as it causes the least confusion to the users -- feel free to offer Plan C. Another plan I have in mind is after we create branch-3.2 from trunk, we remove the web component from the code base, in which case all future bug fixes to ganglia-web goes into that branch only, and we will move development to GitHub (just for the frontend). Thanks! Bernard On Thu, Oct 21, 2010 at 11:29 PM, Bernard Li bern...@vanhpc.org wrote: Hi all: Sorry for the delay in posting the log, but I have finally uploaded it. Thanks Jesse for logging: http://therealms.org/oss/ganglia/ganglia_frontend_rewrite_irc_101310.txt I have left the log as is, I just filtered out people's hostnames and stuff. I chopped off at the end when we started discussing outside the scope of the frontend re-write. I will try to summarize the log in the next few days, but if anybody else who was there would like to take a stab at it, please feel free. I think Erik and Vladimir have been hard at work hacking at a Ganglia installation on an AWS instance. We will try to schedule another time to sync up and discuss further (would a phone teleconference be better this time, or should we stick with IRC)? It doesn't look like the hackathon would happen next month. It might become a virtual hackathon but I would really like to put all the developers in a room, but anyway, we'll see. Thanks again for all who showed up, and for all the great discussions. 
Cheers, Bernard On Wed, Oct 13, 2010 at 11:52 AM, Jesse Becker haw...@gmail.com wrote: I have a log that I will try to clean up and post later today. On Wed, Oct 13, 2010 at 14:46, Dave Josephsen d...@dbg.com wrote: Hey all, Did anyone take minutes? I wasn't able to attend but am interested in hearing about the chat. Thanks -dave - Original Message - From: Bernard Li bern...@vanhpc.org To: ganglia-develop...@lists.sourceforge.net, Ganglia ganglia-general@lists.sourceforge.net Sent: Thursday, October 7, 2010 1:55:26 PM GMT -06:00 US/Canada Central Subject: [Ganglia-general] IRC chat on Ganglia Web Frontend re-write 10/13/2010 (Wed) 9-10am PDT Dear all: I've been talking to people on and off about doing a web frontend re-write, in fact I have been thinking about it since almost three years ago when I started the wishlist thread: http://www.mail-archive.com/ganglia-develop...@lists.sourceforge.net/msg03070.html I've managed to gather a group of developers and
Re: [Ganglia-general] Help: Using Ganglia with KVM/QEMU/libvirt ? Any Python DSOs out there?
On 10/20/2010 at 8:22 PM, in message aanlkti=ezmjzo4s6sqyp4m7bdhposhxp2oufagz_z...@mail.gmail.com, Lukas Lundell lukaslund...@gmail.com wrote:

Looking to use Ganglia to monitor a virtual Linux environment (kvm/qemu). I haven't seen any plugins or Python DSOs for something like libvirt so that Ganglia could get information for the virtualized guest Linux instances/domains. Does anyone have any experience doing this? I would like to go about writing a Python DSO to interface with libvirt if there isn't one already out there...

I actually wrote a Python module a couple of years ago that would collect metrics for each of the VMs running on a host. It would then report the metrics through the spoofing functionality of Ganglia so that each of the VMs would show up in Ganglia just as if there was a gmond agent running on them, even though no gmond agent existed in any of the VMs, only on the host. Anyway, the disappointing part of this whole exercise is that there are very few useful metrics that you can gather through libvirt. Most of the metrics are constants, such as how much memory or disk space has been allocated to the VM, rather than what the memory utilization is, etc. I haven't revisited this module in a couple of years so it might be that there is more useful information that can be gathered now. At the time I was also querying Xen VMs.

Brad
Re: [Ganglia-general] How is the multicpu module to be used?
On 6/16/2010 at 2:10 PM, in message 20100616201041.ga9...@transpect.com, Whit Blauvelt w...@transpect.com wrote:

Hi, I've compiled ganglia-3.1.7 on CentOS 5.5. The main thing I'm trying to monitor on our cluster is load on individual CPU cores. It looks like the included multicpu module should do that. I can't find any complete description of how to set that up. What should I be looking at?

In the multicpu.conf file there is a note that says that additional metric definitions should be added to the .conf file for each discovered cpu on the system. You can get all of the cpu definitions for a system by running gmond with a -m parameter. The output of gmond -m will be a list of all of the discovered metrics. Each metric that begins with multicpu_ will need to be defined as a metric in the multicpu.conf file. Then start gmond normally and it should start monitoring all of the cpu metrics for each discovered cpu.

NOTE: the next question is usually, if gmond -m can discover the metrics, why do we have to specifically define them as well in the .conf file? The answer to this question has to do with how metrics are enabled through gmond rather than just discovered. There have been previous list discussions about this topic.

Brad
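To avoid typing one definition per core by hand, the step from gmond -m output to multicpu.conf entries can be scripted. The sketch below is only an illustration: it assumes each line of the gmond -m output starts with the metric name followed by whitespace, and the sample lines and the helper name are made up for the example. Check the real output on your system before relying on it.

```python
def multicpu_metric_blocks(gmond_m_output):
    """Build gmond.conf metric blocks for every multicpu_ metric in `gmond -m` output."""
    blocks = []
    for line in gmond_m_output.splitlines():
        fields = line.split()
        # Assumption: the metric name is the first whitespace-separated field.
        if fields and fields[0].startswith("multicpu_"):
            blocks.append('  metric {\n    name = "%s"\n  }' % fields[0])
    return "\n".join(blocks)


# Hypothetical excerpt of `gmond -m` output, used only to exercise the helper.
SAMPLE_OUTPUT = """\
load_one        One minute load average (module load_module)
multicpu_user0  Percentage of CPU utilization in user mode (module multicpu_module)
multicpu_user1  Percentage of CPU utilization in user mode (module multicpu_module)
"""

print(multicpu_metric_blocks(SAMPLE_OUTPUT))
```

The printed blocks are meant to be pasted inside the collection_group in multicpu.conf, alongside whatever title and value_threshold settings you want for each metric.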
Re: [Ganglia-general] sending float from python module ends up as integer (and very large)
On 5/6/2010 at 5:57 PM, in message l2qd4c731da1005061657xf03acf27x1f1b19b4a7909...@mail.gmail.com, Bernard Li bern...@vanhpc.org wrote:

Hi David:

On Thu, May 6, 2010 at 3:39 PM, David Birdsong david.birds...@gmail.com wrote:

i've since just converted 1.xx seconds to milliseconds and now i'm pretty happy with int as a precise enough data type. this doesn't explain why the values increment endlessly though when represented as floats and simply converting to ints solves the problem.

'value_type' and 'format' need to match up in the descriptor. If you wanted to do conversion, do it in your handler, it should work that way. However, I'm not sure why a mismatch between 'value_type' and 'format' would generate an ambiguous value -- any ideas Brad? Another thing is, instead of letting the user set the 'format', shouldn't we just hardcode what they should be?

The primary place in the code where the value type and format come together is at the point where the value is converted to a string and formatted into the XML tag. In this case, allowing the module to define the format also allows it to specify the precision that will ultimately show up in the XML output from gmond. I don't know if that is really a valuable feature or a good enough reason to not hardcode the format string for a given value type, but that is how it works now.

Another place in the code where value type is very important is when the value is pushed through XDR. This is the process which packages a metric into a very small packet which can be passed between systems safely. In order for XDR to create the packet correctly, it has to know exactly what type of data it is dealing with. Otherwise the data will be packaged and unpackaged by XDR using the wrong types and who knows what you will end up with after that.

Brad
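The effect of packing a value as one type and unpacking it as another can be reproduced without Ganglia at all. The sketch below is not gmond's XDR code; it uses Python's standard struct module as a stand-in, relying only on the fact that XDR encodes a single-precision float as four big-endian IEEE 754 bytes. The helper names are made up for the example.

```python
import struct


def encode_float(value):
    """Pack a float the way XDR lays it out: 4 bytes, big-endian IEEE 754."""
    return struct.pack(">f", value)


def decode_as_float(payload):
    """Unpack the 4 bytes with the correct type."""
    return struct.unpack(">f", payload)[0]


def decode_as_uint(payload):
    """Unpack the same 4 bytes as an unsigned integer, i.e. with the wrong type."""
    return struct.unpack(">I", payload)[0]


payload = encode_float(1.5)
print(decode_as_float(payload))  # 1.5
print(decode_as_uint(payload))   # 1069547520
```

A value of 1.5 seconds read back with the wrong type shows up as an integer in the billions, which is the kind of "very large integer" symptom described in this thread.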
Re: [Ganglia-general] sending float from python module ends up as integer (and very large)
On 5/7/2010 at 12:48 PM, in message r2wd4c731da1005071148t4107614fj661b0e3b5a27a...@mail.gmail.com, Bernard Li bern...@vanhpc.org wrote:

Hi Brad:

On Fri, May 7, 2010 at 7:37 AM, Brad Nicholes bnicho...@novell.com wrote:

The primary place in the code where the value type and format come together is at the point where the value is converted to a string and formatted into the XML tag. In this case, allowing the module to define the format also allows it to specify the precision that will ultimately show up in the XML output from gmond. I don't know if that is really a valuable feature or a good enough reason to not hardcode the format string for a given value type, but that is how it works now.

Another place in the code where value type is very important is when the value is pushed through XDR. This is the process which packages a metric into a very small packet which can be passed between systems safely. In order for XDR to create the packet correctly, it has to know exactly what type of data it is dealing with. Otherwise the data will be packaged and unpackaged by XDR using the wrong types and who knows what you will end up with after that.

I have updated the Wiki page with additional information regarding the format string for the metric. Currently I am referencing the Python format string format, however, it looks like I should be referencing apr_snprintf()'s...? http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_gmond_python_modules

Right, apr_snprintf() is used to format the string that is used in the XML tag. Basically if the format string is following the printf() guidelines, it is good.

Brad
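As a rough illustration of keeping 'value_type' and 'format' in step, the sketch below pairs each value type with a printf-style conversion character and checks a descriptor against that table. The pairing table, the helper name and the example metric are assumptions made for this example, based on ordinary printf conventions; they are not taken from the gmond source.

```python
# Assumed pairing of descriptor value types with printf conversion characters.
EXPECTED_CONVERSION = {
    "string": "s",
    "uint": "u",
    "int": "d",
    "float": "f",
    "double": "f",
}


def format_matches(value_type, fmt):
    """Return True if a printf-style format such as '%.3f' ends with the conversion expected for value_type."""
    return fmt.startswith("%") and fmt[-1] == EXPECTED_CONVERSION.get(value_type)


# Hypothetical descriptor fragment for a response-time metric reported in seconds.
descriptor = {
    "name": "response_time",
    "value_type": "float",
    "units": "seconds",
    "format": "%.3f",  # three decimal places will show up in the XML value
}

print(format_matches(descriptor["value_type"], descriptor["format"]))  # True
print(format_matches("uint", descriptor["format"]))                    # False
print(descriptor["format"] % 1.25)                                     # 1.250
```

The precision in the format string ('%.3f' here) is what lets a module control how many decimal places end up in the XML output.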
Re: [Ganglia-general] Problem to include Plugin
On 4/27/2010 at 3:20 AM, in message 1272360012.4619.9.ca...@station3.hq, Patrick Datko patrick.da...@ymc.ch wrote:

Hey People, I'm using Ganglia 3.1.2, installed with aptitude, to observe my cluster and it works without any problem. I wanted to integrate a metric which monitors the traffic of the several nodes, so I built a little Python module to check an XML file which includes the traffic amount. I used the SourceForge wiki to build one. I included in my Python script the 3 methods (traffic_handler, metric_init, metric_cleanup) which are required by Ganglia and added the following lines to /etc/ganglia/gmond.conf:

    module {
      name = traffic
      language = python
      path = /usr/lib/ganglia/traffic.py
    }
    collection_group {
      collect_every = 10
      time_threshold = 50
      metric {
        name = traffic
        title = Traffic
        value_threshold = 70
      }
    }

But if I restart gmond and gmetad the metric still does not appear in the web interface of Ganglia. Does anyone have a clue where the problem is, or maybe has the same problem as me?

Have you run your module independent of gmond to make sure that it is functioning correctly? Have you tried starting gmond with a -d 10 command line parameter to force the debug output to the screen? This will usually show you if there is a problem loading the module.

Brad
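Running a module independent of gmond is easiest if the file carries its own small self-test. The sketch below shows one way a traffic.py with the three functions named above might be laid out. The returned value is a placeholder, and the descriptor fields (units, group, time_max and so on) are illustrative assumptions rather than the poster's actual code.

```python
def traffic_handler(name):
    """Metric callback. Placeholder: a real module would read the traffic amount from its XML file."""
    return 42


def metric_init(params):
    """Called once when the module is loaded; returns the list of metric descriptors."""
    return [{
        "name": "traffic",
        "call_back": traffic_handler,
        "time_max": 90,
        "value_type": "uint",
        "units": "bytes",
        "slope": "both",
        "format": "%u",
        "description": "Traffic amount (placeholder value)",
        "groups": "network",
    }]


def metric_cleanup():
    """Called when the module is unloaded; nothing to release in this sketch."""
    pass


if __name__ == "__main__":
    # Standalone self-test: exercise every descriptor without gmond.
    for descriptor in metric_init({}):
        value = descriptor["call_back"](descriptor["name"])
        print("value for %s is %s" % (descriptor["name"], value))
    metric_cleanup()
```

If running the file directly with the Python interpreter prints a sensible value, the next step is the gmond -d 10 run suggested above, which should show whether gmond itself manages to load the module.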
Re: [Ganglia-general] Problem with custom metrics
Actually Bernard is the guru here. thanks Bernard :)

On 4/12/2010 at 12:20 PM, in message c7e8dca8.7895%hugo.hernan...@nih.gov, Hernandez, Hugo (NIH/NIAID) [C] hugo.hernan...@nih.gov wrote:

Brad, Those changes did the trick. Thanks a lot! Now, I can explore my new metrics to be added. -Hugo

On 4/12/10 2:14 PM, Bernard Li bern...@vanhpc.org wrote:

Hi Hugo:

On Mon, Apr 12, 2010 at 9:51 AM, Hernandez, Hugo (NIH/NIAID) [C] hugo.hernan...@nih.gov wrote:

[r...@rocks ~]# python /opt/ganglia/lib64/ganglia/python_modules/hostTemp.py
value for tempHost is 8

Try renaming your Python file tempHost.py. That's the name given to your module. You might also want to rename your pyconf tempHost.pyconf for the sake of consistency.

Cheers, Bernard

-- Hugo R. Hernandez, Contractor Dell Perot Systems Sr. Systems Administrator Mac Linux Server Team, OCICB/OEB National Institutes of Health National Institute of Allergy and Infectious Diseases 10401 Fernwood Drive Fernwood West - Rm. 2009 Bethesda, MD 20817 Phone: 301-841-4203 Cell: 240-479-1888 Fax: 301-480-0784 www.dell.com/perotsystems

-- If your efforts were met with indifference, do not be discouraged: the sun puts on a marvelous show every morning while most people are still asleep - Anonymous Brazilian
Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.7 ready for testing
On 3/2/2010 at 4:23 AM, in message 4b8cf534.7090...@pocock.com.au, Daniel Pocock dan...@pocock.com.au wrote: Thanks to those who provided feedback - any objections to making 3.1.7 generally available? I would like to make it GA within the next 1-2 days now. +1 Michael Perzl wrote: I have successfully compiled and tested 3.1.7 on - AIX 5.1 ML04 - AIX 5.3 ML00 - AIX 5.3 TL07 - AIX 6.1 TL03 Regards, Michael On 02/22/2010 12:15 PM, Daniel Pocock wrote: Just a reminder - any feedback is welcome, or feel free to discuss 3.1.7 on IRC It would be good to have positive confirmation of which platforms this has been tested on, so far, I have tested - Debian lenny, - RHEL3/4/5, - CentOS 5, - Solaris 8 and - Cygwin. and Brad has done some testing on SLES10 Regards, Daniel Daniel Pocock wrote: I've tagged 3.1.7 and built a tarball: http://ganglia.info/testing/ganglia-3.1.7.tar.gz The md5sum for 3.1.7 is: 6aa5e2109c2cc8007a6def0799cf1b4c Since 3.1.6, only two things have changed and may need to be tested again by those who tested 3.1.6: - the build system (support for commas in CFLAGS) - the multicpu module - percentages reported differently This is not confirmation that the release is in GA status - a further notification will be sent when the testing period has elapsed without any serious defect. Users are invited to test the tarball and submit feedback. Please do not commit on branches/monitor-core-3.1 until after 3.1.7 goes GA, in case further tweaks are needed to facilitate a successful release. Below are the release notes from the STATUS file. 
Other documentation has also changed since 3.1.2 and should be reviewed:

GANGLIA 3.1 STATUS: -*-text-*-
Last modified at [$Date: 2010-02-17 11:01:08 + (Wed, 17 Feb 2010) $]

The current version of this file can be found at:
* http://ganglia.svn.sourceforge.net/svnroot/ganglia/branches/monitor-core-3.1/STATUS

Release history:
3.1.7           : Tagged: Feb 17, 2010
3.1.6           : Tagged: Feb 4, 2010 (not released for GA)
3.1.5(hargrave) : Tagged: Nov 24, 2009 (not released for GA)
3.1.4(hargrave) : Tagged: Oct 26, 2009 (not released for GA)
3.1.3(avenger)  : Tagged: Sep 19, 2009 (not released for GA)
3.1.2(langley)  : Released: Feb 17, 2009
3.1.1(wien)     : Released: Sep 10, 2008
3.1.0(amelia)   : Released: Jul 30, 2008

Contributors looking for a mission:
* Just do an egrep on TODO, XXX or FIXME in the source.
* Review the bug database at: http://bugzilla.ganglia.info/
* Open bugs in the bug database.
* Implement a feature from the wishlist at: http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_wish-list

CURRENT RELEASE NOTES: (Please update this area with a brief description of bug fixes and enhancements that have been backported for the current release)

Note: 3.1.3, 3.1.4, 3.1.5 and 3.1.6 never became GA, therefore, the release notes for all of them are combined below.

3.1.7:
* Fix build support for RHEL5/issue with commas in CFLAGS
* multicpu module: show CPU utilization as a value between 0-100% for each core

3.1.6:
* Merge commit 1966 from trunk to fix contrib/removespikes.pl
* Bootstrapping with Debian 5.0 (lenny) versions of autotools for this and future releases.
  http://www.mail-archive.com/ganglia-develop...@lists.sourceforge.net/msg05352.html
  http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg04688.html
* Require user to explicitly specify sysconfdir when building from source, due to the fact that the old behavior was not consistent with the documented behavior.
* Configuration files and scripts are now created during the install phase rather than during configure. This allows values such as @sysconfdir@ to be used in the template configuration files.
* Abolish the use of release names - only release numbers will be used to distinguish versions in future
* libmetrics: workaround system header conflict in DFBSD= 2.4 (BUG245)
* Use PCRE regex matching to configure metrics using the name_match directive
* rrdcached support
* gmetad now uses apr and the sleep intervals between polls are randomized in a way that supports shorter polling intervals
* FreeBSD support: fixes for crashes and disk statistics (BUG153)
* Further tweaks to Solaris build support (remove C99 hack)
* Eliminate conflict with ncpus symbol name on older Solaris
* AIX support: determine if the host is a virtual server (BUG226)
* AIX support: setting linker flags (BUG227), add -lm
* AIX support: tweaks for AIX= v6.1
* AIX support: revised init scripts for gmond and gmetad
* Check for Python.h explicitly
* Include the necessary Python files in the
Re: [Ganglia-general] gmetad and RDD problem
On 2/10/2010 at 1:36 AM, in message 70933b58740d5049a7ab96254a66683301a0e...@yaca.intra.cea.fr, GOGUEY-MUETHON Nicolas OSIATIS nicolas.goguey-muet...@cea.fr wrote:

Hello, I have a lot of log entries with errors like this:

Feb 10 09:29:52 SERVEUR /usr/sbin/gmetad[22332]: RRD_update (/var/lib/ganglia/rrds/NOEUDS/__SummaryInfo__/part_max_used.rrd): illegal attempt to update using time 1265790592 when last update time is 1265790592 (minimum one second step)

When I have seen this problem it is because the system time on one of your monitored nodes is ahead of the system time of the machine that is running gmetad. Make sure that the system time for all of your machines is in sync.

Brad
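One way to spot a node whose clock runs ahead is to compare the timestamps in the gmond XML with the clock on the gmetad machine. The sketch below is an illustration only: it assumes each HOST element carries a REPORTED attribute holding a Unix timestamp, and the host names and sample document are made up for the example.

```python
import time
import xml.etree.ElementTree as ET


def hosts_ahead_of_local_clock(xml_text, now=None, tolerance=0):
    """Return {host name: seconds ahead} for hosts whose REPORTED time is later than `now`."""
    if now is None:
        now = int(time.time())
    ahead = {}
    for host in ET.fromstring(xml_text).iter("HOST"):
        skew = int(host.get("REPORTED")) - now
        if skew > tolerance:
            ahead[host.get("NAME")] = skew
    return ahead


# Hypothetical two-node cluster; node2 reports a time 50 seconds in the future.
SAMPLE = """<GANGLIA_XML VERSION="3.1.4" SOURCE="gmond">
<CLUSTER NAME="NOEUDS">
<HOST NAME="node1" REPORTED="1265790592"/>
<HOST NAME="node2" REPORTED="1265790650"/>
</CLUSTER>
</GANGLIA_XML>"""

print(hosts_ahead_of_local_clock(SAMPLE, now=1265790600))  # {'node2': 50}
```

Any host that shows up in the result is a candidate for a time sync check (NTP or similar) before looking further at the RRD errors.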
Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing
On 12/2/2009 at 7:21 AM, in message 4b1677e4.8000...@pocock.com.au, Daniel Pocock dan...@pocock.com.au wrote:

I would like gmond to return a non-zero return code if it fails to initialise, e.g. if it is unable to bind or if it is unable to resolve a hostname mentioned in gmond.conf. Otherwise, the init-script always says that it started '[OK]' even if the daemon process has died on startup. That is why this change was made. However, I see a few solutions going forward:
- we can discard the patch completely
- we can discard the patch, and I could write another patch that does some tests (e.g. resolving host names) before daemonizing
- we can #ifdef the patch so that on BSD systems, it daemonizes earlier, and on other systems it does so later
- we can modify the init script to sleep and then call `ps -C gmond' and determine if it kept running
- post the problem on the apr dev list and discuss it there before making any decision

I'm not sure that I have anything to add as far as the discussion of this issue goes, but I have commit rights on the APR project. If you go with the last option and take this discussion to the APR-dev list, I can certainly get whatever patch is agreed upon committed and backported in APR. The downside to that option is that we would have to bundle the latest APR RPMs or tarball with Ganglia rather than using the distro version. So even if we do find a solution in APR, we will probably still have to build in a workaround in gmond.

Brad
Re: [Ganglia-general] [Ganglia-developers] 3.1.4 to go GA?
On 11/20/2009 at 8:07 AM, in message 4b06b0af.1050...@pocock.com.au, Daniel Pocock dan...@pocock.com.au wrote: Brad Nicholes wrote: I've been running it on a very small set of machines. It all looks good to me. No complaints from anyone... is that sufficient to go live? I'm not sure if I have the access level to put the release on the SF site though. You are the release manager. The decision to go live is your call. :) Brad
Re: [Ganglia-general] [Ganglia-developers] 3.1.4 to go GA?
I've been running it on a very small set of machines. It all looks good to me. Brad On 11/18/2009 at 9:42 AM, in message d4c731da0911180842x74ecc2c3p2f440e9c521d7...@mail.gmail.com, Bernard Li bern...@vanhpc.org wrote: I haven't had a chance to test it out yet -- has anybody else been able to give it a spin? Cheers, Bernard On Wed, Nov 18, 2009 at 7:22 AM, Daniel Pocock dan...@pocock.com.au wrote: How do people feel about making 3.1.4 GA?
Re: [Ganglia-general] Ganglia cannot find a data source.
On 11/17/2009 at 10:04 AM, in message b1eec58d0911170904r2f2613ads9244341a82b85...@mail.gmail.com, Ryan Robertson 89esp...@gmail.com wrote:

I too have been banging my head on this for a few weeks. After much googling I cannot seem to find the answer, so I hope someone (a developer, maybe) can help. I was successfully using Ganglia 2.5 and 3.0.x. At some point I upgraded to 3.1.x and things went sour. I've even tried to revert to a known working condition, to no avail.

So here's my current setup: gmetad 3.1.4 running under SUSE 11.1 ppc, using a basic gmetad.conf file monitoring itself (localhost) for troubleshooting purposes:

---snip from /etc/gmetad.conf ---
data_source "my cluster" localhost gpipnim01
data_source "sap_app" gpiptcpeap02
---snip---

XML on localhost seems fine. I can telnet to localhost 8649 and get proper results. FWIW:

<GANGLIA_XML VERSION="3.1.4" SOURCE="gmond">

RRDs are updating properly in /var/lib/ganglia/rrds/. gmond (on localhost) in debug mode is sending updates (obviously, since RRDs are being created). gmond -m shows modules are loaded.

Web frontend: when I hit the web page I get "Ganglia cannot find a data source. Is gmond running?"

When you telnet to 8652, what do you get? Localhost 8649 is the output from gmond on localhost. Localhost 8652 is the interactive port of gmetad, which is the port that the web frontend uses to get the metric data.

Brad
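For anyone who would rather script the two telnet checks above, here is a minimal sketch. It assumes the ports from this thread (8649 for gmond, 8652 for gmetad's interactive port) and the "/?filter=summary" request Ryan used later in the thread; the function names are made up for illustration and are not part of Ganglia.

```python
import socket


def read_ganglia_xml(host, port, request=None, timeout=5.0):
    """Connect to a gmond/gmetad TCP port and return everything it sends.

    gmond's port dumps its XML as soon as you connect. gmetad's
    interactive port waits for a request line (for example
    "/?filter=summary") before it answers.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        if request is not None:
            sock.sendall(request.encode("ascii") + b"\n")
        while True:
            data = sock.recv(65536)
            if not data:  # peer closed the connection: the dump is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("latin-1")


def looks_like_ganglia(xml_text):
    """Cheap sanity check: did a complete GANGLIA_XML document come back?"""
    return "<GANGLIA_XML" in xml_text and "</GANGLIA_XML>" in xml_text
```

Running `looks_like_ganglia(read_ganglia_xml("localhost", 8652, "/?filter=summary"))` as the same user the web server runs as should show whether the frontend can reach gmetad at all; a connection error (an `OSError`) points at the port rather than at the data.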
Re: [Ganglia-general] Ganglia cannot find a data source.
Sounds to me like it could be a file permissions problem then. Is your apache server able to access the rrd files and/or port 8652?

On 11/17/2009 at 1:00 PM, in message 0016e64c2536e598710478969...@google.com, 89esp...@gmail.com wrote:

Ahh yes, I knew there was one other telnet snippet question. I am able to telnet to localhost 8652 and feed it /?filter=summary. I get output... the output scrolled off the screen, but you get the idea that it's returning...

--snip--
</METRICS>
<METRICS NAME="swap_total" SUM="2019320" NUM="1" TYPE="double" UNITS="KB" SLOPE="zero" SOURCE="gmond">
<EXTRA_DATA>
<EXTRA_ELEMENT NAME="GROUP" VAL="memory"/>
<EXTRA_ELEMENT NAME="DESC" VAL="Total amount of swap space displayed in KBs"/>
<EXTRA_ELEMENT NAME="TITLE" VAL="Swap Space Total"/>
</EXTRA_DATA>
</METRICS>
<METRICS NAME="part_max_used" SUM="40.2" NUM="1" TYPE="double" UNITS="%" SLOPE="both" SOURCE="gmond">
<EXTRA_DATA>
<EXTRA_ELEMENT NAME="GROUP" VAL="disk"/>
<EXTRA_ELEMENT NAME="DESC" VAL="Maximum percent used for all partitions"/>
<EXTRA_ELEMENT NAME="TITLE" VAL="Maximum Disk Space Used"/>
</EXTRA_DATA>
</METRICS>
</CLUSTER>
</GRID>
</GANGLIA_XML>
--snip--
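One way to test the file-permissions theory is to walk the RRD directory as the web server's user and list whatever that user cannot read. The sketch below assumes the /var/lib/ganglia/rrds path mentioned earlier in the thread; the function name is hypothetical and it only checks read access for whichever user runs it.

```python
import os


def unreadable_paths(rrd_root):
    """Return .rrd files and directories under rrd_root that the current
    user cannot read. Run it as the web server's user for a useful answer."""
    blocked = []

    def note_error(err):
        # os.walk calls this when a directory cannot be listed at all.
        blocked.append(err.filename)

    for dirpath, _dirnames, filenames in os.walk(rrd_root, onerror=note_error):
        for name in filenames:
            if name.endswith(".rrd"):
                path = os.path.join(dirpath, name)
                if not os.access(path, os.R_OK):
                    blocked.append(path)
    return sorted(blocked)
```

An empty list from `unreadable_paths("/var/lib/ganglia/rrds")` would suggest the RRD files themselves are not the problem, which leaves access to port 8652 as the next thing to check.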
Re: [Ganglia-general] Ganglia install instructions wiki link broken
On 11/12/2009 at 8:57 AM, in message 4AFC3066.521 : 172 : 26400, Brad Nicholes wrote:

On 11/12/2009 at 6:12 AM, in message f7b2d28a-290a-4142-8f13-6034d55c2...@beforedawnsolutions.com, John Martyniak j...@beforedawnsolutions.com wrote:

First off, is that the best way to install Ganglia? Is there a better way? Better instructions? I checked the documentation on ganglia.org and the wiki links are broken, so I couldn't get any install instructions there.

I ran into this myself the other day. The document links on our wiki page (http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_gmond_configuration) are tied to the SourceForge docman which, I think, went away. Therefore the links to the install docs are now broken. At one point I had put the latest ganglia and gmond installation and configuration .html pages in docman so that we could reference them from the wiki. Since that is now broken, is there somewhere else where we can put the install and configuration pages? These documents could possibly change with each release of Ganglia. Since they are generated docs that are included in the distro tarball, they would need to be uploaded to the wiki document reference location after each release.

Brad

I think I have all of the wiki page links fixed up, especially on the installation and configuration page. I also fixed up some links to the misc. documents about Ganglia and monitoring. If anyone discovers any other broken links on the wiki, please let me know.

Brad
Re: [Ganglia-general] XML errors: XML_ParseBuffer() error at line 272: not well-formed(invalid token)
On 11/12/2009 at 8:11 PM, in message b791204d0911121911t5628f609s88f339567d104...@mail.gmail.com, chifeng chif...@gmail.com wrote:

Hi folks,

I got an XML error in Ganglia v3.1.2. It looks like this ticket:

http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg05054.html

Here is my error information:

Nov 12 13:43:18 labmonitor /usr/local/ganglia/sbin/gmetad[14362]: Process XML (BJQA1): XML_ParseBuffer() error at line 499: not well-formed (invalid token)

What would be nice is to change the error message in ganglia.php so that it produced the actual XML line or, better yet, dumped the whole parse buffer to a file. I have heard of this problem happening with later versions of apache mod_php. Since it seems to happen sporadically and everything is being processed in memory, it would be nice to see exactly what the XML parser is complaining about, or whether this is a bug in mod_php.

Brad
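To illustrate the kind of diagnostic suggested above, here is a rough sketch in Python rather than PHP, so it is not a patch for ganglia.php or gmetad. It feeds a captured buffer to expat, and on failure reports the line number, the column, and the text of the offending line, optionally dumping the whole buffer to a file. The function name and the dump path argument are assumptions made for the example.

```python
import xml.parsers.expat


def diagnose_xml(buffer, dump_path=None):
    """Parse `buffer` (bytes). Return None if it is well-formed, otherwise
    a tuple (line_number, column, text_of_offending_line).

    When dump_path is given, the raw buffer is written there on failure so
    it can be inspected afterwards.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(buffer, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as err:
        if dump_path is not None:
            with open(dump_path, "wb") as fh:
                fh.write(buffer)
        lines = buffer.split(b"\n")
        # err.lineno is 1-based; guard against an out-of-range report.
        bad = lines[err.lineno - 1] if 0 < err.lineno <= len(lines) else b""
        return err.lineno, err.offset, bad.decode("latin-1")
    return None
```

A buffer could be captured with something like `nc localhost 8649 > dump.xml` and passed in as bytes. Seeing the actual line usually makes it obvious whether a metric value or host name contains a character the parser rejects.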
Re: [Ganglia-general] special metric names in diskusage.pyconf file
No, there is no solution for this yet. The problem is that gmond does not yet provide a way for metrics to be dynamically defined. Every metric has to be defined through the configuration file first. I'm sure that there are many solutions to this problem, but nobody has stepped up to tackle this one yet. I don't believe that this one will be a quick fix. Everything is designed around full configuration through the config file. Solving this problem will require some architectural redesign.

Brad

On 10/21/2009 at 9:54 AM, in message 68fea9390910210854r3b116748j9c33a7f4c5ac4...@mail.gmail.com, Matt mattmora...@gmail.com wrote:

Is there any solution to this? It would be really beneficial to work out the metrics we want to publish in the python code rather than supplying them up front in the pyconf file.

2009/7/15 Brad Nicholes bnicho...@novell.com:

On 7/14/2009 at 4:36 PM, in message 4120cbd6bbd82647b89d6a70694510bed1c...@exchange02.presidio.alexa.com, Guolin Cheng guo...@alexa.com wrote:

Hi,

Does anyone know what the metric name disk_used-metric-name stands for? The stanza is from the diskusage.pyconf file, ganglia version 3.1.1/2.

    collection_group {
      ...
      metric {
        name = "disk_used-metric-name"
        ...
      }
      ...

It looks like the name stands for a series of metrics output from the associated python module, but I am not sure what the rule behind it is. Can anyone shed some light on this? Thanks a lot.

It is a placeholder for the actual metric name, which isn't determined until you run gmond. It seems like a chicken-and-egg thing, but what you have to do is run gmond with the -m parameter first. This will give you a list of all of the possible metrics, including those that come from diskusage.py. Then extract the actual diskusage metric names from the list and plug them into the .conf file. In most cases you will have to create additional metric blocks in the .conf file for each diskusage metric. Then start gmond normally and the individual disk usage metrics will be collected as expected.
Brad
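The "create additional metric blocks" step described above is mechanical once the real names are known, so it can be generated. This is only a sketch: the two metric names below are invented stand-ins for whatever `gmond -m` prints on a given host, and the value_threshold is an arbitrary choice.

```python
def metric_blocks(metric_names, value_threshold=1.0):
    """Render one gmond `metric { ... }` block per concrete metric name."""
    blocks = []
    for name in metric_names:
        blocks.append(
            "  metric {\n"
            '    name = "%s"\n'
            "    value_threshold = %s\n"
            "  }" % (name, value_threshold)
        )
    return "\n".join(blocks)


# Hypothetical names; substitute the disk usage names that gmond -m lists.
names = ["disk_used-example-a", "disk_used-example-b"]
print(metric_blocks(names))
```

The printed blocks would replace the single placeholder block inside the collection_group of the .pyconf file.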
Re: [Ganglia-general] Ganglia 3.1.3 beta ready for testing
On 10/1/2009 at 4:33 PM, in message d4c731da0910011533p2d337d0ajc80ea158d2a7...@mail.gmail.com, Bernard Li bern...@vanhpc.org wrote:

So has anybody else given 3.1.3 a test run? I have found some minor issues. It looks like there are new configure options added in regards to setuid and setgid:

    --enable-debug          turn on debugging output and compile options
    --enable-gexec          turn on gexec support (platform-specific)
    --enable-setuid=USER    turn on setuid support (default setuid=nobody)
    --enable-setgid=GROUP   turn on setgid support (default setgid=daemon)

There are 2 issues:

- extra quotation marks in the text
- --enable-setuid is OFF by default. This is the opposite behaviour from previous released versions

On top of that, our spec file has not been updated with this new configure option and therefore the RPMs I posted do *not* setuid. I'm not sure if we should consider this a show stopper, but a simple fix would be to change the default configure option so that it reflects the previous behaviour. Please let me know what you guys think.

If this is just a simple fix, then I would vote for scrapping 3.1.3, rolling 3.1.4 with the fix and resetting the test period. The other option, since this isn't a regression, would be to release 3.1.3 as is with the defect noted in the release notes, then release 3.1.4 next month with the fixes. I would vote for the first option, but I'm OK with the second if that is the way everybody else wants to go.

Brad
Re: [Ganglia-general] How to remove a gmetric-added metric?
No, but you should be able to get the same results by setting host_dmax in the gmond.conf file.

Brad

On 9/2/2009 at 1:37 AM, in message 68fea9390909020037y2094d15es42bd13da3ea0...@mail.gmail.com, Matt mattmora...@gmail.com wrote:

Is there such a thing as dmax in the python interface?

2009/9/2 Rick Cobb rc...@quantcast.com:

And just to keep the ganglia-general list hopping: is there any reason the default for gmetric is to send with a dmax of "infinity" (0 is the syntax for that)? We've patched ours internally to default to 60s tmax, 600s dmax for exactly the reason that people often have bugs in their gmetric scripts, and trying to work around dead metrics on lots of machines is painful.

-- ReC

On 9/1/09 11:47 PM, Rick Cobb rc...@quantcast.com wrote:

That sentence should have read "once the dmax passes, gmond will stop sending the metric." Eventually, gmetad also considers the metric to have expired and stops reporting it to the GUI, so the chart goes away. Then all you need to do is remove the rrdtool file. Sorry for the premature "send".

-- ReC

On 9/1/09 6:34 PM, Rick Cobb rc...@quantcast.com wrote:

If you're just trying to get it to expire from your gmond/gmetad XML, just send it again with a non-zero tmax and dmax. Once the dmax passes, the

Once the graph disappears, remove the rrdtool file using "rm".

-- ReC

On 9/1/09 8:10 AM, Raimund Eimann raim...@local.ch wrote:

Hi,

is there an easy way to get rid of a metric that was added with gmetric in the first place?

Cheers,
Raimund
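The expiry rule discussed in this thread can be written down in a few lines. This is a simplified model of the behaviour as described here (dmax of 0 never expires, otherwise the metric is dropped once it has been silent for longer than dmax), not code taken from gmond; the function name is made up.

```python
def is_expired(tn, dmax):
    """Return True when a metric last heard `tn` seconds ago should be dropped.

    dmax == 0 means "never expire", which per this thread is what gmetric
    sends by default.
    """
    if dmax == 0:
        return False
    return tn > dmax


# A metric sent once with the default dmax stays around indefinitely...
print(is_expired(tn=100000, dmax=0))    # False
# ...while one re-sent with dmax=600 is dropped after ten silent minutes.
print(is_expired(tn=601, dmax=600))     # True
```

This is why re-sending the metric once with a non-zero dmax and then leaving it alone is enough to make it age out.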
Re: [Ganglia-general] python metric modules
Why can't you just do the following:

    # webserver.pyconf
    modules {
      module {
        name = "lsof"
        language = "python"
        param httpd {
          value = "doesnt-matter"
        }
        param crawler {
          value = "doesnt-matter"
        }
      }
    }

    collection_group {
      collect_every = 30
      time_threshold = 60
      metric {
        name = "lsof-httpd"
        title = "Web Server open files"
        value_threshold = 1.0
      }
      metric {
        name = "lsof-crawler"
        title = "crawler open files"
        value_threshold = 1.0
      }
    }

Then just iterate through all of the params that your module receives and construct a unique metric name by appending the param name to the module name. Then create a separate descriptor for each unique metric name and pass that back. Gmond will generate a metric for each of the descriptors and call your module to gather each one individually. This way you have one python module that is loaded once but generates metrics for each process that you specify in the config file.

Brad

On 9/2/2009 at 2:15 AM, in message 68fea9390909020115y1485f0b2j8056ae00b36be...@mail.gmail.com, Matt mattmora...@gmail.com wrote:

Hi all,

I'm trying to make my python metric modules as versatile as possible, but I'm not sure exactly how they are supposed to be used. For instance, I've got a python script that grabs open files for a process. Can I invoke this module multiple times with different pyconf files? And if so, how do I change the metric name? I don't really want to use multiple descriptors, as I don't want to check for processes that I know are not on the server.

Example: the lsof.py script has a descriptor for lsof. Can I use the same lsof.py for two pyconf files? Won't the metrics both show up as lsof? Or do I need a different lsof.py for every process/metric I want to capture?

    # webserver.pyconf
    modules {
      module {
        name = "lsof"
        language = "python"
        param name {
          value = "httpd"
        }
      }
    }

    collection_group {
      collect_every = 30
      time_threshold = 60
      metric {
        name = "lsof"
        title = "Web Server open files"
        value_threshold = 1.0
      }
    }

    # crawler.conf
    modules {
      module {
        name = "lsof"
        language = "python"
        param processname {
          value = "crawler"
        }
      }
    }

    collection_group {
      collect_every = 30
      time_threshold = 60
      metric {
        name = "lsof"
        title = "crawler open files"
        value_threshold = 1.0
      }
    }

Thanks,
Matt
Re: [Ganglia-general] issues with upgrade to 3.1
As Bernard mentioned, take a look at the upgrade release notes. Please see the section "Upgrading from 3.0" in the 3.1.x release notes:

http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_release_notes

You can't mix gmond 3.0.x and gmond 3.1.x in the same cluster. All of the gmond nodes within a cluster have to be upgraded at the same time.

Brad

On 8/27/2009 at 2:33 PM, in message 4a96edad.5070...@phys.ufl.edu, Yu Fu y...@phys.ufl.edu wrote:

I have just upgraded the frontend with gmetad to 3.1.2, but the Ganglia web display is still the same. That is, the data of nodes with Ganglia 3.1.2 never gets changed/updated, with constant straight lines displayed in the plots, while nodes with Ganglia 3.0.7 are just fine in the plots. What is wrong?

Thanks,
Yu

Yu Fu wrote:

No, I only upgraded the compute nodes. The machine running gmetad is still the old 3.0.7.

Yu

Bernard Li wrote:

Hi Yu:

On Thu, Aug 27, 2009 at 9:23 AM, Yu Fu y...@phys.ufl.edu wrote:

I upgraded Ganglia from 3.0.7 to 3.1.2 on some compute nodes while the ones on other nodes remain unchanged. I know the gmond.conf has changed in 3.1, so I created a new gmond.conf for those upgraded machines from a template obtained from gmond -t. Things looked fine until I noticed that the upgraded nodes' data never changed/updated in the Ganglia plots, even after clicks on the "Get Fresh Data" button. As a result, all plots of upgraded nodes just show a flat straight line with the initial values. Is this expected or is something wrong? Do I have to upgrade all nodes at the same time?

Have you upgraded your gmetad to 3.1.2 as well? Please see the section "Upgrading from 3.0" in the 3.1.x release notes:

http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_release_notes

Cheers,
Bernard
Re: [Ganglia-general] ganglia returns wrong value for python module
On 8/20/2009 at 9:01 AM, in message 68fea9390908200801m3e1f43ecy2c33e743ccc0d...@mail.gmail.com, Matt mattmora...@gmail.com wrote:

Hi all,

I'm getting inconsistent results when gmond is running my python module.

    # gmond --version
    gmond 3.1.2
    # ./lsof.py
    (0, '2699')
    2699
    2700
    1000
    # gmond -d1
    (0, '535')
    551
    552
    1000

--- python code ---

    #!/usr/bin/python
    import commands

    def lsof_handler(name):
        cmd = 'lsof -p 3123 | wc -l'
        lsof = commands.getstatusoutput(cmd)[1]
        print commands.getstatusoutput(cmd)
        print lsof
        print int(lsof) + 1
        lsof = 1000
        print lsof

Any ideas? I've got a similar function that uses commands.getstatusoutput() to grab the pid of a process, which works fine.

I would suggest adding something like the following to your python module and then running it as a standalone script outside of gmond to see what your module is returning. I haven't seen any problems like this with any other python module.

    # This code is for debugging and unit testing
    if __name__ == '__main__':
        # If your module expects configuration parameters, adjust
        # this line to match the expected parameters
        params = {'RandomMax': '500', 'ConstantValue': '322'}
        # Manually call the metric_init function
        metric_init(params)
        # Manually call the callbacks that were defined in the
        # metric descriptors. Then print the return values.
        for d in descriptors:
            v = d['call_back'](d['name'])
            print 'value for %s is %u' % (d['name'], v)
Re: [Ganglia-general] special metric names in diskusage.pyconf file
On 7/14/2009 at 4:36 PM, in message 4120cbd6bbd82647b89d6a70694510bed1c...@exchange02.presidio.alexa.com, Guolin Cheng guo...@alexa.com wrote:

Hi,

Does anyone know what the metric name disk_used-metric-name stands for? The stanza is from the diskusage.pyconf file, ganglia version 3.1.1/2.

    collection_group {
      ...
      metric {
        name = "disk_used-metric-name"
        ...
      }
      ...

It looks like the name stands for a series of metrics output from the associated python module, but I am not sure what the rule behind it is. Can anyone shed some light on this? Thanks a lot.

It is a placeholder for the actual metric name, which isn't determined until you run gmond. It seems like a chicken-and-egg thing, but what you have to do is run gmond with the -m parameter first. This will give you a list of all of the possible metrics, including those that come from diskusage.py. Then extract the actual diskusage metric names from the list and plug them into the .conf file. In most cases you will have to create additional metric blocks in the .conf file for each diskusage metric. Then start gmond normally and the individual disk usage metrics will be collected as expected.

Brad
Re: [Ganglia-general] gmond returning XML with large negative TN values (ganglia 3.1.2, linux x86_64)
On 7/13/2009 at 1:06 AM, in message d9c3f61a0907130006q5cdf7d8fg85ed8ea7f7ea3...@mail.gmail.com, Pavel Shevaev pacha.shev...@gmail.com wrote:

Hi folks,

Looks like gmetad ignores reports from gmond returning records with large negative TN values. gmond started to behave like that after the computer was restarted. Here's a sample of gmond's output acquired with nc localhost 8649:

<GANGLIA_XML VERSION="3.1.2" SOURCE="gmond">
<CLUSTER NAME="host1" LOCALTIME="1247467796" OWNER="BIT" LATLONG="unspecified" URL="unspecified">
<HOST NAME="localhost" IP="127.0.0.1" REPORTED="1247478928" TN="-11132" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1247478927">
<METRIC NAME="tcp_closed" VAL="0" TYPE="uint32" UNITS="Sockets" TN="-11143" TMAX="20" DMAX="0" SLOPE="both">
...
</METRIC>

I believe these large negative TN values somehow make gmetad, gstat, etc. think the host is down. Here's what gstat says:

CLUSTER INFORMATION
Name: host1
Hosts: 0
Gexec Hosts: 0
Dead Hosts: 1
Localtime: Mon Jul 13 10:55:02 2009

But gmond is definitely alive; here's some output from strace:

$ sudo strace -p 15911
Process 15911 attached - interrupt to quit
epoll_wait(3, {{EPOLLIN, {u32=7117640, u64=7117640}}}, 2, 10627587) = 1
accept(5, {sa_family=AF_INET, sin_port=htons(42589), sin_addr=inet_addr("192.168.4.10")}, [140733193388048]) = 7
write(7, "<?xml version=\"1.0\" encoding=\"ISO"..., 2489) = 2489
write(7, "<GANGLIA_XML VERSION=\"3.1.2\" SOUR"..., 45) = 45

After restarting gmond everything becomes fine. Any ideas on what can be the reason for such strange behavior?

The only thing that I know of that would cause this behavior is if the system clocks on your various nodes are out of sync. TN reports the time stamp offset between the time that the metric was actually gathered and the time that it is being reported to gmetad. If the system clock on the node that is gathering the metric is ahead of the system clock on the node that is reporting the metrics to gmetad, the calculation that determines the TN can go negative. Check to make sure that the system clocks on all of the nodes running gmond are in sync.

Brad
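The numbers in Pavel's XML are consistent with this explanation. Assuming TN is simply the reporting node's current time minus the time the data was last heard (an assumption based on Brad's description, not on the gmond source), the HOST TN can be reproduced from the LOCALTIME and REPORTED attributes:

```python
def tn_seconds(localtime, reported):
    """Age of a report: the reporter's 'now' minus the 'last heard' timestamp."""
    return localtime - reported


# Values taken from the XML quoted in the message above.
localtime = 1247467796  # CLUSTER LOCALTIME
reported = 1247478928   # HOST REPORTED
print(tn_seconds(localtime, reported))  # -11132, the HOST TN in the XML
```

In the same XML, GMOND_STARTED (1247478927) is also later than LOCALTIME, which would fit a clock that was stepped back by roughly three hours after gmond had started, for example by a time sync late in the boot sequence. That part is a guess, but it would also explain why restarting gmond clears the problem.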
Re: [Ganglia-general] No rrd file being created for metrics
You mentioned the udp_send_channel configuration, but did you set up the udp_recv_channel? Gmond has to be able to listen to itself as well as everybody else in order to collect the metrics that will be reported to gmetad.

Brad

On 6/30/2009 at 12:46 AM, in message 93f24cb40b05984b8c87edfe5773374d3caa95b...@lonmc01010.rbsres07.net, wayne.pas...@rbs.com wrote:

All,

Sorry to bump this, but does anyone have any ideas on a solution for this?

Regards,
Wayne Pascoe
RBS Global Banking Markets
Office: +44 20 3361 9571 | Mobile: +44 7799 707450

-Original Message-
From: PASCOE, Wayne, GBM
Sent: 26 June 2009 11:24
To: ganglia-general@lists.sourceforge.net
Subject: Re: [Ganglia-general] No rrd file being created for metrics

Thanks for your reply - hopefully this will clarify the missing bits of information.

-Original Message-

Your email is missing some relevant information. gmond doesn't store metrics into RRD files, it only makes the metric information available via XML. gmetad is the process that gets these metrics (from gmond nodes) and stores them in RRD files on disk. You seem to be debugging under the assumption that your problem is that the collector gmond isn't getting the metrics from the other gmond, but you haven't given us information to indicate that you've verified that.

To clarify, I will call our regular collector 'BoxA' and the host with the disk metrics 'WinBox'. BoxA is a Linux host running Ganglia 3.0.7 and WinBox is a Windows host running Ganglia 3.0.7. When I set up WinBox as my cluster collector node in gmetad, all of the metrics that I expect to be present are there and RRD files are created for them.
They are only missing if I configure WinBox to send to BoxA and set up BoxA as my collector in gmetad.conf.

Working config:

    WinBox gmond.conf
    udp_send_channel {
      host = WinBox
      port = 8649
    }

    gmetad.conf
    data_source "My Cluster" WinBox.mydomain.com

Failing config:

    WinBox gmond.conf
    udp_send_channel {
      host = BoxA
      port = 8649
    }

    BoxA gmond.conf
    udp_send_channel {
      host = BoxA
      port = 8649
    }

    gmetad.conf
    data_source "My Cluster" BoxA.mydomain.com

So:

- Can you look at the XML from your collector node (tcp port 8649) and check if it's actually missing the metrics?

The metrics are there, and appear in gmetad when using WinBox as my collector. When WinBox is configured to send to itself (see the working configuration above), telnetting to 8649 on the server shows those metrics are present. When I set BoxA to be the collector and configure WinBox to send to that, the metrics are NOT present when I telnet to BoxA 8649.

- Is your gmetad polling the correct collector nodes?

Yes, it is - in both scenarios, WinBox appears in the collection in gmetad.

At this point, it looks as if the data goes missing when it is sent to BoxA, the Linux collector. I cannot see the metrics in the XML output any more, so it makes sense that they never arrive at the gmetad host. On this basis, where do I go to investigate this further?

On BoxA, sbin/gmond -m does not include any of the metrics that I am collecting on my Windows box and wish to appear in my cluster - is this relevant? Do these metrics have to be supported by this gmond to be collected? If so, how can I configure them, given that my earlier attempt to include them caused gmond to fail with the following message:

Unable to collect metric 'phys_disk_time' on this platform. Exiting.
Thanks in advance for any assistance that anyone can give :)

--
Wayne Pascoe
RBS Global Banking Markets
Office: +44 20 3361 9571 | Mobile: +44 7799 707450
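Following up on Brad's question at the top of this thread: the failing BoxA configuration quoted above shows only a udp_send_channel. A sketch of what a unicast collector's gmond.conf might contain is shown below. The host name and port are the ones from Wayne's message; whether this alone resolves the missing Windows metrics was not confirmed in the thread.

```
/* BoxA gmond.conf (sketch): send to itself and also listen on the same port */
udp_send_channel {
  host = BoxA
  port = 8649
}

udp_recv_channel {
  port = 8649
}
```

Without a udp_recv_channel, gmond on BoxA would have nothing listening for the packets WinBox sends, which would match the symptom of the metrics never appearing in BoxA's XML on port 8649.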
Re: [Ganglia-general] Gmond strange metric TN too large issue or multicast metric lost?
The TN value simply indicates the time offset from the reported timestamp at which the metric was last received from the managed node. In other words, it is the age of the metric. A large number would indicate that the metric value has not been updated for a long period of time. This might be because the reporting interval has been set to a very large time period, or it may have something to do with multicast packets being lost in your network.

I would suggest that you run gmond in debug mode (-d 10) on both the managed node and the collecting node and then try to correlate when the reporting node sends one of your custom metrics with when, or if, the collecting node receives it. The most obvious thing that this would tell you is whether your multicast packets are being lost or blocked by a router in your network. If the collecting node is actually receiving the packet in a timely manner but the TN value is still large, then we would have to look at a possible bug in gmond.

The fact that you seem to be losing only the python metrics seems to indicate that this might be either a configuration error or a problem with the metric definition of the custom python metric. Do you have the same problem with any of the standard shipping python metrics?

Brad

On 6/24/2009 at 9:39 AM, in message bay140-w76585dd65f08f17a35d1bb3...@phx.gbl, liangfan xfanli...@hotmail.com wrote:

I'm trying to figure out a very puzzling issue in our ganglia system. We are using ganglia 3.1.1. We have a strange issue where some metrics always have a TN that is too large.

The system is configured as follows:

- We have gmond deployed on 16 nodes (A-001--A008, B001--B008).
- Gmond is configured to use multicast mode; each node has all metrics.

The issues are:

- TN on some nodes is OK, while others have errors.
- Some metrics of one host have a TN that is too large, while other metrics of the same node are OK.

We guess the kernel may be dropping these packets. You can see the detailed analysis at the end.
I found a thread on the mailing list that may relate to this: http://www.mail-archive.com/ganglia-develop...@lists.sourceforge.net/msg02942.html

MonAMI also has a page that might relate to this: http://monami.sourceforge.net/tutorial/ar01s06.html

In the part on preventing metric-update loss, it says: The current Ganglia architecture requires each metric update be sent as an individual metric-update message. On a moderate-to-heavily loaded machine, there is a chance that gmond may not be scheduled to run as the messages arrive. If this happens, the incoming messages will be placed within the network buffer. Once the buffer is full, any subsequent metric-update messages will be lost. This places a limit on how many metric update messages can be sent in one go. For 2.4-series Linux kernels the limit is around 220 metric-update messages; for 2.6-series kernels, the limit is around 400.

However, we are still confused about the symptoms:
- We do not see much buffered in the port 8649 REV_Q, and our nodes are not heavily loaded.
- Why are all the core metrics received and up to date, while almost all the custom python metrics are lost and their TN gets too large?
- Why do some nodes always get outdated custom python metrics, while other nodes are ok?

I've been scratching my head over this for almost a week now; I've searched the ganglia mailing list archives, but cannot get more info. Any help/suggestions/advice would be very much appreciated -- it's really very frustrating! Below is the detailed analysis.

- Detailed analysis

Here is part of the XML output from A-001.
telnet localhost 8649:

HOST NAME=B-002 IP=X.X.X.119 REPORTED=1245822864 TN=9 TMAX=20 DMAX=0 LOCATION= GMOND_STARTED=1245710345
METRIC NAME=proc_run VAL=0 TYPE=uint32 UNITS= TN=45 TMAX=950 DMAX=0 SLOPE=both
EXTRA_DATA
EXTRA_ELEMENT NAME=GROUP VAL=process/
/EXTRA_DATA
/METRIC
METRIC NAME=load_five VAL=1.13 TYPE=float UNITS= TN=7 TMAX=325 DMAX=0 SLOPE=both
EXTRA_DATA
EXTRA_ELEMENT NAME=GROUP VAL=load/
/EXTRA_DATA
/METRIC
METRIC NAME=WritesPerSec VAL=0.00 TYPE=float UNITS= TN=112225 TMAX=60 DMAX=0 SLOPE=both
EXTRA_ELEMENT NAME=GROUP VAL=Status/
/EXTRA_DATA
/METRIC
METRIC NAME=db_used VAL=20233 TYPE=uint32 UNITS= TN=112225 TMAX=60 DMAX=0 SLOPE=both
EXTRA_DATA
EXTRA_ELEMENT NAME=GROUP VAL=Status/
/EXTRA_DATA
/METRIC

From the XML, we can see that gmond gets heartbeat info from B-002. The TN of all the core metrics collected by gmond (e.g. proc_run, load_five) is ok, while the TN of most metrics collected by our python module extension (e.g. WritesPerSec, db_used) is large (TN=112225).

We used tcpdump on B-002 and found that B-002 sends out all the metrics to the multicast address (X.X.X.119--239.X.X.110:8649). On A-001, we found that A-001 receives all the multicast messages accordingly (we see the same X.X.X.119--239.X.X.110:8649 messages in tcpdump). This means the multicast messages reach A-001.

Then we used strace to trace gmond and found: On B-002 gmond send out all
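For anyone who wants to spot stale metrics without reading the dump by eye, here is a minimal sketch that parses gmond XML and lists metrics whose TN exceeds their TMAX. The sample document is hand-built: the attribute values come from the post above, but the tags, quoting and the CLUSTER wrapper are reconstructed and may differ from real gmond output. Treating TN > TMAX as "stale" is only a heuristic chosen for illustration, and `find_stale_metrics` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-built sample. Values are from the post; the markup is an assumption.
SAMPLE_XML = """
<GANGLIA_XML>
 <CLUSTER NAME="example">
  <HOST NAME="B-002" REPORTED="1245822864" TN="9" TMAX="20">
   <METRIC NAME="proc_run" VAL="0" TN="45" TMAX="950"/>
   <METRIC NAME="load_five" VAL="1.13" TN="7" TMAX="325"/>
   <METRIC NAME="WritesPerSec" VAL="0.00" TN="112225" TMAX="60"/>
   <METRIC NAME="db_used" VAL="20233" TN="112225" TMAX="60"/>
  </HOST>
 </CLUSTER>
</GANGLIA_XML>
"""


def find_stale_metrics(xml_text, factor=1):
    """Return (host, metric, tn, tmax) for every metric with TN > factor * TMAX."""
    root = ET.fromstring(xml_text)
    stale = []
    for host in root.iter('HOST'):
        for metric in host.iter('METRIC'):
            tn = int(metric.get('TN'))
            tmax = int(metric.get('TMAX'))
            if tn > factor * tmax:
                stale.append((host.get('NAME'), metric.get('NAME'), tn, tmax))
    return stale


for entry in find_stale_metrics(SAMPLE_XML):
    print('%s %s TN=%d TMAX=%d' % entry)
```

Running the same check against the XML from both a healthy node and a problem node should show quickly whether only the python metrics are affected.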
Re: [Ganglia-general] metric_cleanup not being called in my python module
On 5/26/2009 at 7:22 PM, in message dcccdf790905261822w63e3447crf21bef5390b...@mail.gmail.com, David Birdsong david.birds...@gmail.com wrote:

I've been searching around for a while now; any suggestions on where I can get apr-debug ...or is it apr-util-debug? I'm on Fedora 8, but source is fine.

I'm not sure where to find a built RPM. I have always just built it myself. You want a debug version of APR. Gmond doesn't use apr-util.

Brad

On Tue, May 26, 2009 at 8:35 AM, Brad Nicholes bnicho...@novell.com wrote:

On 5/24/2009 at 12:43 AM, in message dcccdf790905232343y76481e5dw6c1df62bc732c...@mail.gmail.com, David Birdsong david.birds...@gmail.com wrote:

I have a python module that spawns a separate thread that collects data off of a pipe. Everything runs fine, but I'm finding that metric_cleanup is never called. When I strace the PID of the worker thread (in Linux, so it gets its own PID), I see it gets a SIGTERM when I stop gmond instead of exiting under its own power. All of the gmond processes exit, but a subprocess of my worker thread just ends up being reparented to init instead of being cleaned up by my metric_cleanup logic. The worker thread reads from an endless pipe using a select.poll with a timeout, so the pipe shouldn't block. I need to know when to kill the process on the other end of the pipe, which is what metric_cleanup should be providing. I even removed all cleanup code from metric_cleanup() and just put an open('/tmp/ganglia_kill', 'w'),...but no file is created. What can I investigate to understand why it's being ignored?

Gmond depends on the APR memory pools for invoking the cleanups. Basically the way it works is that when gmond starts up, it creates an APR memory pool. This memory pool is used to allocate and manage memory for everything in gmond that deals with APR. One of the features of APR memory pools is that I can tie functions to a pool that get invoked when the memory pool is cleaned up. In this case, the function setup_metric_callbacks() in gmond.c ties all of the module cleanup functions to the main global memory pool. When gmond exits, the last thing that happens is that the memory pools that were created are destroyed. This should trigger all of the cleanup routines.

To debug this, you will need a debug version of APR, and you will need to set a breakpoint in apr_terminate(). Also set a breakpoint in apr_pool_destroy(). These functions should be getting called automatically when the gmond process shuts down.

Another quick workaround would be to explicitly call apr_pool_destroy(global_context) as the last statement in main.c. This will force the destruction of the global memory pool, which should also cause the cleanup routines to be called.

Brad

___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] metric_cleanup not being called in my python module
On 5/24/2009 at 12:43 AM, in message dcccdf790905232343y76481e5dw6c1df62bc732c...@mail.gmail.com, David Birdsong david.birds...@gmail.com wrote:

I have a python module that spawns a separate thread that collects data off of a pipe. Everything runs fine, but I'm finding that metric_cleanup is never called. When I strace the PID of the worker thread (in Linux, so it gets its own PID), I see it gets a SIGTERM when I stop gmond instead of exiting under its own power. All of the gmond processes exit, but a subprocess of my worker thread just ends up being reparented to init instead of being cleaned up by my metric_cleanup logic. The worker thread reads from an endless pipe using a select.poll with a timeout, so the pipe shouldn't block. I need to know when to kill the process on the other end of the pipe, which is what metric_cleanup should be providing. I even removed all cleanup code from metric_cleanup() and just put an open('/tmp/ganglia_kill', 'w'),...but no file is created. What can I investigate to understand why it's being ignored?

Gmond depends on the APR memory pools for invoking the cleanups. Basically the way it works is that when gmond starts up, it creates an APR memory pool. This memory pool is used to allocate and manage memory for everything in gmond that deals with APR. One of the features of APR memory pools is that I can tie functions to a pool that get invoked when the memory pool is cleaned up. In this case, the function setup_metric_callbacks() in gmond.c ties all of the module cleanup functions to the main global memory pool. When gmond exits, the last thing that happens is that the memory pools that were created are destroyed. This should trigger all of the cleanup routines.

To debug this, you will need a debug version of APR, and you will need to set a breakpoint in apr_terminate(). Also set a breakpoint in apr_pool_destroy(). These functions should be getting called automatically when the gmond process shuts down.

Another quick workaround would be to explicitly call apr_pool_destroy(global_context) as the last statement in main.c. This will force the destruction of the global memory pool, which should also cause the cleanup routines to be called.

Brad
Re: [Ganglia-general] CVE-2009-0241
On 3/10/2009 at 1:14 PM, in message m3fxhlchqn@unna.nsc.liu.se, Leif Nixon ni...@nsc.liu.se wrote:

Linkoping University

The issue has been there for a while. See the associated bug report. Also, since it is an issue with the interactive port, the attacker would have to have access to that port. In most cases the port should have been protected either by the trusted_hosts configuration or by the fact that gmetad should be running behind a firewall, which would prevent external access. Since only the web front end (i.e. apache) and/or a downstream gmetad need access to the interactive port, there should really be no reason for this port to be exposed to anything else, either inside or outside a firewall.

Brad
[Ganglia-general] [ANNOUNCEMENT] - Release Ganglia 3.1.2
The Ganglia Project (http://ganglia.info) is pleased to announce the official release of Ganglia 3.1.2.

The official tarball is available for immediate download at: http://sourceforge.net/project/showfiles.php?group_id=43021&package_id=35280&release_id=661845

For a full description of the bug fixes and enhancements that are included in the 3.1.2 release as well as upgrade information, please see the current release notes at: http://ganglia.wiki.sourceforge.net/ganglia_release_notes

Supported platforms:

* Linux (Fedora/RedHat/CentOS, Debian, Gentoo, SuSE/OpenSuSE)
* [Open]Solaris
* FreeBSD
* NetBSD
* OpenBSD
* DragonflyBSD
* Cygwin (no support for DSO yet)
* AIX (no support for DSO yet)

Please read all the README, INSTALL and other available documentation (http://ganglia.wiki.sourceforge.net) as a lot of things have changed since version 3.0. Use good deployment practices when upgrading from 3.0.x to make sure that you do not mix gmond 3.0 and 3.1 nodes in the same cluster (as defined by a multicast address or unicast collector node). The protocol that allows gmond nodes to communicate within the same cluster has changed. However, the XML packets that are passed between gmond and gmetad have remained compatible from 3.0.x to 3.1.x, allowing a 3.1.x gmetad to continue to pull data from an older 3.0.x gmond cluster.

Ganglia Development Team
Re: [Ganglia-general] [ANNOUNCEMENT] Ganglia 3.1.2 testing tarball...
We are halfway through the testing period and I haven't heard any feedback on the list about how testing is going with 3.1.2. So the natural assumptions are that either nobody is testing the latest version or that testing is going so well that there is really nothing to report. I am hoping that it is the latter. :) If anybody has anything to report (good or bad), please send a quick email to the list.

thanks,
Brad

On 1/30/2009 at 8:18 AM, in message 4982b7ef02ac0003a...@lucius.provo.novell.com, Brad Nicholes bnicho...@novell.com wrote:

In an effort to continue improving the Ganglia software, the Ganglia Project has released an official testing release of Ganglia 3.1.2. The testing tarball is available for immediate download at: http://www.ganglia.info/testing/

The intent of this testing release of Ganglia 3.1.2 is to validate that the source code is stable and that the bug fixes and enhancements that have been added since the previous release of the software are ready for general release. The release procedure from this point has been documented on the Ganglia wiki site at http://ganglia.wiki.sourceforge.net/ganglia_works under the heading Generating a Release Candidate and GA Release.

Basically the Ganglia 3.1.2 testing tarball has been rolled and made available for testing by the Ganglia community. All bugs found in this testing release should be immediately reported through bugzilla (http://bugzilla.ganglia.info) and can be posted to the ganglia-develop...@lists.sourceforge.net mailing list as well. If the bug report is also accompanied by a bug fix patch, this will help avoid delays in producing new testing tarballs and ultimately an official general release of the software. If any critical level bugs are discovered, the current testing release tarball will be thrown away and a new tarball will be rolled and made available for further testing.
Once a testing release tarball has been validated by the Ganglia community to be stable and ready for general availability, that tarball will become the official Ganglia 3.1.x release. So basically the sooner we are able to test and validate the Ganglia 3.1 source code, the sooner the project will be able to create an official release. But we need your help to get this done. Any and all testing and feedback, positive or negative, will be greatly appreciated.

There will be a two week testing period for this 3.1.2 tarball which begins from the date of this announcement. So please help us to make sure that the tarball is valid and stable by building and installing it on any size of testing environment.

Known issues with this testing release will be addressed on the Ganglia wiki site at: http://ganglia.wiki.sourceforge.net/Testing_Release_Notes

For those who are interested in upgrading from a current 3.0.x installation, please see the current release notes at: http://ganglia.wiki.sourceforge.net/ganglia_release_notes

Supported platforms (additional testing requested):

* Linux (Fedora/RedHat/CentOS, Debian, Gentoo, SuSE/OpenSuSE)
* [Open]Solaris
* FreeBSD
* NetBSD
* OpenBSD
* DragonflyBSD
* Cygwin (no support for DSO yet)
* AIX (no support for DSO yet)

Please read all the README, INSTALL and other available documentation (http://ganglia.wiki.sourceforge.net) as a lot of things have changed since 3.0.7. Use good deployment practices when upgrading from 3.0.x to make sure that you do not mix gmond 3.0 and 3.1 nodes in the same cluster (as defined by a multicast address or unicast collector node). The protocol that allows gmond nodes to communicate within the same cluster has changed. However, the XML packets that are passed between gmond and gmetad have remained compatible from 3.0.x to 3.1.x, allowing a 3.0.x gmetad to continue to pull data from a newer 3.1.x gmond cluster.
happy testing
Re: [Ganglia-general] ganglia upgrade
On 1/20/2009 at 10:48 AM, in message 32128a4489900844a3dba3e8273be22e1348965...@in01wxmbx1.internal.synopsys.com, Hardik Shah hardik.s...@synopsys.com wrote:

Hi, Does anyone have any information on upgrading a ganglia cluster? I have configured around 200 machines with ganglia 3.0.7, but now I want to upgrade the cluster to version 3.1, which has more features. Any suggestion would be appreciated. -Hardik

Check out the release notes at: http://ganglia.wiki.sourceforge.net/ganglia_release_notes
Re: [Ganglia-general] custom metric's value doesn't update --custom python metric modules on 3.1.1
I haven't tried to actually run your module yet, but could this be a permissions problem? What user are you running gmond as? Does that user have permission to run rndc and access named.stats? All modules run by gmond will be run as the same user as gmond. Therefore you have to make sure that the user that gmond is running as has sufficient permissions to perform any work being done by the modules.

Brad

On 12/12/2008 at 4:24 AM, in message dcccdf790812120324j79fbd1e1y14ebf50f6313f...@mail.gmail.com, David Birdsong david.birds...@gmail.com wrote:

I'm having the exact same problem. I can run the if __name__ == '__main__' test and have the script define and execute the callbacks via the descriptor list, but when running from gmond, all that gets called is metric_init(). I put an open in the init and made the open file a global as a test, because I thought maybe I couldn't rely on STDOUT from inside the handler. Only metric_init writes to it, despite my putting a file write in the handler. Any clue as to why gmond is initializing but not calling the handler in the module?
Here's my pyconf:

modules {
  module {
    name = dns
    language = python
  }
}

collection_group {
  collect_every = 20
  time_threshold = 45
  metric {
    name = zDNS_success
    title = Temperature
    value_threshold = 10
  }
  metric {
    name = zDNS_nxrrset
    title = Temperature
    value_threshold = 10
  }
  metric {
    name = zDNS_referral
    title = Temperature
    value_threshold = 10
  }
  metric {
    name = zDNS_nxdomain
    title = Temperature
    value_threshold = 10
  }
  metric {
    name = zDNS_recursion
    title = Temperature
    value_threshold = 10
  }
  metric {
    name = zDNS_failure
    title = Temperature
    value_threshold = 10
  }
  metric {
    name = zDNS_duplicate
    title = Temperature
    value_threshold = 10
  }
  metric {
    name = zDNS_dropped
    title = Temperature
    value_threshold = 10
  }
}

## Here's my python module:

#!/usr/bin/python
import re
import sys
import time
import subprocess

NamePrefix = 'zDNS'
LastRecord = {}
MaxAge = 30
descriptors = []
out_file = ''


def return_collection_data(collection_string):
    # Parse one stats dump into (dump timestamp, {metric name: value}).
    global NamePrefix
    collection_date_re = re.compile(r'^.*\(([0-9]+)\)$')
    collection = filter(lambda x: x, collection_string.split('\n'))
    collection_date = collection_date_re.search(collection.pop()).group(1)
    collection_map = {}
    for record in collection:
        name, value = record.split()
        name = '%s_%s' % (NamePrefix, name)
        collection_map[name] = int(value)
    return (int(collection_date), collection_map)


def return_latest_record():
    # Ask named to dump its stats, then return the most recent dump.
    collections_splitter_re = re.compile(r'^\+\+\+\sStat.*Dump\s\+\+\+\s\([0-9]+\)$', re.MULTILINE)
    cmd = ['/usr/sbin/rndc', 'stats']
    x = subprocess.Popen(cmd)
    r_code = x.wait()
    if r_code != 0:
        print >>sys.stderr, 'bad things happened calling %s ' % cmd
        print >>sys.stderr, 'need to throw and catch an exception'
    f = '/var/named/named.stats'
    collections = collections_splitter_re.split(open(f).read())
    collections = map(lambda x: x.strip(), collections)
    foo = {}
    for collection in collections:
        collection = collection.strip()
        if not collection:
            continue
        collection_date, collection_data = return_collection_data(collection)
        foo[collection_date] = collection_data
    k = foo.keys()
    k.sort()
    return (time.time(), foo[k.pop()])


def named_stats_handler(value):
    # Metric callback: 'value' is the metric name requested by gmond.
    global LastRecord
    global MaxAge
    global out_file
    out_file.write('in handler name %s \n' % value)
    timestamp, data = LastRecord
    if time.time() - timestamp > MaxAge:
        timestamp, data = return_latest_record()
        LastRecord = (timestamp, data)
    out_file.write('Current Record %s\n' % LastRecord.__str__())
    out_file.flush()
    return data[value]


def metric_init(params):
    global descriptors
    global NamePrefix
    global LastRecord
    global out_file
    out_file = open('/tmp/ganglia_david', 'w')
    LastRecord = return_latest_record()
    out_file.write('LastRecord %s\n' % LastRecord.__str__())
    out_file.flush()
    descriptors = [
        {
            'name' : '%s_success' % NamePrefix,
            'call_back': named_stats_handler,
            'time_max': 60,
            'value_type': 'uint32',
            'units': 'ticks',
            'slope': 'positive',
            'format': '%u',
            'description': 'Successful Queries',
            'groups': NamePrefix
        },
        {
            'name' : '%s_nxrrset' % NamePrefix,
            'call_back': named_stats_handler,
            'time_max': 60,
            'value_type': 'uint32',
            'units': 'ticks',
            'slope': 'positive',
            'format': '%u',
            'description': 'Dunno',
            'groups': NamePrefix
        },
        {
            'name' : '%s_referral' % NamePrefix,
            'call_back': named_stats_handler,
Re: [Ganglia-general] Monitor Apache
On 12/11/2008 at 11:33 AM, in message 49415cf1.1010...@greenberg.org, Ed Greenberg e...@greenberg.org wrote:

Michael Henderson wrote:

Hello all, Is there a way to monitor apache through ganglia? Thanks, ~Mike

I'm interested in seeing what others say but... I rolled my own as follows:

1. I turned on server-status and ExtendedStatus.
2. I wrote a script to retrieve http://localhost/server-status?auto which returns this:

Total Accesses: 123048
Total kBytes: 1445888
CPULoad: .270592
Uptime: 20688
ReqPerSec: 5.9478
BytesPerSec: 71567.5
BytesPerReq: 12032.6
BusyWorkers: 4
IdleWorkers: 71
Scoreboard:

I wrote a script to use gmetric to pump the values above (those I want) into ganglia. (Available on request.)

We did the same thing except that, rather than using gmetric, we implemented it as a python module. It was very simple to do; it is basically just a command line call to get the apache status output, followed by parsing the output into the various metrics.

Brad
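The parsing half of either approach is small. Here is a minimal sketch, assuming the server-status?auto output has the "Key: value" shape shown in Ed's message; `parse_server_status` is a hypothetical helper name, and neither Ed's script nor Brad's module was posted, so this is not their code. Fetching the page and handing the values to gmetric or to module callbacks is left out.

```python
# Sample taken from the message above; the Scoreboard value was not shown.
SAMPLE_STATUS = """Total Accesses: 123048
Total kBytes: 1445888
CPULoad: .270592
Uptime: 20688
ReqPerSec: 5.9478
BytesPerSec: 71567.5
BytesPerReq: 12032.6
BusyWorkers: 4
IdleWorkers: 71
Scoreboard:
"""


def parse_server_status(text):
    """Turn 'Key: value' lines into a dict of floats, skipping non-numeric lines."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.partition(':')
        if not sep:
            continue
        try:
            metrics[key.strip()] = float(value.strip())
        except ValueError:
            # Non-numeric fields such as the scoreboard string are ignored.
            continue
    return metrics


status = parse_server_status(SAMPLE_STATUS)
print(sorted(status.keys()))
```

In a python module, each key of the returned dict could back one descriptor whose callback looks the value up by metric name.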
[Ganglia-general] Spoofing functionality in 3.1.x branch...
For those that are interested in the module based spoofing feature, all of the functionality should be complete and has been backported to the 3.1.x branch. I have also added some spoofing module examples to trunk that can be downloaded from monitor-core/gmond/python_modules/example/spfexample.py in the trunk repository. There is also a small .pyconf file in monitor-core/gmond/python_modules/conf.d/spfexample.pyconf. This example module should give you enough guidance so that you can build your own spoofing module. Please let me know if anything is missing. Brad
[Ganglia-general] Testing BETA 3.1.x available...
There is a new BETA tarball and RPMs on the Ganglia testing site (http://www.ganglia.info/testing/). The following is a list of enhancements and bug fixes that are currently available in this testing BETA release.

* gmond/gmetad: Sync up the default values for the cluster section of gmond with the default gmond.conf so that a cluster name will always be present. The gmetad code cannot handle a host with no associated cluster, therefore the gmond code must always include a cluster XML tag. Bug #200
* gmond: Add an 'enabled' directive to the module section so that a module can easily be enabled or disabled through the configuration file
* gmond: reformat memory metrics to match pre 3.1 style
* gmond: -r support for transforming 2.5 configurations
* gmond: add boolean option to 'allow_extra_data' generation (BUG199)
* gmond: include localhost in translated (-r) trusted_hosts from 2.5
* gmetad: skip unresponsive sources (BUG92)
* libganglia: mcast_if support in gmond (BUG140)
* web: add boolean option for using hostname without domainname for graphs
* web: add host attributes into metric list (BUG30)
* web: metric group enhancements for host view (BUG203)
* web: add option for configurable number of columns in cluster view (BUG194)
* web: make number of metric columns in host view configurable (BUG194)
* Allow both a C and python module to create a metric that will spoof a specific host. This provides the same spoofing functionality as gmetric but through a metric module. It is done by adding SPOOF_HOST and SPOOF_NAME as extra metadata to the metric description
* gmond: mod_python support for versions older than 2.3 or newer than 2.4
* mod_python: Change the way that the python module path is added to better support the Solaris platform. It is also a cleaner way to add the python path programmatically rather than altering the PYTHONPATH environment variable.
* gmetric: Support the short commandline parameter format when spoofing a heartbeat metric.
* Bug fixes and Enhancements

This is just a testing BETA release that is not yet ready for production use. Please test this code and let us know if we have missed or broken anything.

Thanks,
Brad
Re: [Ganglia-general] nonresponsive gmond
On 11/29/2008 at 11:54 AM, in message [EMAIL PROTECTED], Kostas Georgiou [EMAIL PROTECTED] wrote:

On Tue, Nov 04, 2008 at 10:02:32AM -0700, Brad Nicholes wrote:

On 11/3/2008 at 5:27 PM, in message [EMAIL PROTECTED], Kostas Georgiou [EMAIL PROTECTED] wrote:

On Mon, Nov 03, 2008 at 11:46:52PM +, Kostas Georgiou wrote:

On Mon, Nov 03, 2008 at 01:55:22PM -0700, Brad Nicholes wrote:

If a timeout is set, then is the resulting XML output still good or did we lose something because of the timeout?

No, it seems to be working fine. I am testing with:

Actually I was wrong; there was enough data in the socket buffers to confuse me. The XML output is truncated in the slow reader :(

Attached is a patch against trunk which implements a lingering close. I am not sure if this will solve the problem, but Apache does a similar thing to make sure that both sides get a chance to complete the conversation before closing the socket. Apply this patch, let it run for a while and let's see if this solves the problem.

I just got a non-responsive gmond, and after looking at the network traces it seems that gmond tries to write to what it thinks is a still-alive connection, so it is blocked there. On the gmetad side there is no such connection, so the firewall replies with ICMP host foo unreachable - admin prohibited. Unfortunately this doesn't cause the connection to be dropped on the gmond side (will anything other than RST work at this point?) and gmond keeps trying. At this point it's too late to tell why the connection wasn't closed properly (was the FIN packet lost somehow?), but using a short keepalive setting on the gmond side cannot hurt and will help in cases like this one.

This is the real question. Who initiated the close and why? This would be much easier to debug if we could somehow figure out how to reproduce the problem reliably.
Since I am not able to reproduce this problem, I am wondering if it might have something to do with the OS or the version of the socket library the OS is using. We have heard of this happening on CentOS and Solaris. Is there anything in common about the socket libraries between these two OSs? I guess the band-aid would be to put in a timeout and abort on gmond's write function.

Brad
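The two band-aids discussed in this thread (a short keepalive, and a timeout that aborts a blocked write) can be illustrated outside gmond. The following is a minimal Python sketch of the idea only; gmond itself is C on top of APR, and this is not the patch from the thread. The function names and the default values are assumptions chosen for the example, and TCP_KEEPIDLE is only set on platforms where Python exposes it.

```python
import socket


def configure_writer_socket(sock, send_timeout=10.0, keepalive_idle=30):
    """Enable TCP keepalive and bound how long a blocked send may take."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, 'TCP_KEEPIDLE'):
        # Start probing after keepalive_idle seconds of silence (Linux).
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, keepalive_idle)
    sock.settimeout(send_timeout)


def send_or_abort(sock, payload):
    """Send the payload; on timeout or socket error, close and report failure."""
    try:
        sock.sendall(payload)
        return True
    except (socket.timeout, socket.error):
        # The peer is gone or not reading: give up instead of blocking forever.
        sock.close()
        return False
```

With this shape, a writer stuck on a half-dead connection fails after send_timeout seconds and can go back to serving other clients, which is the behaviour the timeout-and-abort suggestion is after.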
Re: [Ganglia-general] gmetric fails when disk is unwriteable?
On 11/26/2008 at 3:45 AM, in message [EMAIL PROTECTED], Martin Knoblauch [EMAIL PROTECTED] wrote:

- Original Message
From: Brad Nicholes [EMAIL PROTECTED]
To: Ofer Inbar [EMAIL PROTECTED]
Cc: ganglia-general@lists.sourceforge.net
Sent: Tuesday, November 25, 2008 8:43:08 PM
Subject: Re: [Ganglia-general] gmetric fails when disk is unwriteable?

On 11/25/2008 at 10:14 AM, in message [EMAIL PROTECTED], Ofer Inbar wrote:

Brad Nicholes wrote:

It needs a temp directory to get around some issues with libconfuse. Libconfuse doesn't actually support wildcard paths or files. A libconfuse include statement must have a full path to the file that it is going to include. So gmond makes up for this problem by creating a temp file, resolving all of the file paths and names and then writing them as separate includes in a temp file. Then it tells libconfuse to include the temp file directly. Without the ability to resolve the wildcard paths and write them to a temp file, the wildcarding feature of gmond wouldn't work. To solve the problem that you are describing, we would have to actually add wildcard capability to libconfuse.

Might this be a cleaner workaround that would work for gmond as well?
- override libconfuse's include function as you're already doing
- resolve file paths and names as you're already doing
- instead of writing that to a temp file and telling libconfuse to include that file, just tell libconfuse to include each individual file (the same filenames you're now writing to the temp file)

No, libconfuse doesn't work that way. The include handler can only manipulate the file path that it is handed. So the result of the handler has to be a single absolute file path. There isn't any way to take a single file path as input into the handler and return multiple file paths back to libconfuse. The only way to do it was to write all of the individual file paths to a file and then hand libconfuse back a single file path to the new include file.
The question is: can't the handler be rewritten to do the conversions in memory, without needing to write a temp file? This would make the process more robust. You never know when a disk is full, or goes RO.

No, I tried doing that already but was unsuccessful. Libconfuse is limited in what you can do in this area. The problem is that when libconfuse wants to read in the include file, it is in the middle of the lexer and needs to continue. A handler can't just read the file and hand it back to libconfuse through some other cfg_* call. This may be a design flaw in libconfuse, but it is the way it works now and we have to live with it.

Brad
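The temp-file workaround described in this thread can be sketched in a few lines. This is a Python illustration of the idea, not the C code in gmond; `expand_wildcard_include` is a hypothetical name, and the exact include syntax written to the temp file is an assumption modelled on gmond.conf.

```python
import glob
import os
import tempfile


def expand_wildcard_include(pattern):
    """Resolve a wildcard include into a temp file of single-file includes.

    Returns the path of the temp file, which is the one absolute path that
    can be handed back to a parser that only accepts a single include file.
    """
    paths = sorted(glob.glob(pattern))
    # mkstemp raises OSError when the temp directory is full or read-only,
    # which is the failure mode reported for gmetric in this thread.
    fd, tmp_path = tempfile.mkstemp(suffix='.conf')
    with os.fdopen(fd, 'w') as tmp:
        for path in paths:
            tmp.write("include ('%s')\n" % os.path.abspath(path))
    return tmp_path
```

The design constraint is visible in the return value: the handler takes one path in and must give exactly one path back, so the list of matches has to live in a file rather than in memory.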
Re: [Ganglia-general] gmetric fails when disk is unwriteable?
On 11/26/2008 at 1:17 AM, in message [EMAIL PROTECTED], Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote: On Tue, Nov 25, 2008 at 04:33:05PM -0700, Brad Nicholes wrote: The result was that if the wildcard produced more than 10 included files (which it easily does even in our default configuration), libconfuse choked because it thought it had hit the maximum nesting level our RPMs for ganglia only install 3 files in /etc/ganglia/conf.d; gentoo has 2 and fedora 10 (just released) has 4. even if I agree that 10 is somehow low and you would expect that as more modules are deployed it will be soon problematic, it would seem that at least in this case, one problem was traded for another. The fact is that 10 is low which is why I discovered that last year when I implemented the wildcard path support. In our case we routinely run with 20+ modules and configure them using separately included .conf files so that each one can be easily turned on or off by simply renaming the included .conf file. This is a very valuable feature which isn't unique to ganglia. Limiting this very useful feature now in gmond on the remote chance that a file system might go read only and cause an issue for gmetric, isn't a very good trade off. It isn't that one problem was traded for another. At the time when I implemented the code to support wildcard paths, nobody knew anything about gmetric not being able to run in a read only file system. There was no trade off begin made. The fact is that whether or not gmetric is able to run in a read only file system is a much smaller issue than allowing gmond or gmetric to run in an undetermined state because the code allows parts of the configuration to be ignored. Introducing a patch that knowingly ignores parts of the configuration due to errors in the file system is unacceptable behavior. The bug that this kind of patch introduces is much larger than an issue with gmetric not being able to run in a read only environment. 
Gmond being able to resolve wildcard paths is a standard feature and behavior that is used every day, gmetric being able to run in a read only file system is not. The real issue is why did the disk go read only. There are plenty of gmond metrics that provide the administrator with warnings and a metric module that indicates when a file system has gone read only is extremely easy to write. A more acceptable solution to the gmetric problem is to provide gmetric with its own .conf file that contains just the socket and port information rather than pointing gmetric at gmond.conf. In this case both gmond and gmetric will continue to run even in a read only file system. This solution can be easily implemented today without any code changes and especially without a code patch that introduces a much more serious bug. If we need to solve the gmetric being able to run in a read only file system, then we need to come up with a better patch. Crippling gmond and gmetric with a patch that allows them to ignore a fatal error because parts of the configuration was skipped, is not an acceptable patch. Brad - This SF.Net email is sponsored by the Moblin Your Move Developer's challenge Build the coolest Linux based applications with Moblin SDK win great prizes Grand prize is a trip for two to an Open Source event anywhere in the world http://moblin-contest.org/redirect.php?banner_id=100url=/ ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
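As a rough illustration of the separate-.conf suggestion above, here is a minimal Python sketch that builds a config containing only a send channel and a gmetric command line pointing at it. The helper names, the file path, and the multicast address and port used in the usage note are assumptions for illustration, not values taken from this thread; --conf is gmetric's option for selecting a config file.

```python
def minimal_gmetric_conf(mcast_join, port):
    """Return config text holding only a send channel (no include lines)."""
    return (
        "udp_send_channel {\n"
        f"  mcast_join = {mcast_join}\n"
        f"  port = {port}\n"
        "}\n"
    )


def gmetric_command(conf_path, name, value, metric_type, units):
    """Build an argv list that points gmetric at the small config file."""
    return [
        "gmetric",
        f"--conf={conf_path}",
        "--name", name,
        "--value", str(value),
        "--type", metric_type,
        "--units", units,
    ]
```

For example, gmetric_command("/etc/ganglia/gmetric.conf", "net_smtp_fin_wait2_out", 0, "uint8", "connections") gives a command that never touches the wildcard include in gmond.conf, so no temp directory is needed.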
Re: [Ganglia-general] gmetric fails when disk is unwriteable?
On 11/25/2008 at 1:08 AM, in message [EMAIL PROTECTED], Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote: On Mon, Nov 24, 2008 at 04:55:42PM -0700, Brad Nicholes wrote: On 11/24/2008 at 3:47 PM, in message [EMAIL PROTECTED], Ofer Inbar [EMAIL PROTECTED] wrote: I tried feeding one of my custom metrics by hand: [root ~]$ gmetric --name net_smtp_fin_wait2_out --value 0 --type uint8 --units 'connections' /etc/ganglia/gmond.conf:94: failed to determine the temp dir Parse error for '/etc/ganglia/gmond.conf' It needs a temp directory to get around some issues with libconfuse. gmond does; gmetric doesn't need anything more than to know which channel to use (hence nothing in the includes) and it is getting blocked by this restriction because of its use of libganglia to read gmond's configuration through libgmond. Anything can be included from the main gmond.conf file. There is nothing that says that a user can't put socket and channel information in a separate file and then include it from gmond.conf. So making the assumption that gmetric doesn't need includes is false. If this is a real problem for users, then gmetric should be using a different .conf file that only contains the socket information rather than using the same gmond.conf file that contains all of the metric information and includes. Also, both gmond and gmetric both use the same code path for resolving the configuration, so if the code is changed to ignore configuration failures for gmetric, it is also changed to ignore configuration failures for gmond. This isn't a good thing. This problem doesn't require a code change to be resolved. Simple documentation for gmetric would solve the problem. To solve the problem that you are describing, we would have to actually add wildcard capability to libconfuse. libconfuse is instructed to use our implementation for includes and that uses a temporary file, so this is fixable in our code. 
a fix to the problem reported by Ofer only needs our handler modified so that failures to create temporary files to handle includes are not treated as fatal as Committed revision 1922 No, libconfuse doesn't work that way. The include handler only allows gmond to manipulate the input into a form that libconfuse can handle. In this case the input is a single wildcard file path that needs to be translated into a single absolute file path. libconfuse can not handle wild card paths. Also libconfuse only knows how to get its input from a file. The gmond include handler is only manipulating the wildcard path into an absolute path to a file that contains all of the resolved paths. At that point libconfuse is able to read and process all of the included files through absolute paths. The include handler has nothing to do with just translating a single wildcard path into multiple absolute paths and then handing them back to libconfuse in memory. These include paths have to be written to a file first and then libconfuse has to be told where the new file is. This problem can't be fixed by just changing the include handler, otherwise I would have done it that way. Revision 1922 currently breaks the configuration file handling and needs to be reverted. Brad - This SF.Net email is sponsored by the Moblin Your Move Developer's challenge Build the coolest Linux based applications with Moblin SDK win great prizes Grand prize is a trip for two to an Open Source event anywhere in the world http://moblin-contest.org/redirect.php?banner_id=100url=/ ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] gmetric fails when disk is unwriteable?
On 11/25/2008 at 10:14 AM, in message [EMAIL PROTECTED], Ofer Inbar [EMAIL PROTECTED] wrote: Brad Nicholes [EMAIL PROTECTED] wrote: It needs a temp directory to get around some issues with libconfuse. Libconfuse doesn't actually support wildcard paths or files. A libconfuse include statement must have a full path to the file that it is going include. So gmond makes up for this problem by creating a temp file, resolving all of the file paths and names and then writing them as separate includes in a temp file. Then it tells libconfuse to include the temp file directly. Without the ability to resolve the wildcard paths and write them to a temp file, the wildcarding feature of gmond wouldn't work. To solve the problem that you are describing, we would have to actually add wildcard capability to libconfuse. Might this be cleaner workaround that would work for gmond as well? - override libconfuse's include function as you're already doing - resolve file paths and names as you're already doing - instead of writing that to a temp file and telling libconfuse to include that file, just tell libconfuse to include each individual file (the same filenames you're now writing to the temp file) No, libconfuse doesn't work that way. The include handler can only manipulate the file path that it is handed. So the result of the handler has to be a single absolute file path. There isn't any way to take a single file path as input into the handler and return multiple file paths back to libconfuse. The only way to do it was to write all of the individual file paths to a file and then hand libconfuse back a single file path to the new include file. 
Brad
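The workaround described in this message (one wildcard path in, one absolute path to a generated include file out) can be sketched as follows. This is a Python illustration of the idea only; gmond's real handler is C code written against libconfuse, and the function name here is hypothetical.

```python
import glob
import os
import tempfile


def resolve_wildcard_include(pattern):
    """Expand a wildcard and return the path of a temp file that lists
    every match as its own include line."""
    matches = sorted(glob.glob(pattern))
    # mkstemp raises an error when no writable temp directory exists,
    # which is the situation a read-only filesystem produces.
    fd, tmp_path = tempfile.mkstemp(suffix=".conf")
    with os.fdopen(fd, "w") as tmp:
        for path in matches:
            tmp.write(f"include ('{os.path.abspath(path)}')\n")
    # The parser is handed this single path, never the list of matches.
    return tmp_path
```

The single return value mirrors the constraint stated above: the handler can only give the parser one file path, so the resolved paths have to live in a file.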
Re: [Ganglia-general] gmetric fails when disk is unwriteable?
On 11/25/2008 at 10:14 AM, in message [EMAIL PROTECTED], Ofer Inbar [EMAIL PROTECTED] wrote: Brad Nicholes [EMAIL PROTECTED] wrote: It needs a temp directory to get around some issues with libconfuse. Libconfuse doesn't actually support wildcard paths or files. A libconfuse include statement must have a full path to the file that it is going include. So gmond makes up for this problem by creating a temp file, resolving all of the file paths and names and then writing them as separate includes in a temp file. Then it tells libconfuse to include the temp file directly. Without the ability to resolve the wildcard paths and write them to a temp file, the wildcarding feature of gmond wouldn't work. To solve the problem that you are describing, we would have to actually add wildcard capability to libconfuse. Might this be cleaner workaround that would work for gmond as well? - override libconfuse's include function as you're already doing - resolve file paths and names as you're already doing - instead of writing that to a temp file and telling libconfuse to include that file, just tell libconfuse to include each individual file (the same filenames you're now writing to the temp file) At one point I had tried to do exactly what is being suggested here. See revision http://ganglia.svn.sourceforge.net/viewvc/ganglia?view=revrevision=813 The problem that I ran into was that libconfuse thought that each call to cfg_include() meant that the include was nested deeper rather than at the same level. The result was that if the wildcard produced more than 10 included files (which it easily does even in our default configuration), libconfuse choked because it thought it had hit the maximum nesting level even through we were still at a nesting level of one. 
Brad
Re: [Ganglia-general] gmetric fails when disk is unwriteable?
On 11/21/2008 at 9:33 PM, in message [EMAIL PROTECTED], Ofer Inbar [EMAIL PROTECTED] wrote: One of our servers encountered an I/O error that put its root filesystem into read only mode. Both /var and /tmp are on that filesystem, so all logging stopped and most everything stopped. However, gmond kept on running, and reporting metrics. Great! This is yet another way in which Ganglia wins over most other monitoring systems that involve scripts that write things to disk or otherwise depend on things (such as ssh logins) that need to write to disk. However, a program I have that feeds custom metrics to gmond via gmetric stopped working when the / filesystem went read-only. I tried running it in debug mode, and got this error: /etc/ganglia/gmond.conf:94: failed to determine the temp dir Parse error for '/etc/ganglia/gmond.conf' Line 94 of gmond.conf is: include ('/etc/ganglia/conf.d/*.conf') We've never had an /etc/ganglia/conf.d directory, it always ignores that. I tried feeding one of my custom metrics by hand: [root ~]$ gmetric --name net_smtp_fin_wait2_out --value 0 --type uint8 --units 'connections' /etc/ganglia/gmond.conf:94: failed to determine the temp dir Parse error for '/etc/ganglia/gmond.conf' Then, I cd'ed over to a filesystem that is still in read/write mode: [root /otherfilesys]$ gmetric --name net_smtp_fin_wait2_out --value 0 --type uint8 --units 'connections' No error, and it worked. What's the dependency that causes gmetric to require that the filesystem the CWD is on be writeable? Does it really need that dependency? It's great that Ganglia is so robust in the face of failures, but it'd be even better if gmetric were also as robust. -- Cos Both gmetric and gmond read the same .conf file. If the .conf file has an include() statement that specifies a wildcard file path, processing the wildcard path requires a temp directory. If you aren't loading any files from the wildcard include path (ie. 
/etc/gmond/conf.d/*) then just remove the include statement from the .conf file and everything should work fine in a readonly environment. The reason why gmond kept running but you had problems with gmetric is because gmond had already processed the wildcard path before the filesystem switched to readonly. Every time gmetric starts, it needs to re-read the .conf and process the wildcard path. Brad
Re: [Ganglia-general] gmetric fails when disk is unwriteable?
On 11/24/2008 at 3:47 PM, in message [EMAIL PROTECTED], Ofer Inbar [EMAIL PROTECTED] wrote: I tried feeding one of my custom metrics by hand: [root ~]$ gmetric --name net_smtp_fin_wait2_out --value 0 --type uint8 --units 'connections' /etc/ganglia/gmond.conf:94: failed to determine the temp dir Parse error for '/etc/ganglia/gmond.conf' Then, I cd'ed over to a filesystem that is still in read/write mode: [root /otherfilesys]$ gmetric --name net_smtp_fin_wait2_out --value 0 --type uint8 --units 'connections' No error, and it worked. What's the dependency that causes gmetric to require that the filesystem the CWD is on be writeable? Does it really need that dependency? It's great that Ganglia is so robust in the face of failures, but it'd be even better if gmetric were also as robust. Someone wrote me to suggest running it with strace, which is an obvious thing to do but unfortunately I didn't think of it at the time of the failure (it was late at night). However, Brad knows the answer: Brad Nicholes [EMAIL PROTECTED] wrote: Both gmetric and gmond read the same .conf file. If the .conf file has an include() statement that specifies a wildcard file path, processing the wildcard path requires a temp directory. If you Removing the wildcard doesn't seem ideal, since it's something one might want to use and it's part of the standard config, so removing it and then forgetting that seems like a likely cause of confusion. Also, most people would never think to investigate something that's in the supplied conf file and doesn't seem to cause harm. If we want robustness in the face of failure, having gmetric and gmond able to run without having to write to disk sounds like a better goal. Is it doable? Why does it need to write to a temp directory to process a wildcard? Are there any other parts of gmond or gmetric that depend on being able to write to disk? 
It seems that both of these programs should be able to avoid writing to disk entirely (except for swap/paging space on a memory-starved host). -- Cos It needs a temp directory to get around some issues with libconfuse. Libconfuse doesn't actually support wildcard paths or files. A libconfuse include statement must have a full path to the file that it is going include. So gmond makes up for this problem by creating a temp file, resolving all of the file paths and names and then writing them as separate includes in a temp file. Then it tells libconfuse to include the temp file directly. Without the ability to resolve the wildcard paths and write them to a temp file, the wildcarding feature of gmond wouldn't work. To solve the problem that you are describing, we would have to actually add wildcard capability to libconfuse. Brad - This SF.Net email is sponsored by the Moblin Your Move Developer's challenge Build the coolest Linux based applications with Moblin SDK win great prizes Grand prize is a trip for two to an Open Source event anywhere in the world http://moblin-contest.org/redirect.php?banner_id=100url=/ ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] High system load when gmond is running
On 11/13/2008 at 4:08 PM, in message [EMAIL PROTECTED], [EMAIL PROTECTED] wrote: I looked into it further and it looks like my problem isn't gmond its gmetad. If I just have gmond running without gmetad the system load is normal but as soon as I start gmetad the load starts to go up. I ran valgrind gmetad -d 1 and it looks like I have a memory leak in gmetad. Does anyone know what I could try to fix the memory leak with gmetad? -Peter ==19924== Memcheck, a memory error detector. ==19924== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al. ==19924== Using LibVEX rev 1658, a library for dynamic binary translation. ==19924== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP. ==19924== Using valgrind-3.2.1, a dynamic binary instrumentation framework. ==19924== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al. ==19924== For more details, rerun with: -v ==19924== Sources are ... Source: [my cluster, step 15] has 1 sources 192.168.1.254 tcp_listen() on xml_port failed: Address already in use ==19924== ==19924== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 4 from 1) ==19924== malloc/free: in use at exit: 10,118 bytes in 101 blocks. ==19924== malloc/free: 170 allocs, 69 frees, 20,163 bytes allocated. ==19924== For counts of detected errors, rerun with: -v ==19924== searching for pointers to 101 not-freed blocks. ==19924== checked 208,416 bytes. ==19924== ==19924== LEAK SUMMARY: ==19924== definitely lost: 465 bytes in 4 blocks. ==19924== possibly lost: 0 bytes in 0 blocks. ==19924== still reachable: 9,653 bytes in 97 blocks. ==19924== suppressed: 0 bytes in 0 blocks. ==19924== Use --leak-check=full to see details of leaked memory. I guess the more obvious question is, why are you getting: tcp_listen() on xml_port failed: Address already in use What else is listening on the ports that gmetad is trying to open? These ports by default are 8651 and 8652. 
Do you have another gmetad running on the same box or is some other service configured to listen on these same ports? Brad
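A quick way to check whether something already holds one of those ports is to try binding it. The following is a small Python sketch; the function name and the loopback default are assumptions, and the ports in the usage note are the gmetad defaults named above.

```python
import errno
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if binding host:port fails with 'address already in use'."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # some other problem, e.g. missing permissions
    return False
```

Running port_in_use(8651) and port_in_use(8652) before starting gmetad shows whether a second gmetad or another service is already listening there.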
Re: [Ganglia-general] cluster graphing stops entirely after mastergmond restart
On 11/10/2008 at 6:11 PM, in message [EMAIL PROTECTED], Ofer Inbar [EMAIL PROTECTED] wrote: Brad Nicholes [EMAIL PROTECTED] wrote: The reason why is because with the introduction of the modular metric functionality, metric metadata is now passed between gmonds rather than it being hardcoded into every gmond. In multicast mode, if you restart the master gmond, it has to request from and wait for each sub-gmond that is listening on the same multicast channel, to respond with its metadata for each metric it supports. Depending on the reporting interval for a collection group, this could take anywhere from a few seconds to several minutes. In unicast mode the However, as we discovered a couple of months ago, requesting new metadata doesn't always work as designed, and it can take hours to get everything it needs. Restarting the *other* gmonds in the cluster is sort of a workaround, because each one will send its metadata when it is restarted. This way, you can at least ensure that the least recently restarted gmond knows everything. -- Cos That's interesting. I would like to investigate this further since I haven't seen the same problem. In my testing (granted, I don't have very large clusters to test with), the complete metadata resync has only taken as long as the longest collection_group interval (i.e. the time_threshold value). If you don't have any collection_groups that have a time_threshold on the order of hours, then there is something we need to investigate further. It will just be a little more difficult because I can't duplicate the delay in my testing environment. BTW, another workaround, if you don't mind the additional UDP traffic, is to set the send_metadata_interval value anyway, even in multicast mode. All it will do is ensure that the metadata is sent on an interval rather than just on request. This might be a good idea if you are restarting gmonds very often.
Brad
Re: [Ganglia-general] cluster graphing stops entirely after mastergmond restart
On 11/10/2008 at 3:26 PM, in message [EMAIL PROTECTED], Brad Fino [EMAIL PROTECTED] wrote: If I restart gmond on the master node that a cluster reports to, the entire cluster stops graphing entirely. Some nodes in the cluster start graphing immediately after a node gmond restart, and some do not. Some graph partial statistics. It usually takes 2-3 restarts to get the entire cluster graphing properly again. Even if I leave the nodes be for 5,10,30 minutes they don't start graphing again until a gmond restart on the node. I don't remember this behavior in pre 3.1.0 gmond / gmetad. In 3.0.6 if I restarted a master gmond then the cluster would just pick right up again; here it just flat stops graphing. The nodes aren't reported as being offline. The old stats and metrics are still all there. It just stops graphing new data. The reason is that, with the introduction of the modular metric functionality, metric metadata is now passed between gmonds rather than being hardcoded into every gmond. In multicast mode, if you restart the master gmond, it has to request from and wait for each sub-gmond that is listening on the same multicast channel to respond with its metadata for each metric it supports. Depending on the reporting interval for a collection group, this could take anywhere from a few seconds to several minutes. In unicast mode the global directive send_metadata_interval must be set to something greater than 0. The value of this directive is the interval in seconds at which gmond will send its metric metadata to the master gmond. Brad
Re: [Ganglia-general] nonresponsive gmond
On 11/3/2008 at 5:27 PM, in message [EMAIL PROTECTED], Kostas Georgiou [EMAIL PROTECTED] wrote: On Mon, Nov 03, 2008 at 11:46:52PM +, Kostas Georgiou wrote: On Mon, Nov 03, 2008 at 01:55:22PM -0700, Brad Nicholes wrote: If a timeout is set, then is the resulting XML output still good or did we lose something because of the timeout? No, it seems to be working fine. I am testing with: Actually I was wrong there was enough data in the socket buffers to confuse me. The xml output is truncated in the slow reader :( Attached is a patch against trunk which implements a lingering close. I am not sure if this will solve the problem but Apache does a similar thing to make sure that both sides get a chance to complete the conversation before closing the socket. Apply this patch, let it run for a while and let's see if this solves the problem. Brad

Index: gmond.c
===================================================================
--- gmond.c (revision 1883)
+++ gmond.c (working copy)
@@ -1498,6 +1498,76 @@
   return apr_socket_send(client, "</HOST>\n", &len);
 }
 
+/* we now proceed to read from the client until we get EOF, or until
+ * MAX_SECS_TO_LINGER has passed. the reasons for doing this are
+ * documented in a draft:
+ *
+ * http://www.ics.uci.edu/pub/ietf/http/draft-ietf-http-connection-00.txt
+ *
+ * in a nutshell -- if we don't make this effort we risk causing
+ * TCP RST packets to be sent which can tear down a connection before
+ * all the response data has been sent to the client.
+ */
+#define MAX_SECS_TO_LINGER 30
+#define SECONDS_TO_LINGER 2
+void lingering_close(apr_socket_t *csd)
+{
+    char dummybuf[512];
+    apr_size_t nbytes;
+    apr_time_t timeup = 0;
+
+    if (!csd) {
+        return;
+    }
+
+#ifdef NO_LINGCLOSE
+    ap_flush_conn(c); /* just close it */
+    apr_socket_close(csd);
+    return;
+#endif
+
+    /* Close the connection, being careful to send out whatever is still
+     * in our buffers. If possible, try to avoid a hard close until the
+     * client has ACKed our FIN and/or has stopped sending us data.
+     */
+
+    /* Shut down the socket for write, which will send a FIN
+     * to the peer.
+     */
+    if (apr_socket_shutdown(csd, APR_SHUTDOWN_WRITE) != APR_SUCCESS)
+    {
+        apr_socket_close(csd);
+        return;
+    }
+
+    /* Read available data from the client whilst it continues sending
+     * it, for a maximum time of MAX_SECS_TO_LINGER. If the client
+     * does not send any data within 2 seconds (a value pulled from
+     * Apache 1.3 which seems to work well), give up.
+     */
+    apr_socket_timeout_set(csd, apr_time_from_sec(SECONDS_TO_LINGER));
+    apr_socket_opt_set(csd, APR_INCOMPLETE_READ, 1);
+
+    /* The common path here is that the initial apr_socket_recv() call
+     * will return 0 bytes read; so that case must avoid the expensive
+     * apr_time_now() call and time arithmetic. */
+
+    do {
+        nbytes = sizeof(dummybuf);
+        if (apr_socket_recv(csd, dummybuf, &nbytes) || nbytes == 0)
+            break;
+
+        if (timeup == 0) {
+            /* First time through; calculate now + 30 seconds. */
+            timeup = apr_time_now() + apr_time_from_sec(MAX_SECS_TO_LINGER);
+            continue;
+        }
+    } while (apr_time_now() < timeup);
+
+    apr_socket_close(csd);
+    return;
+}
+
 static void
 process_tcp_accept_channel(const apr_pollfd_t *desc, apr_time_t now)
 {
@@ -1584,8 +1654,9 @@
   /* Close down the accepted socket */
  close_accept_socket:
-  apr_socket_shutdown(client, APR_SHUTDOWN_READ);
-  apr_socket_close(client);
+  lingering_close(client);
+  //apr_socket_shutdown(client, APR_SHUTDOWN_READ);
+  //apr_socket_close(client);
   apr_pool_destroy(client_context);
 }
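For readers who want to try the lingering-close idea outside of gmond, here is a minimal sketch of the same technique using Python's socket module. It is not the patch itself: it borrows the two timeout constants from the patch, and the parameter names are assumptions.

```python
import socket
import time

MAX_SECS_TO_LINGER = 30   # upper bound on the whole linger phase
SECONDS_TO_LINGER = 2     # give up if the peer is idle this long


def lingering_close(conn, max_linger=MAX_SECS_TO_LINGER,
                    idle_timeout=SECONDS_TO_LINGER):
    """Half-close, drain whatever the peer still sends, then close."""
    try:
        conn.shutdown(socket.SHUT_WR)   # sends FIN, read side stays open
    except OSError:
        conn.close()
        return
    conn.settimeout(idle_timeout)
    deadline = time.monotonic() + max_linger
    try:
        while time.monotonic() < deadline:
            if not conn.recv(512):      # empty read: peer has closed
                break
    except OSError:                     # covers the idle timeout as well
        pass
    finally:
        conn.close()
```

The point of draining before the final close is the one given in the patch comment: closing a socket that still has unread input can trigger a TCP RST, which may discard response data the client has not read yet.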
Re: [Ganglia-general] ganglia vs. top: running processes
On 10/24/2008 at 7:57 AM, in message [EMAIL PROTECTED], Ofer Inbar [EMAIL PROTECTED] wrote: Recently we noticed something we don't know the explanation for, on a CentOS4 for system running gmond 3.1.0: The Ganglia graph shows a line for running processes that sometimes spikes to 10, 20, or higher, and often stays at that high level for several samples in a row; but top reports that processes running is 1-4 and occasionally 5 or 6, but never shows a number as high as 10 even if we watch it for a while, while the graph updates and clearly shows spikes. We found one cause of those spikes and fixed it, reducing their size and number, so we don't think Ganglia is fabricating it. It's showing something that's really happening. But why the disagreement with top? One reason for this is probably because Ganglia specifically looks for spikes and then reports them. This is what the value_threshold directive in the .conf file is for. If the delta between the previous value and the current value ever spikes beyond a specified percentage, then gmond will immediately send the values for the collection_group that the metric belongs to. This ensures that unexpected spikes are not missed. I'm not sure what top is doing in this same case. It may be ignoring spikes and simply displaying an average. Brad - This SF.Net email is sponsored by the Moblin Your Move Developer's challenge Build the coolest Linux based applications with Moblin SDK win great prizes Grand prize is a trip for two to an Open Source event anywhere in the world http://moblin-contest.org/redirect.php?banner_id=100url=/ ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
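The send decision described above can be sketched roughly as follows, in Python. This is an assumption-laden illustration, not gmond's code: the function name is hypothetical, and comparing the raw delta against value_threshold (rather than a percentage) is a choice made for the sketch. time_threshold is the regular reporting interval of the collection_group.

```python
def should_send(previous, current, value_threshold,
                seconds_since_send, time_threshold):
    """Decide whether a collection group's values go out now."""
    # Regular report: the interval has elapsed.
    if seconds_since_send >= time_threshold:
        return True
    # Early report: the value jumped by more than the threshold.
    return abs(current - previous) > value_threshold
```

With a threshold of 5, a jump in running processes from 3 to 12 is sent immediately, while a move from 3 to 4 waits for the next regular interval. A tool that only samples on a fixed refresh would be less likely to catch the short spike.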
Re: [Ganglia-general] About python module for gmond (multiple metricsper one single call back? dynamic pyconf?)
On 10/20/2008 at 11:19 PM, in message [EMAIL PROTECTED], utopia zh [EMAIL PROTECTED] wrote:
Hi, I've recently been working on the gmond python module. I found that for some metrics, it would be beneficial if we could return multiple metric values in a single callback. For example, if we want to get usage information about disks (total, used, free), we can get these values via a single statvfs call, but in order to send these metrics out using the python module, we need to call statvfs 3 times. There may be some more time-consuming examples. For example, we may need to parse an XML file to get some values; it would save a lot of effort if we could fill in the values of multiple metrics at the same time. I thought it might be OK for the python module to just return Value1:Value2:Value3, but in that case, we will not be able to get pictures from ganglia/rrdtools.
The way to do this is to spawn a thread in your python module that gathers all of the metrics at once and then just caches them. Then when each of your metric handlers is called, it simply returns the cached value. Take a look at the tcpconn.py module. This is the way that it works. It calls netstat once to gather all of the various values. Then the different metric handlers just read the current values from the cache.
Another question about the python module: for some dynamically changing metrics (e.g. we may need to handle adding/removing storage devices), could we add/remove metric entries in pyconf on the fly? I noticed that gmond -m is able to list all metric entries according to metric_init. Is there any way to let gmond collect metrics according to the metric entries given by the metric_init of the python script (instead of from conf.d/*.pyconf)? Or could we go still further and let gmond call metric_init periodically? Any comments? Thanks.
This is an issue that I have looked into a few times but haven't been able to come up with a solution without re-architecting the way that gmond creates its internal list of metrics. This would be a nice feature to have. The spoofing feature for module metrics has a dynamic element to it so there may be something that we can do to extend that functionality for general metrics. But we need to get the spoofing feature backported to 3.1.1 first, which is still waiting for review and approval.
Brad
Regards, Hang
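As a rough illustration of the caching approach described in the reply above (one background thread gathers everything, each metric handler only reads the cache), here is a minimal sketch in modern Python. It is not the code from tcpconn.py. The metric names disk_total, disk_used and disk_free, the "path" and "refresh" parameters, and the helper names are all assumptions made for this example; the descriptor keys follow the usual gmond python module layout.

```python
import os
import threading

_cache = {}                  # metric name -> most recently gathered value
_lock = threading.Lock()     # protects _cache
_stop = threading.Event()    # set by metric_cleanup() to end the worker


def _gather(path):
    """Make ONE statvfs call and derive all three values from it."""
    st = os.statvfs(path)
    total = float(st.f_blocks) * st.f_frsize
    free = float(st.f_bfree) * st.f_frsize
    return {"disk_total": total, "disk_used": total - free, "disk_free": free}


def _worker(path, interval):
    """Background loop: refresh the whole cache every `interval` seconds."""
    while not _stop.is_set():
        values = _gather(path)
        with _lock:
            _cache.update(values)
        _stop.wait(interval)


def metric_handler(name):
    """Per-metric callback: it never touches the disk, it reads the cache."""
    with _lock:
        return _cache.get(name, 0.0)


def metric_init(params):
    """Start the gathering thread and return one descriptor per metric."""
    path = params.get("path", "/")              # hypothetical parameter
    interval = float(params.get("refresh", 10))  # hypothetical parameter
    with _lock:
        _cache.update(_gather(path))             # prime the cache once
    worker = threading.Thread(target=_worker, args=(path, interval))
    worker.daemon = True
    worker.start()
    return [{
        "name": name,
        "call_back": metric_handler,
        "time_max": 90,
        "value_type": "double",
        "units": "bytes",
        "slope": "both",
        "format": "%.0f",
        "description": name.replace("_", " "),
        "groups": "disk",
    } for name in ("disk_total", "disk_used", "disk_free")]


def metric_cleanup():
    """Ask the background thread to exit."""
    _stop.set()
```

Because all three handlers share one cache, the expensive call happens once per refresh interval no matter how many metrics are derived from it.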
Re: [Ganglia-general] metric name and spoofed metrics
On 10/3/2008 at 12:23 PM, in message [EMAIL PROTECTED], Martin Hicks [EMAIL PROTECTED] wrote:
(sorry if this is a duplicate. I sent it yesterday but I haven't seen it come back yet, nor has it shown up in the mailing list archives on sourceforge)
Hi, I backported the spoofing patches to my 3.1.1 build (I also checked this against trunk to make sure I hadn't missed something) in order to play with python DSO and spoofing. I ran into a question. If you have the same metric for a bunch of different hosts, just with different SPOOF_HOST, then how do you tell the difference in the call_back? You just get the 'name', which appears to always be the same.
[SNIP]
Am I expected to deal with each SPOOF_HOST when a call_back occurs for a particular metric name? I kind of expected SPOOF_NAME to be in the 'name' argument of the call_back.
For the spoof modules that I wrote, I appended the host name to the name of the metric and used it as a unique metric identifier:
metric_name = name + ':' + host_name
Then in the handler I just call
metric_keys = name.split(':')
and I end up with an array that uniquely identifies which metric for which host. Finally, to make sure that the concatenation of the metric_name:host_name does not show up in the front end as the title of the graph, I make sure to assign a title to the metric in the gmond.conf file. This is just one way to uniquely identify any spoofed metric. Since gmond has no idea what a metric module is doing or what naming convention is being used, it simply leaves all of that up to the module itself rather than trying to enforce some policy.
Brad
Re: [Ganglia-general] metric name and spoofed metrics
On 10/6/2008 at 2:23 PM, in message [EMAIL PROTECTED], Martin Hicks [EMAIL PROTECTED] wrote:
On Mon, Oct 06, 2008 at 11:11:51AM -0600, Brad Nicholes wrote:
On 10/3/2008 at 12:23 PM, in message
Am I expected to deal with each SPOOF_HOST when a call_back occurs for a particular metric name? I kind of expected SPOOF_NAME to be in the 'name' argument of the call_back.
For the spoof modules that I wrote, I appended the host name to the name of the metric and used it as a unique metric identifier:
metric_name = name + ':' + host_name
Then in the handler I just call
metric_keys = name.split(':')
and I end up with an array that uniquely identifies which metric for which host. Finally, to make sure that the concatenation of the metric_name:host_name does not show up in the front end as the title of the graph, I make sure to assign a title to the metric in the gmond.conf file. This is just one way to uniquely identify any spoofed metric. Since gmond has no idea what a metric module is doing or what naming convention is being used, it simply leaves all of that up to the module itself rather than trying to enforce some policy.
Does this not also mean I need to know at gmond start-up time which hosts the module will be spoofing for and rewrite /etc/ganglia/conf.d/module.pyconf to reflect the list of nodes that will need to be spoofed?
No. Sorry, the colon-separated name/id actually has more significance than I stated previously. The metric_name:host_name is actually a derivative of the same style used by gmetric in the --spoof=STRING parameter. Part of the patch to gmond.c was to add a function called get_metric_names() which, for a spoofed metric, specifically looks for a colon-separated metric name, pulls the first name from the string and uses that to match against the name that is referenced in the gmond.conf file. So for example if your metric name is my_foo_metric then the name that is referenced in gmond.conf should also be my_foo_metric. But the name that your module actually assigns as the metric name in the metric definition structure should be my_foo_metric:some_host_or_other_id. This needs to be documented in the README.in file that explains how to write a python metric module. I also plan on adding this same documentation to the wiki once the patch has been backported.
Brad
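The naming convention described in the reply above can be sketched in a few lines of Python. This only illustrates the string handling: the helper names (make_metric_name, split_metric_name, conf_name) and the sample host ids are invented for the example, and any spoof-specific descriptor fields that the real patch may require are deliberately left out because they are not shown in this thread.

```python
def make_metric_name(name, host_id):
    """Build the unique per-host name a module would put in its descriptor."""
    return name + ":" + host_id


def split_metric_name(metric_name):
    """Recover (base name, host id) inside a handler.

    Splitting only on the first colon is an assumption of this sketch; it
    keeps the base name intact even if the host id itself contains a colon.
    """
    base, _, host_id = metric_name.partition(":")
    return base, host_id


def conf_name(metric_name):
    """The name to reference in gmond.conf: the first colon-separated part.

    This mirrors what the thread says get_metric_names() does in gmond.c;
    it is not a copy of that function.
    """
    return metric_name.split(":")[0]


# Hypothetical per-host values, keyed by (base name, host id).
_values = {
    ("my_foo_metric", "node-a"): 3,
    ("my_foo_metric", "node-b"): 7,
}


def spoof_handler(name):
    """One handler serves every host; the name says which host is wanted."""
    return _values.get(split_metric_name(name), 0)
```

With this layout a single handler and a single gmond.conf entry (my_foo_metric) can serve any number of spoofed hosts, since the host id travels inside the metric name the module assigns.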
Re: [Ganglia-general] mod_python on Solaris not scanning directory
On 9/25/2008 at 6:08 AM, in message [EMAIL PROTECTED], Gilad Raphaelli [EMAIL PROTECTED] wrote:
- Original Message
From: Brad Nicholes [EMAIL PROTECTED]
To: Lieting Yu [EMAIL PROTECTED]; ganglia-general@lists.sourceforge.net; Gilad Raphaelli [EMAIL PROTECTED]
Sent: Thursday, September 25, 2008 1:05:11 AM
Subject: Re: [Ganglia-general] mod_python on Solaris not scanning directory
On 9/23/2008 at 7:03 PM, in message [EMAIL PROTECTED], Gilad Raphaelli wrote:
Lieting, I believe I ran into the same issue and cleared it up with this patch to mod_python.c:
--- mod_python.c.orig 2008-09-24 10:52:17.0 +1000
+++ mod_python.c 2008-09-24 10:59:06.0 +1000
@@ -510,6 +510,11 @@
 /* Set up the python path to be able to load module from our module path */
 set_python_path(path);
 Py_Initialize();
+
+PyObject *sys_path = PySys_GetObject("path");
+PyObject *addpath = PyString_FromString(path);
+PyList_Append(sys_path, addpath);
+
 PyEval_InitThreads();
 gtstate = PyEval_SaveThread();
This does the equivalent of a 'sys.path.append(path)' without any error handling. I haven't heard back from the developer list whether this is the right approach but perhaps it will get you going for now.
The call to the function set_python_path(path) should have done the same thing by setting up the PYTHONPATH environment variable. Are you suggesting that we add the above code as well or that the above code should replace the code that already exists in set_python_path()?
I'm not exactly an authority on embedding the python interpreter in C but I believe I ran into the same issue that Lieting is reporting, on a rhel4 box, and solved it with the code above. Coincidentally, the question of the 'correct' way to set the module import path came up just a few days before I had this problem - http://mail.python.org/pipermail/python-list/2008-September/506206.html. I think that just doing the above without setting the PYTHONPATH environment variable should work regardless of the platform.
OK. Do you know which version of Python introduced the PySys_GetObject() function? It seems to be an undocumented function until python 2.6. It might be good to know what the issue is with this function and why it wasn't documented before we start using it.
Brad
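For readers less familiar with the C API, the Python-level operation that the patch above is said to be equivalent to ('sys.path.append(path)') can be sketched as follows. The function name add_module_path and the example directory are made up for illustration, and the duplicate check is an addition of this sketch rather than something the patch does.

```python
import sys


def add_module_path(path, search_path=None):
    """Append `path` to the import search list unless it is already there.

    `search_path` defaults to sys.path; passing a plain list makes the
    behaviour easy to try out without touching the running interpreter.
    """
    if search_path is None:
        search_path = sys.path
    if path not in search_path:
        search_path.append(path)
    return search_path


# Example: make a (hypothetical) module directory importable.
# add_module_path("/usr/lib64/ganglia/python_modules")
```

Unlike setting PYTHONPATH, which is read when the interpreter starts, modifying sys.path takes effect immediately for later imports.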
Re: [Ganglia-general] mod_python on Solaris not scanning directory
On 9/23/2008 at 7:03 PM, in message [EMAIL PROTECTED], Gilad Raphaelli [EMAIL PROTECTED] wrote:
Lieting, I believe I ran into the same issue and cleared it up with this patch to mod_python.c:
--- mod_python.c.orig 2008-09-24 10:52:17.0 +1000
+++ mod_python.c 2008-09-24 10:59:06.0 +1000
@@ -510,6 +510,11 @@
 /* Set up the python path to be able to load module from our module path */
 set_python_path(path);
 Py_Initialize();
+
+PyObject *sys_path = PySys_GetObject("path");
+PyObject *addpath = PyString_FromString(path);
+PyList_Append(sys_path, addpath);
+
 PyEval_InitThreads();
 gtstate = PyEval_SaveThread();
This does the equivalent of a 'sys.path.append(path)' without any error handling. I haven't heard back from the developer list whether this is the right approach but perhaps it will get you going for now.
The call to the function set_python_path(path) should have done the same thing by setting up the PYTHONPATH environment variable. Are you suggesting that we add the above code as well or that the above code should replace the code that already exists in set_python_path()?
Brad
Re: [Ganglia-general] [Ganglia-developers] can't get cpu_num to show for whole cluster
On 9/12/2008 at 11:48 AM, in message [EMAIL PROTECTED], Bernard Li [EMAIL PROTECTED] wrote:
Hi all:
On Fri, Sep 12, 2008 at 10:00 AM, Ofer Inbar [EMAIL PROTECTED] wrote:
I added a host to an existing cluster, and noticed the total number of CPU cores for the cluster fluctuate, so I tried restarting all the gmonds in the cluster... but that just made most of the CPUs appear to disappear from Ganglia metrics altogether.
I narrowed it down to this: each gmond only reports cpu_num for nodes that restarted after it. If I restart gmond on node1, it reports cpu_num for itself only, even though other gmonds in the cluster are reporting cpu_num for other nodes. If I restart gmond on node2, node1 will now report cpu_num for itself and for node2 ... but node2 has now forgotten cpu_num for all other nodes except itself. And so on. It's a catch-22. I can't make them all see every node's metrics. Ganglia 3.1.0 on CentOS, using multicast only.
Since I have a 3.0.7 installation handy, Cos suggested I do an experiment:
1) With Ganglia 3.1.1, I restart a gmond that is not listed in the data_source and I nc the host, checking for the number of cpu_num lines in the XML stream. This number stays at 1 for quite a while (maybe 10 mins).
2) With Ganglia 3.0.7, I did the same test as above, and immediately after restarting gmond the number of cpu_num lines was already at 3, and it grows quite quickly in a matter of minutes.
When I first tried 3.1.x, I always thought it odd that when I restart a gmond, I had to restart *all* the rest of the gmonds to get the right number of total cpus. I guess this confirms my suspicion. If this is indeed a bug/unwanted new behaviour, please discuss this in ganglia-developers.
I am wondering if this might be an issue with the way that the metadata for a metric is being sent. The unique attribute about this is that cpu_num is a collect_once metric. This means that if the data value is sent but one of the gmonds in the cluster has not received the metadata yet, the value may get ignored when the XML is written. One interesting test to try to validate this theory would be to set the send_metadata_interval to something greater than zero even in a multicast environment. Then run your same test and see if the same problem shows up or goes away. If the problem goes away, then we might have to rework how the metadata is being requested and sent in a multicast environment.
Brad
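The experiment described above (connect to a gmond with nc and count the cpu_num lines in the XML stream) can be repeated with a short Python script, sketched here. The function names are invented for the example. The port 8649 matches the tcp_accept_channel shown elsewhere in this archive; the assumption that NAME is the first attribute of each METRIC element is based on typical gmond output and may need adjusting.

```python
import re
import socket

# Matches the start of a cpu_num metric element in gmond's XML dump.
CPU_NUM_RE = re.compile(r'<METRIC\s+NAME="cpu_num"')


def count_cpu_num(xml_text):
    """Return how many hosts in the XML dump carry a cpu_num metric."""
    return len(CPU_NUM_RE.findall(xml_text))


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the full XML dump gmond writes to anyone who connects."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:          # gmond closes the socket when it is done
                break
            chunks.append(data)
    # latin-1 never fails on arbitrary bytes, which is enough for counting.
    return b"".join(chunks).decode("latin-1")


# Example (needs a reachable gmond, so it is left commented out):
# print(count_cpu_num(fetch_gmond_xml("node1")))
```

Running the count against the same node before and after restarting another gmond would show whether the number of cpu_num entries recovers on its own or stays stuck.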
[Ganglia-general] [ANNOUNCEMENT] Official release of Ganglia 3.1.1
The Ganglia Project (http://ganglia.info) is pleased to announce the official release of Ganglia 3.1.1.
The official tarball is available for immediate download at: http://sourceforge.net/project/showfiles.php?group_id=43021&package_id=35280&release_id=625044
For a full description of the bug fixes and enhancements that are included in the 3.1.1 release as well as upgrade information, please see the current release notes at: http://ganglia.wiki.sourceforge.net/ganglia_release_notes
Supported platforms:
* Linux (Fedora/RedHat/CentOS, Debian, Gentoo, SuSE/OpenSuSE)
* [Open]Solaris
* FreeBSD
* NetBSD
* OpenBSD
* DragonflyBSD
* Cygwin (no support for DSO yet)
* AIX (no support for DSO yet)
Please read all the README, INSTALL and other available documentation (http://ganglia.wiki.sourceforge.net) as a lot of things have changed since version 3.0. Use good deployment practices when upgrading from 3.0 to make sure that you do not mix gmond 3.0 and 3.1 nodes in the same cluster (as defined by a multicast address or unicast collector node). The protocol that allows gmond nodes to communicate within the same cluster has changed. However the XML packets that are passed between gmond and gmetad have remained compatible from 3.0.x to 3.1.x, allowing a 3.1.x gmetad to continue to pull data from an older 3.0.x gmond cluster.
Ganglia Development Team
Re: [Ganglia-general] gmetad bug when gmond host hangs
On 9/1/2008 at 3:35 PM, in message [EMAIL PROTECTED], Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote:
On Sat, Aug 30, 2008 at 10:09:02AM -0600, Brad Nicholes wrote:
On 8/30/2008 at 12:25 AM, in message [EMAIL PROTECTED], Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote:
On Fri, Aug 29, 2008 at 02:40:00PM -0400, Ofer Inbar wrote:
Should this have made it into 3.1, or 3.1.1? It doesn't look like it.
There is a fix in trunk now with r1738 and, unless something goes wrong with it, it will most likely be released with 3.1.2 and 3.0.8.
The proposed backport patch for 3.1.x has been updated in the BUG and officially requested for inclusion in 3.1. Beware that it includes 1 extra unrelated change that is needed to prevent future conflicts for backporting but that is otherwise mostly irrelevant, especially if making your own package. It also includes additional changes that simplify the logic and avoid a possible failure of logic which could result in gmetad crashing, so using this newer version is encouraged: http://bugzilla.ganglia.info/cgi-bin/bugzilla/attachment.cgi?id=165&action=view
3.1.1 is already in testing and since this bug is not a showstopper for that specific release, I'd be surprised if the release manager decides it should be backported to it, but that shouldn't prevent you patching your own package with the proposed patch if you don't want to wait.
If we are confident enough that the patch for this bug solves the problem
I am fairly confident that the patch resolves the problem reported in BUG92, which matches the description of the problem from Cos and is easy to replicate by just sending a SIGSTOP to gmond; otherwise I wouldn't have proposed it here, in bugzilla and the STATUS file.
then I have no problem scrapping 3.1.1, tagging and rolling 3.1.2 and restarting the testing period. The whole point of the testing period is to flush out problems like this and then determine if the fix is important enough to retag and retest.
agree, but doing so will delay releasing the next version of 3.1 and also indirectly (as I won't start that until 3.1 is out to avoid confusion and overstressing our limited testing resources) the next release for 3.0. 3.1.1 includes a fix for a showstopper bug (gmetad crash in hierarchical configuration) and a fix for a high bug (instability with tcpconn.py) which had been already backported in fedora and gentoo distribution packages, but not debian, our own packages or anyone using 3.1 from sources (all other architectures where there are yet no packages available) and that are therefore waiting for a 3.1.1 release.
We need some feedback on how serious this problem is
Thanks Carlo, this is some good feedback. I know that both Bernard and Cos have reported having issues with this bug. Could either (or both) of you independently confirm that this patch fixes the problem? Also, is this issue serious enough to stop 3.1.1, add the patch to a 3.1.2 tarball and restart the testing period? Since we are only a week into the current testing cycle, this would delay the release of 3.1.2 by a week (considering a two week testing cycle for 3.1.2). Given the bug fixes that were mentioned by Carlo that have already been included in the 3.1.1 tarball, is a week's delay for 3.1.2, which would include this gmetad patch, worth it? The alternative would be to continue with 3.1.1 as scheduled and have this patch wait for the next release cycle which will probably be at least a couple of months out. I need to know quickly so that we can get moving on a 3.1.2 tarball if that is the consensus. If I don't hear any feedback on this issue by Thurs (9/4), I will assume that we are good with 3.1.1 and allow the testing cycle to continue as scheduled. Opinions? Feedback? Comments?
Brad
Re: [Ganglia-general] gmetad bug when gmond host hangs
On 8/30/2008 at 12:25 AM, in message [EMAIL PROTECTED], Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote:
On Fri, Aug 29, 2008 at 02:40:00PM -0400, Ofer Inbar wrote:
Should this have made it into 3.1, or 3.1.1? It doesn't look like it.
There is a fix in trunk now with r1738 and, unless something goes wrong with it, it will most likely be released with 3.1.2 and 3.0.8. 3.1.1 is already in testing and since this bug is not a showstopper for that specific release, I'd be surprised if the release manager decides it should be backported to it, but that shouldn't prevent you patching your own package with the proposed patch if you don't want to wait.
If we are confident enough that the patch for this bug solves the problem, then I have no problem scrapping 3.1.1, tagging and rolling 3.1.2 and restarting the testing period. The whole point of the testing period is to flush out problems like this and then determine if the fix is important enough to retag and retest. We need some feedback on how serious this problem is and how confident we are in the fix. I would rather throw away 3.1.1 now in favor of a fixed 3.1.2 than have to do this all over again next month.
Brad
[Ganglia-general] [ANNOUNCEMENT] Ganglia 3.1.1 tarball ready for testing...
In an effort to continue improving the Ganglia software, the Ganglia Project has released an official testing release of Ganglia 3.1.1. The testing tarball is available for immediate download at: http://www.ganglia.info/testing/
The intent of this testing release of Ganglia 3.1.1 is to validate that the source code is stable and that the bug fixes and enhancements that have been added since the previous release of the software are ready for general release. The release procedure from this point has been documented on the Ganglia wiki site at http://ganglia.wiki.sourceforge.net/ganglia_works under the heading Generating a Release Candidate and GA Release. Basically the Ganglia 3.1.1 testing tarball has been rolled and made available for testing by the Ganglia community. All bugs found in this testing release should be immediately reported through bugzilla (http://bugzilla.ganglia.info) and can be posted to the [EMAIL PROTECTED] mailing list as well. If the bug report is also accompanied by a bug fix patch, this will help avoid delays in producing new testing tarballs and ultimately an official general release of the software. If any critical level bugs are discovered, the current testing release tarball will be thrown away and a new tarball will be rolled and made available for further testing. Once a testing release tarball has been validated by the Ganglia community to be stable and ready for general availability, that tarball will become the official Ganglia 3.1.x release.
So basically the sooner we are able to test and validate the Ganglia 3.1 source code, the sooner the project will be able to create an official release. But we need your help to get this done. Any and all testing and feedback, positive or negative, will be greatly appreciated. There will be a two week testing period for this 3.1.1 tarball which begins from the date of this announcement. So please help us to make sure that the tarball is valid and stable by building and installing it on any size of testing environment.
Known issues with this testing release will be addressed on the Ganglia wiki site at: http://ganglia.wiki.sourceforge.net/Testing_Release_Notes
For those who are interested in upgrading from a current 3.0.x installation, please see the current release notes at: http://ganglia.wiki.sourceforge.net/ganglia_release_notes
Supported platforms (additional testing requested):
* Linux (Fedora/RedHat/CentOS, Debian, Gentoo, SuSE/OpenSuSE)
* [Open]Solaris
* FreeBSD
* NetBSD
* OpenBSD
* DragonflyBSD
* Cygwin (no support for DSO yet)
* AIX (no support for DSO yet)
Please read all the README, INSTALL and other available documentation (http://ganglia.wiki.sourceforge.net) as a lot of things have changed since 3.0.7. Use good deployment practices when upgrading from 3.0.x to make sure that you do not mix gmond 3.0 and 3.1 nodes in the same cluster (as defined by a multicast address or unicast collector node). The protocol that allows gmond nodes to communicate within the same cluster has changed. However the XML packets that are passed between gmond and gmetad have remained compatible from 3.0.x to 3.1.x, allowing a 3.0.x gmetad to continue to pull data from a newer 3.1.x gmond cluster.
happy testing
Re: [Ganglia-general] Interrupted system call error shown by gmond -d 9
On 8/15/2008 at 4:11 PM, in message [EMAIL PROTECTED], Sid Stuart [EMAIL PROTECTED] wrote:
Hi, Has anyone else seen this error when running gmond in debug mode (gmond -d 9)?
loaded module: python_module
udp_recv_channel mcast_join=239.2.11.82 mcast_if=NULL port=8649 bind=239.2.11.82
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib64/python2.4/threading.py", line 442, in __bootstrap
self.run()
File "/usr/lib64/ganglia/python_modules/tcpconn.py", line 252, in run
poll_events = fd_poll.poll()
error: (4, 'Interrupted system call')
tcp_accept_channel bind=NULL port=8649
udp_send_channel mcast_join=239.2.11.82 mcast_if=NULL host=NULL port=8649
I know tcpconn.py has an issue with gmond -m, but this is with gmond in standard mode. I am looking at it because all the tcpconn graphs on my system display lines that are zeroed (no values). Note I am running the Fedora Ganglia RPMs and they have been buggy in the past. :)
Sid
Yes, this has been fixed in trunk and the 3.1.x branch. If you want to try it, just grab the latest version of tcpconn.py from SVN and copy it to the python modules directory. Then restart gmond and see if that solves the problem.
Brad
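The "Interrupted system call" error in the traceback above is errno 4 (EINTR): a blocking call such as poll() was interrupted by a signal and gave up instead of being retried. The actual fix that went into tcpconn.py is not shown in this thread; the following is only a generic sketch of the usual remedy, with a made-up helper name, written for modern Python. On Python 3.5 and later the interpreter already retries most such calls itself, so a wrapper like this matters mainly for older interpreters such as the 2.4 seen in the traceback.

```python
import errno


def retry_on_eintr(func, *args, **kwargs):
    """Call func(*args, **kwargs), retrying while it fails with EINTR.

    Any other error is re-raised unchanged, so real failures still surface.
    """
    while True:
        try:
            return func(*args, **kwargs)
        except OSError as exc:
            if exc.errno != errno.EINTR:
                raise
            # Interrupted by a signal: loop round and try the call again.


# Usage inside a polling thread (fd_poll being a select.poll() object):
# poll_events = retry_on_eintr(fd_poll.poll)
```

Without such a retry the exception escapes run(), the gathering thread dies, and the cached values stop updating, which would fit the flat zero graphs described in the report.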
Re: [Ganglia-general] Patch - no_extra_data
On Thu, Aug 14, 2008 at 3:33 PM, in message [EMAIL PROTECTED], Doug Nordwall [EMAIL PROTECTED] wrote:
Here's a patch for ganglia. It allows the no_extra_data option to be added to the config file. When this is set to yes, it will not send any EXTRA_DATA or EXTRA_ELEMENTS in the xml. When set to no (the default) the normal ganglia output is kept. It is against 3.1.0. Documentation has been patched as well.
I updated the patch so that it works against trunk. I also changed the directive to be positive rather than negative (allow_extra_data vs no_extra_data). Check out the new patch in bugzilla. Also the documentation portion of the patch couldn't be applied. The .pod files need to be updated.
Brad
Re: [Ganglia-general] Debugging Gmond Python Metric Module
On 8/13/2008 at 10:42 AM, in message [EMAIL PROTECTED], Sid Stuart [EMAIL PROTECTED] wrote:
After fixing a tabbing bug in your cacheHits() function, everything loaded fine and the callback function was called as it should be. The callback didn't actually work on my system, but that is a different problem. Make sure that you don't have any python syntax errors in your module. Otherwise the module won't load.
Brad
Hi Brad, Thanks for going to all the effort. I think the tabbing bug was inserted by cut and paste as the cacheHits() function works on my system. I am beginning to believe the problem is with running any Python Metric Module. I have commented my module out of the system and inserted the tcpconn.pyconf into the configuration. When I run gmond -d 9 in that configuration, I get a similar error message,
gmond -d 9
loaded module: core_metrics
loaded module: cpu_module
loaded module: disk_module
loaded module: load_module
loaded module: mem_module
loaded module: net_module
loaded module: proc_module
loaded module: sys_module
udp_recv_channel mcast_join=239.2.11.82 mcast_if=NULL port=8649 bind=239.2.11.82
tcp_accept_channel bind=NULL port=8649
udp_send_channel mcast_join=239.2.11.82 mcast_if=NULL host=NULL port=8649
Unable to collect metric 'tcp_established' on this platform. Exiting.
The tcp_established metric is the first metric in the tcpconn collection group. Are Python Metric Modules supposed to work with gmond version 3.1.0.1399?
Sid
Yes, python metric module support was one of the major features of Ganglia 3.1. Which version of python are you using? There are some issues with versions older than 2.4. Did you try the -m parameter on gmond to see if your metric is showing up in the list? In the debug listing that you provided, I don't see mod_python.so being loaded. Are you sure that mod_python is configured and loaded? You should have a modpython.conf file along with the rest of your .conf files in /etc/ganglia/conf.d. If mod_python isn't even being loaded, then there isn't any python module support.
Brad
Re: [Ganglia-general] Debugging Gmond Python Metric Module
On 8/12/2008 at 3:03 PM, in message [EMAIL PROTECTED], Sid Stuart [EMAIL PROTECTED] wrote:
Hi, I have written a small Python metric module that contains one metric, CacheHits. When the module is included in the configuration, gmond spits out the following error message,
Unable to collect metric 'CacheHits' on this platform. Exiting.
As far as I can tell, gmond is parsing the configuration file and accepting the data in it, but cannot seem to find/run the handler for the metric. I have compiled and run the Python module and it works standalone. Is there any way to get gmond to provide more detail on what is failing? The configuration file and the python module are included below,
This message usually means that something was wrong with the metric definition and gmond doesn't recognize the metric name. If you invoke gmond with a -m parameter to list out all of the metrics, does your CacheHits metric show up in the list? If not then your module either isn't being loaded or the metric definition that is returned by your module has a problem.
Brad
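Since the reply above points at the metric definition returned by the module, it can help to check that definition outside gmond before loading it. The sketch below shows a one-metric module in the usual gmond python module shape plus a small self-check. Sid's actual module is not included in this archive, so the body of cache_hits is a placeholder; check_descriptors and REQUIRED_KEYS are invented for the example, and the key list reflects what the example modules shipped with Ganglia populate rather than a documented minimum.

```python
# Keys a descriptor is expected to carry (an assumption of this sketch).
REQUIRED_KEYS = ("name", "call_back", "time_max", "value_type",
                 "units", "slope", "format", "description")


def cache_hits(name):
    """Metric callback; a real module would compute the value here."""
    return 42  # placeholder value for the sketch


def metric_init(params):
    """Return the list of metric descriptors gmond should know about."""
    return [{
        "name": "CacheHits",          # must match the name in the .pyconf
        "call_back": cache_hits,
        "time_max": 90,
        "value_type": "uint",
        "units": "hits",
        "slope": "both",
        "format": "%u",
        "description": "Number of cache hits",
        "groups": "cache",
    }]


def metric_cleanup():
    """Nothing to release in this sketch."""
    pass


def check_descriptors(descriptors):
    """Return a list of human-readable problems; empty means none found."""
    problems = []
    for desc in descriptors:
        label = desc.get("name", "<unnamed>")
        missing = [key for key in REQUIRED_KEYS if key not in desc]
        if missing:
            problems.append("%s: missing %s" % (label, ", ".join(missing)))
        elif not callable(desc["call_back"]):
            problems.append("%s: call_back is not callable" % label)
    return problems


if __name__ == "__main__":
    descriptors = metric_init({})
    for problem in check_descriptors(descriptors):
        print(problem)
    for desc in descriptors:
        print(desc["name"], desc["call_back"](desc["name"]))
```

Running the file directly exercises metric_init and each callback once, which also surfaces indentation and syntax errors of the kind mentioned elsewhere in this thread before gmond ever tries to load the module.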
Re: [Ganglia-general] Debugging Gmond Python Metric Module
On 8/12/2008 at 3:03 PM, in message [EMAIL PROTECTED], Sid Stuart [EMAIL PROTECTED] wrote:
Hi, I have written a small Python metric module that contains one metric, CacheHits. When the module is included in the configuration, gmond spits out the following error message,
Unable to collect metric 'CacheHits' on this platform. Exiting.
As far as I can tell, gmond is parsing the configuration file and accepting the data in it, but cannot seem to find/run the handler for the metric. I have compiled and run the Python module and it works standalone. Is there any way to get gmond to provide more detail on what is failing?
After fixing a tabbing bug in your cacheHits() function, everything loaded fine and the callback function was called as it should be. The callback didn't actually work on my system, but that is a different problem. Make sure that you don't have any python syntax errors in your module. Otherwise the module won't load.
Brad
Re: [Ganglia-general] [ANNOUNCEMENT] Official release of Ganglia 3.1.0
For those that haven't seen it yet, check out the press release that Groundwork Open Source did for the Ganglia 3.1 release.

http://www.marketwatch.com/news/story/ganglia-31-provides-easy-customize/story.aspx?guid=%7B39B62CFF-06F1-40DC-A00F-3F4A5B8FFEB1%7D&dist=hppr

Thanks to everybody that helped get Ganglia 3.1.0 out the door.

Brad

On 7/30/2008 at 2:42 PM, in message [EMAIL PROTECTED], Brad Nicholes [EMAIL PROTECTED] wrote:

[snip]
[Ganglia-general] [ANNOUNCEMENT] Official release of Ganglia 3.1.0
The Ganglia Project (http://ganglia.info) is pleased to announce the first official release of Ganglia 3.1.0.

The official tarball is available for immediate download at:

http://sourceforge.net/project/showfiles.php?group_id=43021&package_id=35280&release_id=616721

Please refer to http://ganglia.wiki.sourceforge.net/ganglia_release_notes for more information.

The main features of this release are:

* Introduction of a modular metric interface for C and Python (DSO support)
* Scriptable metric module support with Python
* All pre-existing metrics (CPU, network, disk, memory, etc.) converted to metric modules
* Introduction of new metric modules multicpu, multidisk and tcp_conn status
* Modular frontend graph support
* Metric groups which can be viewed or hidden as desired
* Additional scaling capacity for systems with memory greater than 4TB
* Platform support for DragonFlyBSD
* Improved native metric support for Windows (Built with CygWin)
* Bug fixes and Enhancements

Supported platforms:

* Linux (Fedora/RedHat/CentOS, Debian, Gentoo, SuSE/OpenSuSE)
* [Open]Solaris
* FreeBSD
* NetBSD
* OpenBSD
* DragonflyBSD
* Cygwin (no support for DSO yet)
* AIX (no support for DSO yet)

Please read all the README, INSTALL and other available documentation (http://ganglia.wiki.sourceforge.net) as a lot of things have changed since 3.0.7.

Use good deployment practices when upgrading from 3.0.x to make sure that you do not mix gmond 3.0 and 3.1 nodes in the same cluster (as defined by a multicast address or unicast collector node). The protocol that allows gmond nodes to communicate within the same cluster has changed. However, the XML packets that are passed between gmond and gmetad have remained compatible from 3.0.x to 3.1.x, allowing a 3.0.x gmetad to continue to pull data from a newer 3.1.x gmond cluster.

For those who are interested in upgrading from a 3.0.x installation, your current gmond and gmetad configuration files will need to be moved from their current location to /etc/ganglia. If you are attempting the upgrade via an RPM, the RPM will automatically move your current configuration file to the new location. However, for gmond, the 3.0.x conf file will not work. Please use the patch file gmond-3.1.patch available at http://www.ganglia.info/releases/ to patch your gmond.conf prior to starting, otherwise gmond will fail to start.

There are several known issues with the current release, which include the following:

* no support for C++ to create DSO modules
* no spoofing from modular metrics (use gmetric if spoofing is needed)
* race condition for tcpconn python metric module (affects gmond -m)
* libdir issues related to building for 64bit platforms
* known build issues for platforms:
  - Darwin (AKA MacOS/X)
  - HPUX
  - Tru64 (AKA OSF/1)
  - Irix

Many of the above issues are being addressed and patches will be applied for the next minor release of Ganglia 3.1.x. In addition, more information about the current official release can be found on the Ganglia wiki at http://ganglia.wiki.sourceforge.net/ganglia_release_notes.

Ganglia Development Team
Re: [Ganglia-general] [Ganglia-developers] [ANNOUNCEMENT] Ganglia 3.1.0 tarball ready for testing...
It has been two weeks since the announcement of the availability of the Ganglia 3.1.0 testing tarball. Since that time, I haven't seen any reports of showstopper issues. Unless there are any objections or critical bug reports that have not yet been reported, I propose that we release the 3.1.0 tarball as first official release of the Ganglia 3.1.x series. Comments? Votes? Brad On 7/15/2008 at 1:57 PM, in message [EMAIL PROTECTED], Brad Nicholes [EMAIL PROTECTED] wrote: The Ganglia Project is pleased to announce the first official testing release of Ganglia 3.1.x. The testing tarball is available for immediate download at: http://www.ganglia.info/testing/ The intent of this first testing release of Ganglia 3.1.x is to validate that the source code is stable and that the new feature set that is included in the 3.1 version of the software, is ready for general release. The release procedure from this point has been documented on the Ganglia wiki site at http://ganglia.wiki.sourceforge.net/ganglia_works under the heading Generating a Release Candidate and GA Release. Basically the Ganglia 3.1.0 testing tarball has been rolled and made available for testing by the Ganglia community. All bugs found in this testing release should be immediately reported through bugzilla (http://bugzilla.ganglia.info) and can be posted to the [EMAIL PROTECTED] mailing list as well. If the bug report is also accompanied by a bug fix patch, this will help avoid delays in producing new testing tarballs and ultimately an official general release of the software. If any critical level bugs are discovered, the current testing release tarball will be thrown away and a new tarball will be rolled and made available for further testing. Once a testing release tarball has been validated by the Ganglia community to be stable and ready for general availability, that tarball will become the official Ganglia 3.1.x release. 
So basically the sooner we are able to test and validate the Ganglia 3.1 source code, the sooner the project will be able to create an official release. But we need your help to get this done. Any and all testing and feedback, positive or negative, will be greatly appreciated.

There are several known issues with the current release, which include the following:

* no support for C++ to create DSO modules
* no spoofing from modular metrics (use gmetric if spoofing is needed)
* race condition for tcpconn python metric module (affects gmond -m)
* libdir issues related to building for 64bit platforms
* known build issues for platforms:
  - Darwin (AKA MacOS/X)
  - HPUX
  - Tru64 (AKA OSF/1)
  - Irix

Many of the above issues are being addressed and patches will be applied for the next minor release of Ganglia 3.1.x. In addition, more information about the current testing release or official release can be found on the Ganglia wiki at http://ganglia.wiki.sourceforge.net/ganglia_release_notes.

For those who are interested in upgrading from a current 3.0.x installation, your current gmond and gmetad configuration files will need to be moved from their current location to /etc/ganglia. If you are attempting the upgrade via an RPM, the RPM will automatically move your current configuration file to the new location. However, for gmond, the 3.0.x conf file will not work. Please use the patch file gmond-3.1.patch (available from the testing URL above) to patch your gmond.conf prior to starting, otherwise gmond will fail to start.
The main features of this release are:

* Dynamically loaded metric module support (DSO)
* Scriptable metric module support with Python
* Modular frontend graph support
* Platform support for DragonFlyBSD
* Improved native metric support for Windows (Built with CygWin)
* Bug fixes and Enhancements

Supported platforms (additional testing requested):

* Linux (Fedora/RedHat/CentOS, Debian, Gentoo, SuSE/OpenSuSE)
* [Open]Solaris
* FreeBSD
* NetBSD
* OpenBSD
* DragonflyBSD
* Cygwin (no support for DSO yet)
* AIX (no support for DSO yet)

Please read all the README, INSTALL and other available documentation (http://ganglia.wiki.sourceforge.net) as a lot of things have changed since 3.0.7.

Use good deployment practices when upgrading from 3.0.x to make sure that you do not mix gmond 3.0 and 3.1 nodes in the same cluster (as defined by a multicast address or unicast collector node). The protocol that allows gmond nodes to communicate within the same cluster has changed. However, the XML packets that are passed between gmond and gmetad have remained compatible from 3.0.x to 3.1.x, allowing a 3.0.x gmetad to continue to pull data from a newer 3.1.x gmond cluster.

happy testing
Re: [Ganglia-general] [Ganglia-developers] [ANNOUNCEMENT] Ganglia 3.1.0 tarball ready for testing...
On 7/29/2008 at 10:06 AM, in message [EMAIL PROTECTED], Marc Van Kerkhoven1 [EMAIL PROTECTED] wrote:

Hi Brad, One minor bug would be that the gmetrics link is no longer visible in the host view. Not sure if this is because I have done anything wrong, but it's pretty much a vanilla install. Not sure if this is intentional with the introduction of the python modules. kind regards, Marc van Kerkhoven

Thanks for installing and testing the 3.1 testing tarball. Removing the gmetric link from the host view was intentional. In Ganglia 3.1, everything is reported as a metric rather than differentiating between a base metric vs a module metric vs a gmetric. There was some discussion on the mailing list about this at http://www.mail-archive.com/[EMAIL PROTECTED]/msg04135.html . So if gmetric is used to feed gmond with extra metric data, the gmetric will just show up alongside the rest of the standard metrics.

Brad
Re: [Ganglia-general] [Ganglia-developers] [ANNOUNCEMENT] Ganglia 3.1.0 tarball ready for testing...
On 7/29/2008 at 11:18 AM, in message [EMAIL PROTECTED], Bernard Li [EMAIL PROTECTED] wrote:

Hi Brad:

On Tue, Jul 29, 2008 at 9:27 AM, Brad Nicholes [EMAIL PROTECTED] wrote: Thanks for installing and testing the 3.1 testing tarball. Removing the gmetric link from the host view was intentional. In Ganglia 3.1, everything is reported as a metric rather than differentiating between a base metric vs a module metric vs a gmetric. There was some discussion on the mailing list about this at http://www.mail-archive.com/[EMAIL PROTECTED]/msg04135.html . So if gmetric is used to feed gmond with extra metric data, the gmetric will just show up alongside the rest of the standard metrics.

Maybe we should document this as part of the changes for 3.1.0 in the Wiki?

Sounds like a good idea.

Brad
[Ganglia-general] Current testing cycle for Ganglia 3.1 release...
Just as a reminder, there is currently a testing release of Ganglia 3.1 available for immediate testing and feedback. This testing release is available at:

http://www.ganglia.info/testing/

Please see the previous announcement for more information.

http://www.mail-archive.com/[EMAIL PROTECTED]/msg04510.html

The testing release has been available for a week and our current goal is to finish up testing and validation over the next week. If any major bugs are discovered within this time period, we would like to address them and determine if a follow-up testing release and testing period is necessary. But we really need the community's help. Please download the testing release, install it on a staging machine and give it a try. Let us know how it goes and especially if you run into any major issues with installation or execution. Any and all feedback, positive or negative, will be welcome.

thanks,
Brad
Re: [Ganglia-general] gmetad giving high TN values
On 6/25/2008 at 12:13 PM, in message [EMAIL PROTECTED], Kirk McDonald [EMAIL PROTECTED] wrote: I have a gmetad which probes a number of gmonds, and each gmond has a number of hosts associated with it. When I scrape the XML from each of the gmonds probed by gmetad myself, the TN value for each host looks good (they average well under 10 seconds). However, when I scrape the XML from gmetad, the TN values for each host are much higher, enough so that it begins marking many of the hosts as down. I was wondering what could cause this to happen. One possibility is the time difference between the gmond nodes and the gmetad host. Gmetad will try to normalize all of the timestamps based on its own timestamp. If there is a big time difference between a gmond node and the gmetad host, the calculation will be skewed. Brad - Check out the new SourceForge.net Marketplace. It's the best place to buy or sell services for just about anything Open Source. http://sourceforge.net/services/buy/index.php ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] gmetad giving high TN values
On 6/25/2008 at 1:18 PM, in message [EMAIL PROTECTED], Kirk McDonald [EMAIL PROTECTED] wrote: On Wed, Jun 25, 2008 at 11:48 AM, Brad Nicholes [EMAIL PROTECTED] wrote: On 6/25/2008 at 12:13 PM, in message [EMAIL PROTECTED], Kirk McDonald [EMAIL PROTECTED] wrote: I have a gmetad which probes a number of gmonds, and each gmond has a number of hosts associated with it. When I scrape the XML from each of the gmonds probed by gmetad myself, the TN value for each host looks good (they average well under 10 seconds). However, when I scrape the XML from gmetad, the TN values for each host are much higher, enough so that it begins marking many of the hosts as down. I was wondering what could cause this to happen. One possibility is the time difference between the gmond nodes and the gmetad host. Gmetad will try to normalize all of the timestamps based on its own timestamp. If there is a big time difference between a gmond node and the gmetad host, the calculation will be skewed. Brad I do not think this is the problem. The problem becomes more and less apparent if I bring up or shut down gmond on portions of my hosts. If all of the gmonds on all of the hosts are up and running, the average TN creeps up and large portions of the grid are marked down. (The portions marked down appear to be more or less random.) If I shut down a significant portion of the gmonds, the average TN on gmetad drops, and the hosts are marked up. (That is, ignoring the TN on the machines I actually took down, which naturally rises continually.) I would expect a calculation error like that to be independent of the number of hosts being monitored. There is a definite correlation between average TN (as reported by gmetad) and the number of hosts being monitored. -Kirk Are all of your nodes in a single cluster? There may be some latency issues with the size of the XML that gmetad has to parse. 
If you create multiple clusters and reference them through different data sources, gmetad will create multiple threads which only have to parse a portion of the whole. If gmetad is on a multiproc box, the multiple threads can take better advantage of the CPUs rather than parsing everything sequentially.

Brad
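To compare what gmond and gmetad each report, it can help to pull the TN and TMAX attributes out of the XML with a small script rather than reading it by eye. The following is only a sketch under stated assumptions: the sample XML is invented, the helper names are made up, and the `factor=4` cutoff (host treated as stale when TN exceeds four times TMAX) is an assumption about how "down" is decided that should be checked against your own installation.

```python
import xml.etree.ElementTree as ET

# Invented sample in the general shape of gmond/gmetad XML output (illustration only).
SAMPLE_XML = """<GANGLIA_XML VERSION="3.1.0" SOURCE="gmond">
  <CLUSTER NAME="demo" LOCALTIME="1214420000">
    <HOST NAME="node1" IP="10.0.0.1" REPORTED="1214419995" TN="5" TMAX="20"/>
    <HOST NAME="node2" IP="10.0.0.2" REPORTED="1214419900" TN="100" TMAX="20"/>
  </CLUSTER>
</GANGLIA_XML>"""


def host_tn(xml_text):
    """Return {host name: (TN, TMAX)} for every HOST element in the XML."""
    root = ET.fromstring(xml_text)
    return {
        host.get('NAME'): (int(host.get('TN')), int(host.get('TMAX')))
        for host in root.iter('HOST')
    }


def stale_hosts(xml_text, factor=4):
    """List hosts whose TN exceeds factor * TMAX (the factor is an assumed threshold)."""
    return sorted(
        name
        for name, (tn, tmax) in host_tn(xml_text).items()
        if tn > factor * tmax
    )


if __name__ == '__main__':
    for name, (tn, tmax) in sorted(host_tn(SAMPLE_XML).items()):
        print('%s TN=%d TMAX=%d' % (name, tn, tmax))
    print('stale: %s' % ', '.join(stale_hosts(SAMPLE_XML)))
```

Feeding the same function the XML scraped from a gmond and then from gmetad makes it easy to see whether TN grows only on the gmetad side, and whether it shrinks after the cluster is split across several data sources.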
Re: [Ganglia-general] Compiling gmond c module
You will need to figure out where the u_short conflict is coming from. My first guess would be to use gcc rather than g++.

Brad

On 6/7/2008 at 8:35 PM, in message [EMAIL PROTECTED], Fábio Firmo [EMAIL PROTECTED] wrote:

Hi everyone, I'm about to introduce Ganglia in a project to take care of the monitoring of clusters. We need to know the CPU and memory usage for some specific processes, so I'm trying to write a C module for gmond to do this. I've installed the ganglia 3.1.0.1361 snapshot (devel, libganglia and gmond), but mod_example doesn't compile. Here is the error:

$ g++ mod_example.c -I/usr/include/apr-1/ #there's a problem finding apr.h, so the need for -I
In file included from /usr/include/gm_metric.h:7,
                 from mod_example.c:33:
/usr/include/gm_protocol.h:73: error: declaration of 'u_short Ganglia_gmetric_ushort::u_short'
/usr/include/sys/types.h:36: error: changes meaning of 'u_short' from 'typedef __u_short u_short'
/usr/include/gm_protocol.h:94: error: declaration of 'u_int Ganglia_gmetric_uint::u_int'
/usr/include/sys/types.h:37: error: changes meaning of 'u_int' from 'typedef __u_int u_int'
mod_example.c:117: warning: deprecated conversion from string constant to 'char*'
mod_example.c:117: warning: deprecated conversion from string constant to 'char*'
mod_example.c:117: warning: deprecated conversion from string constant to 'char*'
mod_example.c:117: warning: deprecated conversion from string constant to 'char*'
mod_example.c:117: warning: deprecated conversion from string constant to 'char*'
mod_example.c:117: warning: deprecated conversion from string constant to 'char*'
mod_example.c:117: warning: deprecated conversion from string constant to 'char*'
mod_example.c:117: warning: deprecated conversion from string constant to 'char*'
mod_example.c:117: warning: deprecated conversion from string constant to 'char*'
mod_example.c:117: warning: deprecated conversion from string constant to 'char*'
mod_example.c:119: error: redefinition of 'mmodule example_module'
mod_example.c:42: error: 'mmodule example_module' previously declared here

Can someone help me? Thanks in advance, Fábio
Re: [Ganglia-general] SPOOF option
On 6/18/2008 at 7:39 AM, in message [EMAIL PROTECTED], LINDA DOBAI [EMAIL PROTECTED] wrote:

Hi, These days I have been testing the 3.1.0.1399 release of Ganglia, mostly the Python modules feature. I managed to plug several metrics into Ganglia using the new feature and it works very well. I saw that SPOOF_HOST and SPOOF_NAME aren't attributes of the metric descriptors, but could be attached to the metric as extra data. How could I attach this option to the metric? Is it possible in the 3.1.0.1399 release of Ganglia? Could I have more information about this?

It isn't possible to create a spoofing metric module in the 3.1 code yet. This is functionality that I just recently added to the trunk source tree but since it touched so many files, I didn't want to take a chance of destabilizing the 3.1 branch. If you pull and build the trunk source tree, you should be able to build a spoofing metric module by simply adding the SPOOF_HOST and SPOOF_NAME to the metric description for a Python module, or adding the same data as extra data for a C module.

Python example:

    d1 = {'name': 'my_spoof_metric',
          'call_back': my_spoof_handler,
          'time_max': 90,
          'value_type': 'uint',
          'units': '%',
          'slope': 'both',
          'format': '%u',
          'description': 'Some spoofed metric',
          'groups': 'spoof',
          'spoof_host': 'spoof_ip_address:spoof_host_name',  # Same format as the gmetric -S option
          'spoof_name': 'alternate_metric_name'}  # Optionally specify an alternate metric name

Hope this helps,
Brad
Re: [Ganglia-general] UUID status?
On 6/13/2008 at 1:08 AM, in message [EMAIL PROTECTED], Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote:

On Tue, Jun 10, 2008 at 04:59:57PM -0600, Brad Nicholes wrote: it can be solved using Ganglia 3.1 and the new gmetad-python rewrite.

does it mean that you are planning on adding a backport of gmetad-python to the 3.1 release?

No, not unless there is a real demand for it. The gmetad-python version is intended for a 3.2 or later release. It could also be released in the meantime as its own subproject or snapshot if there is a demand for that.

Carlo

PS. the UUID feature will also need changes in the frontend that don't exist yet.
Re: [Ganglia-general] UUID status?
On 6/10/2008 at 4:45 PM, in message [EMAIL PROTECTED], Michael Place [EMAIL PROTECTED] wrote:

Hi all, The ganglia wish list at http://ganglia.wiki.sourceforge.net/ganglia_wish-list lists the following gmetad todo:

* Name RRD directories based on UUID generated by client gmond

Can anybody report on the status of this feature? It would be extremely useful in our implementation.

Nothing has been done for this feature in the current implementation of gmetad. However, the rewrite of gmetad in python will be able to accommodate this type of feature. What I mean by accommodate is that the gmetad-python version has implemented a pluggable interface and uses an RRD plugin to write metric data to RRD files. This means that rather than having the RRD functionality built into gmetad itself, the RRD functionality can be plugged and replaced. Most people would probably want to use the standard RRD plugin. You or anybody else that wanted it could replace the standard RRD plugin with one that has been modified to create directories using the UUID rather than host names.

As far as gathering a UUID from a host, this can be done already by simply implementing a UUID metric module. Then when the gmetad-python RRD storage module sees the UUID metric for a host, that is what it would use for the directory name.

So the short answer is that no direct work has been done for this wish list item. However, it can be solved using Ganglia 3.1 and the new gmetad-python rewrite.

Brad
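As a rough illustration of the "UUID metric module" idea, a Python module along these lines could report a stable per-host identifier as a string metric. This is a sketch, not anything that ships with Ganglia: the metric name `host_uuid`, the optional `hostname` parameter, and the choice of deriving the value with `uuid.uuid5` from the host name are all assumptions made for the example. A site might prefer a hardware or vendor-assigned UUID instead.

```python
# host_uuid.py -- hypothetical gmond Python metric module reporting a host UUID.
import socket
import uuid

_hostname = None  # filled in by metric_init()


def host_uuid(name):
    """Callback: return a UUID string derived deterministically from the host name."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, _hostname))


def metric_init(params):
    """Called once at load time; 'hostname' is a made-up optional module parameter."""
    global _hostname
    _hostname = params.get('hostname') or socket.getfqdn()
    return [{
        'name': 'host_uuid',
        'call_back': host_uuid,
        'time_max': 1200,
        'value_type': 'string',
        'units': '',
        'slope': 'zero',
        'format': '%s',
        'description': 'UUID identifying this host (derived from its name)',
        'groups': 'system',
    }]


def metric_cleanup():
    """Nothing to release in this sketch."""
    pass


if __name__ == '__main__':
    for d in metric_init({}):
        print('%s = %s' % (d['name'], d['call_back'](d['name'])))
```

Because the value is derived from the name, it stays the same across gmond restarts, which is what a storage plugin would need if it were to name directories after it.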
Re: [Ganglia-general] Ganglia 3.1.0 package
On 6/10/2008 at 11:17 AM, in message [EMAIL PROTECTED], Bernard Li [EMAIL PROTECTED] wrote:

Hi Stephen:

On Tue, Jun 10, 2008 at 10:07 AM, Big Woobie [EMAIL PROTECTED] wrote: I'm running Redhat Linux and IBM's AIX.

AIX I can't help you (I think Ulf is trying to get that working). It is possible to install on RHEL 4.x and 5.x (and clones). If you let me know which version you are on, I should be able to provide you with the RPM dependencies, provided that your arch is x86.

This kind of information would also be good on our wiki FAQ page. We can build up links to libraries as needed. Many of the Linux OS distros already include most if not all of the components.

Brad
Re: [Ganglia-general] Plugging Metrics in 3.1.0 release
On 6/6/2008 at 2:28 AM, in message [EMAIL PROTECTED], LINDA DOBAI [EMAIL PROTECTED] wrote:

Thank you very much for your responses. As OS I am using Linux RedHat 5 32-bit. As Ganglia version, I installed the latest version that I found at the following URL: http://www.ganglia.info/snapshots/3.1.x/ which means Ganglia 3.1.0.1361.

How did I install it? I downloaded Ganglia 3.1.0.1361. I saw that Ganglia needs some other libraries, so I used the Yum tool to install the expat and apr libraries. I also needed to download and install the confuse library and RRDtool-1.2.27. After the installation, everything that worked in Ganglia 3.0.7 also worked correctly in this new version. I don't know why modpython.so wasn't generated.

Configure options that I used: ./configure --enable-static-build --enable-python --with-gmetad

Is there a reason why you need to include --enable-static-build? If not, then don't use this option and let Ganglia build and link everything dynamically. Although the metric modules such as mod_python will work statically linked, they are intended to be dynamically linked for flexibility.

Brad
Re: [Ganglia-general] Plugging Metrics in 3.1.0 release
On 6/5/2008 at 9:20 AM, in message [EMAIL PROTECTED], LINDA DOBAI [EMAIL PROTECTED] wrote:

Dear Ganglia community: I am a beginner in Ganglia. I have just started a four-month internship and my subject is related to Ganglia. I have to test the new highlights of release 3.1.0. I have just installed Ganglia 3.1.0 and I managed to compile it in our environment. I am now trying to analyze the module support for dynamically plugging metrics into gmond and I am having some problems. I can't manage to generate modpython.so in our environment. I found the sources in gmond/modules/python but I don't see how I could generate modpython.so; the only libs generated by the Ganglia tool chain are mod_python.o, mod_python.lo and libmodpython.la. I found a version of modpython.so at the official website of Ganglia, but it doesn't fit our environment. Is it possible to generate it myself, in my environment? If yes, how could I do this? And can I have some other details about this new feature? I don't understand very well how this feature will function.

Which OS are you building for and what were the ./configure options that you used? This will help to determine why mod_python might not be building. Also there is additional information about metric module development in the README file here http://ganglia.svn.sourceforge.net/viewvc/ganglia/trunk/monitor-core/gmond/modules/python/README?revision=1103&view=markup as well as examples of both C and Python modules here http://ganglia.svn.sourceforge.net/viewvc/ganglia/trunk/monitor-core/gmond/modules/example/ and http://ganglia.svn.sourceforge.net/viewvc/ganglia/trunk/monitor-core/gmond/python_modules/example/ . There is also a Ganglia presentation that describes the 3.1 version and how to write and install metric modules here http://ganglia.wiki.sourceforge.net/space/showimage/ApacheconUS2007_ganglia_monitoring.ppt . Let us know how your testing goes.
We would like to make an official release of 3.1 soon but that depends on community members like yourself, testing the code. thanks, Brad - Check out the new SourceForge.net Marketplace. It's the best place to buy or sell services for just about anything Open Source. http://sourceforge.net/services/buy/index.php ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] Plugging Metrics in 3.1.0 release
On 6/5/2008 at 2:15 PM, in message [EMAIL PROTECTED], Bernard Li [EMAIL PROTECTED] wrote:

On Thu, Jun 5, 2008 at 12:32 PM, Brad Nicholes [EMAIL PROTECTED] wrote: Which OS are you building for and what were the ./configure options that you used? This will help to determine why mod_python might not be building. Also there is additional information about metric module development in the README file here http://ganglia.svn.sourceforge.net/viewvc/ganglia/trunk/monitor-core/gmond/modules/python/README?revision=1103&view=markup as well as examples of both C and Python modules here http://ganglia.svn.sourceforge.net/viewvc/ganglia/trunk/monitor-core/gmond/modules/example/ and http://ganglia.svn.sourceforge.net/viewvc/ganglia/trunk/monitor-core/gmond/python_modules/example/ . There is also a Ganglia presentation that describes the 3.1 version and how to write and install metric modules here http://ganglia.wiki.sourceforge.net/space/showimage/ApacheconUS2007_ganglia_monitoring.ppt .

I also started writing an entry in the wiki about the Python modules. Brad if you could go over it and correct any errors, that'd be great: http://ganglia.wiki.sourceforge.net/ganglia_gmond_python_modules Cheers, Bernard

Done. Hopefully this will be a good guide for somebody that is getting started with Python modules. We still need to write an equivalent C module doc. Brad
Re: [Ganglia-general] Plugging Metrics in 3.1.0 release
On 6/5/2008 at 4:48 PM, in message [EMAIL PROTECTED], Bernard Li [EMAIL PROTECTED] wrote:

Hi Brad: On Thu, Jun 5, 2008 at 3:23 PM, Brad Nicholes [EMAIL PROTECTED] wrote: Done. Hopefully this will be a good guide for somebody that is getting started with Python modules. We still need to write an equivalent C module doc.

Thanks for the updates. BTW I am curious about the 'modules' section in the .pyconf -- is it really necessary? My Python module has been running fine without that section (it only has the 'collection_group' section). I also have a question regarding the metric_init parameters. Is 'format' really necessary since we could probably determine it from the 'value_type' -- i.e. value_type = uint, format = %u; value_type = string, format = %s, etc. Cheers, Bernard

The modules section is one of those things that is becoming more important as functionality grows. Initially for a python metric module the modules section was really unnecessary. The name directive is really not used if there are no module parameters, however there is some validation taking place against the language directive. Also if your module requires any kind of configuration parameters, the only way to specify the parameters is through a modules section. So even though your python module might run fine without a modules section today, it is probably a good idea to get used to adding it to the pyconf file just to make sure that you don't run into problems in the future. Brad
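As an illustration of the 'format' question above, here is a minimal sketch of a Python metric module that fills in 'format' from 'value_type'. The descriptor keys are the ones the module README describes for metric_init(); the metric name, the cpu_count callback, the describe() helper and the DEFAULT_FORMATS table are hypothetical, and the %f entries are an assumption beyond the uint/string pairs mentioned in the thread. gmond itself still receives an explicit 'format' in each descriptor.

```python
import os

# Hypothetical mapping from value_type to a printf-style format.
# uint -> %u and string -> %s come from the thread; %f is assumed.
DEFAULT_FORMATS = {
    "string": "%s",
    "uint": "%u",
    "float": "%f",
    "double": "%f",
}


def cpu_count(name):
    """Callback: gmond passes the metric name and expects the value back."""
    return os.cpu_count() or 1


def describe(name, call_back, value_type, units="", description=""):
    """Build a metric descriptor, deriving 'format' from 'value_type'."""
    return {
        "name": name,
        "call_back": call_back,
        "time_max": 90,
        "value_type": value_type,
        "units": units,
        "slope": "both",
        "format": DEFAULT_FORMATS[value_type],
        "description": description,
    }


def metric_init(params):
    """Called once at load time; returns the list of metric descriptors."""
    return [describe("py_cpu_count", cpu_count, "uint", "CPUs", "Number of CPUs")]


def metric_cleanup():
    """Called when the module is unloaded; nothing to release here."""
    pass
```

The helper keeps the two fields consistent in one place, which is the effect the suggestion was after, without requiring any change to gmond.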
Re: [Ganglia-general] Plugging Metrics in 3.1.0 release
On 6/5/2008 at 5:30 PM, in message [EMAIL PROTECTED], Bernard Li [EMAIL PROTECTED] wrote:

Hi Brad: On Thu, Jun 5, 2008 at 4:25 PM, Brad Nicholes [EMAIL PROTECTED] wrote: The modules section is one of those things that is becoming more important as functionality grows. Initially for a python metric module the modules section was really unnecessary. The name directive is really not used if there are no module parameters, however there is some validation taking place against the language directive. Also if your module requires any kind of configuration parameters, the only way to specify the parameters is through a modules section. So even though your python module might run fine without a modules section today, it is probably a good idea to get used to adding it to the pyconf file just to make sure that you don't run into problems in the future.

My suggestion would be to add code to make sure that the .pyconf file has the correct sections and parameters. You can't always expect the users to do the right thing :-)

There really isn't any way to do that kind of validation. Libconfuse will validate that a given directive is syntactically correct, but there isn't a way to validate that a modules section exists for each module, especially a python module. If a module section does not exist for a C module, the C module will never load and gmond will fail if a collection group contains a metric that is not supported by any loaded module. However a python module is different because it is run through mod_python. Basically gmond has no idea that a python module even exists. It thinks that all of the python metrics belong to mod_python which is a C module. Mod_python loads any .py file that it finds in the python module directory and then queries each one for metric definitions which is a completely different code path than for a C module. Brad
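The load-then-query behaviour described above can be pictured with a short Python sketch. This is not mod_python's actual code (mod_python is a C module); it only mimics the described steps of scanning a directory for .py files and asking each one for its metric definitions. The function name load_metric_modules and the params_by_module argument are hypothetical.

```python
import importlib.util
import os


def load_metric_modules(module_dir, params_by_module=None):
    """Import every .py file in module_dir and collect its metric descriptors.

    Returns a dict mapping metric name -> (module name, descriptor).
    """
    params_by_module = params_by_module or {}
    metrics = {}
    for filename in sorted(os.listdir(module_dir)):
        if not filename.endswith(".py"):
            continue  # only Python source files are considered
        mod_name = filename[:-3]
        path = os.path.join(module_dir, filename)
        spec = importlib.util.spec_from_file_location(mod_name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        if not hasattr(module, "metric_init"):
            continue  # not a metric module
        # Parameters would come from a matching 'modules' section, if any.
        params = params_by_module.get(mod_name, {})
        for descriptor in module.metric_init(params):
            metrics[descriptor["name"]] = (mod_name, descriptor)
    return metrics
```

Note that nothing in this path needs a modules section to exist, which matches the observation that a Python module can run without one; the section only matters once parameters have to be passed in.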
Re: [Ganglia-general] gmetad.conf for large number of nodes
On 5/27/2008 at 8:47 PM, in message [EMAIL PROTECTED], randy [EMAIL PROTECTED] wrote:

Brad Nicholes wrote: Is there a reason why you would want to list all 120 nodes in the data_source directive of gmetad? When you list multiple nodes in a data_source directive, it does not mean that gmetad is pinging all of them for data. It simply means that if gmetad can not get data from the first one in the list, it tries the next one.

Only because I thought I needed to. I'm doing the multicast, but I just thought that all the nodes had to be listed.

In the Ganglia architecture, data is pushed from a monitored node to either a multicast channel or to a single gmond master node. A gmetad data_source simply references one node that is listening on a multicast channel or the single gmond master node in the case of unicast. There should never be a need to have more than just a few nodes listed in a data_source.

So if all 120 nodes are talking on the multicast address, and gmond is running on a different node on the same net, I can get by with just giving localhost as the data_source? Or any one (or more) of the data nodes? I think that's how I left it today, and I was seeing reports from 30 of the nodes and the other 90 were listed as down. The gmond machine is also serving the webpages, and all machines can see each other on the network. Appreciate the help, thanks Brad! randy

I would suggest that you break up your 120 nodes into separate clusters on different multicast channels. Then you would have a different data source for each cluster. By putting them all on the same channel, every gmond agent is required to store all of the metric data for all 120 nodes. That is a huge waste of memory. I would suggest breaking them up into smaller clusters. If that doesn't work for you, then you might want to move to unicast rather than multicast. In unicast mode all of your gmond agents talk to one or more gmond master nodes directly rather than on a channel. Then each of the master gmond nodes becomes a separate data source. Search back through the email list archive for similar questions. Optimal Ganglia architectures have been discussed previously including multicast vs unicast. Brad
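To make the "one data source per cluster" suggestion concrete, here is a minimal Python sketch that only formats gmetad.conf data_source lines. The cluster and host names are made up (modeled on the rack1node1 naming in the question), port 8649 is assumed to be the gmond port in use, and listing two hosts per cluster is just an example of "a few nodes" for failover.

```python
def data_source_lines(clusters, port=8649, max_hosts=2):
    """Format one data_source line per cluster.

    clusters maps a cluster name to its list of hosts; only the first
    max_hosts are listed, since the extra entries act as fallbacks.
    """
    lines = []
    for name, hosts in clusters.items():
        targets = " ".join(f"{host}:{port}" for host in hosts[:max_hosts])
        lines.append(f'data_source "{name}" {targets}')
    return lines


# Hypothetical layout: 120 nodes split into 4 clusters of 30.
example = {
    f"rack{r}": [f"rack{r}node{n}" for n in range(1, 31)] for r in range(1, 5)
}
for line in data_source_lines(example):
    print(line)
```

Each cluster would also need its own multicast channel (or its own unicast master) in the gmond configuration; this sketch covers the gmetad side only.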
Re: [Ganglia-general] gmetad.conf for large number of nodes
On 5/26/2008 at 7:18 AM, in message [EMAIL PROTECTED], randy [EMAIL PROTECTED] wrote:

I'm trying to configure ganglia (3.0.7) to monitor 120 nodes. It works fine if I just enter a small number of nodes as data_source in the gmetad.conf file, just like all the documentation shows. But if I try to enter too many nodes, gmetad segfaults at startup. That is, data_source cloud rack1node1 rack1node2 works just fine. But a data_source entry with all 120 nodes listed segfaults. I've also tried using a backslash-newline to break the line apart with the same result. My last attempt used multiple lines, all with the same cluster name (cloud). I have about 10 lines like that, as many as it took to list all the nodes. That doesn't segfault, but it only shows some of the nodes (the last 30, which span multiple data_source entries). Admittedly it was Friday afternoon when I tried this, so I didn't spend too much time debugging. But it's been bothering me all weekend. I haven't seen any examples/docs with more than a couple nodes listed.

Is there a reason why you would want to list all 120 nodes in the data_source directive of gmetad? When you list multiple nodes in a data_source directive, it does not mean that gmetad is pinging all of them for data. It simply means that if gmetad can not get data from the first one in the list, it tries the next one. In the Ganglia architecture, data is pushed from a monitored node to either a multicast channel or to a single gmond master node. A gmetad data_source simply references one node that is listening on a multicast channel or the single gmond master node in the case of unicast. There should never be a need to have more than just a few nodes listed in a data_source. Brad
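The failover meaning of a multi-host data_source can be sketched in Python as follows. This is not gmetad's code, only an illustration of "try the first, then the next": it assumes each listed gmond sends its XML dump as soon as a TCP client connects and then closes the connection. The function name fetch_first_available is hypothetical.

```python
import socket


def fetch_first_available(sources, timeout=10.0):
    """Return (host, xml_bytes) from the first source that answers.

    sources is an ordered list of (host, port) pairs; later entries are
    only contacted when the earlier ones fail.
    """
    last_error = None
    for host, port in sources:
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                chunks = []
                while True:
                    data = sock.recv(65536)
                    if not data:
                        break  # peer closed the connection: dump complete
                    chunks.append(data)
                return host, b"".join(chunks)
        except OSError as exc:
            last_error = exc  # remember the failure and try the next host
    raise ConnectionError(f"no data source answered: {last_error}")
```

With this reading, listing all 120 nodes buys nothing: in the normal case only the first entry is ever contacted, and it already reports every host it has heard on the channel.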
Re: [Ganglia-general] Migrating existing RRD's to a new server;
This looks like a useful script. Can we add it to the contrib area in the Ganglia repository? Brad

On 5/21/2008 at 9:51 AM, in message [EMAIL PROTECTED], Jason A. Smith [EMAIL PROTECTED] wrote:

A few years ago I had put a script on ganglia's bugzilla that modifies the rrd files to do a few simple things, like change the heartbeat value and change the number of RRAs, see: http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=33 Since we are in the process of moving our gmetad server from an old 32-bit server to a new 64-bit server, I also needed the ability to move the rrd files, which meant dumping and restoring them, so I added this feature to the old script that I had from before. It is attached to this email in case anyone else might find it useful. ~Jason

On Wed, 2008-05-21 at 08:10 -0700, Witham, Timothy D wrote:

There may be a workaround though: if you use 'rrdtool dump', you will get a (large) XML file with all of the data. You should be able to then use 'rrdtool restore' to read this back into the new .rrd file.

But since you probably have hundreds or thousands of rrd files, you need some automation. So use a script like the below saved as dumprestore. Run

    dumprestore dump | /bin/sh

from your rrd_rootdir on the old system. This will write a .xml for each .rrd. Then rsync all the .xml files to the new machine and run

    dumprestore restore | /bin/sh

After you confirm it is working you can

    find . -name '*.xml' -exec rm -f {} \;

to clean up. -twitham

    #!/usr/bin/perl
    use warnings;

    sub oops { die "$0 dump|restore\n"; }

    my $what = shift @ARGV or oops;
    my $in = 0;
    $in = 'rrd' if $what eq 'dump';       # dump reads .rrd files
    $in = 'xml' if $what eq 'restore';    # restore reads .xml files
    oops unless $in;

    open PIPE, "find . -name '*.$in' -print |" or die $!;
    while (<PIPE>) {
        chomp;
        my $in = $_;
        my $out = $in;
        $out =~ s/\.rrd$/\.xml/ or $out =~ s/\.xml$/\.rrd/;
        # dump writes to stdout, so redirect it; restore takes both names
        print "rrdtool $what $in ", $out =~ /xml$/ ? '> ' : '', "$out\n";
        # keep the original modification time on the new file
        print "touch -r $in $out\n";
    }
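For readers who prefer Python, the same dump/restore approach could be sketched roughly as below. It is a loose equivalent of the Perl script quoted above, not a replacement blessed by anyone on the list: it only generates the shell commands ('rrdtool dump', 'rrdtool restore' and 'touch -r', as in the original), and the function name migration_commands is hypothetical. Paths are shell-quoted, which the original script does not do.

```python
import os
import shlex


def migration_commands(root, action):
    """Yield shell commands that dump .rrd files to .xml or restore them."""
    if action not in ("dump", "restore"):
        raise ValueError("action must be 'dump' or 'restore'")
    src_ext, dst_ext = (".rrd", ".xml") if action == "dump" else (".xml", ".rrd")
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(src_ext):
                continue
            src = os.path.join(dirpath, name)
            dst = src[: -len(src_ext)] + dst_ext
            q_src, q_dst = shlex.quote(src), shlex.quote(dst)
            if action == "dump":
                # rrdtool dump writes XML to stdout, so redirect it
                yield f"rrdtool dump {q_src} > {q_dst}"
            else:
                yield f"rrdtool restore {q_src} {q_dst}"
            # copy the source file's timestamps onto the new file
            yield f"touch -r {q_src} {q_dst}"
```

Printing each yielded command and piping the output to /bin/sh reproduces the workflow described above: dump on the old server, copy the .xml files over, restore on the new one.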
Re: [Ganglia-general] strange problem - gmond on headnode reportsdifferent data than sources
On 5/13/2008 at 11:34 PM, in message [EMAIL PROTECTED], Jeremy LaTrasse [EMAIL PROTECTED] wrote:

I changed our configs over to unicast, which has seemingly eliminated most of our problems, except one egregious one, and the log files are still being filled with "illegal attempt to update using time 1210742641 when last update time is 1210742641 (minimum one second step)". The problem seems to be that the gmetad is not able to get new information from headnodes. In the case of 1210742641, the node had not reported to the headnode for 54 sec, and therefore rrdtool on the gmetad could not be expected to update a file with the same information. Output from headnodes for that node confirms.

<HOST NAME="HOSTNAME.twitter.com" IP="X.X.X.X" REPORTED="1210742641" TN="54" TMAX="20" DMAX="86400" LOCATION="1" GMOND_STARTED="1210742641">

My question now is, why would gmond not be reporting for 54 sec? The load on the machines that are taking longer than 20 sec to report consistently is lower than others in the cluster who report far more frequently.

Have you run gmond on that particular node in debug mode to verify that it is gathering and sending data correctly? Have you hit that node directly on port 8649 to make sure that it is generating correct XML output? Have you run your head node gmond in debug mode to verify that it is only receiving packets from the problem node every 54 seconds? Just trying to narrow down whether it is a problem with the node gmond, the head node gmond or gmetad or something in-between.

Next, how can I change HOST TMAX if necessary? I've read the gmond.conf man page, the wiki, etc... seems like only location is configurable for host in the gmond.conf.

You can't. For some reason TMAX is hardcoded to 20. I'm not sure why. The only way to change it would be to change the source and rebuild gmond. Brad

Again, system time is synchronized across all these machines to within .04 seconds.
Jeremy On Tue, May 13, 2008 at 3:50 PM, Bernard Li [EMAIL PROTECTED] wrote: Hi Jeremy: On Tue, May 13, 2008 at 1:49 PM, Jeremy LaTrasse [EMAIL PROTECTED] wrote: Where should I be going for comprehensive documentation that describes each of the stanzas in both gmond and gmetad config files, is there one standard document? I can't find one in the sourceforge wiki. For Ganglia 3.0.x, man gmond.conf is your best bet. I checked and it talks about unicast. For gmetad -- the configuration options are pretty straightforward, and the comments in the standard gmetad.conf should be fairly self-explanatory. Cheers, Bernard - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
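When checking a head node's XML output for late reporters, a small Python sketch like the following may help. It reads the TN and TMAX attributes of each HOST element and lists hosts whose TN exceeds TMAX. The sample document is made up (modeled on the HOST line quoted above), the function name stale_hosts and the factor parameter are hypothetical, and the reading of TN as "seconds since the host was last heard from" is my understanding rather than something stated in the thread.

```python
import xml.etree.ElementTree as ET

# Made-up sample modeled on the HOST line quoted in the thread.
SAMPLE = """<GANGLIA_XML VERSION="3.0.7" SOURCE="gmond">
<CLUSTER NAME="example" LOCALTIME="1210742695">
<HOST NAME="node1.example.com" IP="10.0.0.1" REPORTED="1210742641" TN="54" TMAX="20" DMAX="86400" LOCATION="1" GMOND_STARTED="1210742641"/>
<HOST NAME="node2.example.com" IP="10.0.0.2" REPORTED="1210742690" TN="5" TMAX="20" DMAX="86400" LOCATION="1" GMOND_STARTED="1210000000"/>
</CLUSTER>
</GANGLIA_XML>"""


def stale_hosts(xml_text, factor=1.0):
    """Return (name, TN, TMAX) for hosts whose TN exceeds factor * TMAX."""
    root = ET.fromstring(xml_text)
    stale = []
    for host in root.iter("HOST"):
        tn = int(host.get("TN", "0"))
        tmax = int(host.get("TMAX", "0"))
        if tn > factor * tmax:
            stale.append((host.get("NAME"), tn, tmax))
    return stale


for name, tn, tmax in stale_hosts(SAMPLE):
    print(f"{name}: last heard {tn}s ago, TMAX is {tmax}s")
```

Running the same check against the node's own gmond and against the head node's gmond is one way to narrow down where the 54-second gap is introduced.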
Re: [Ganglia-general] Ganglia reported wrong OS...
On 5/12/2008 at 9:09 AM, in message [EMAIL PROTECTED], Tom Pierce [EMAIL PROTECTED] wrote: Dear Ganglia Users, I upgraded some cluster nodes, from 32 bit OS RHEL4 to 64 bit RHEL5.1, but the ganglia node monitor still seems to remember that the old node (with the same name) was x86 From Ganglia: Software OS: Linux 2.6.18-53.el5 (x86) logging into the node, and using uname -m x86_64- the correct installation.. I deleted the rrd file for this node, but x86 was not stored there... So how can I fix this? Or is it a ganglia bug for going from 32 bits to 64 bits on the same nodename? --- Thanks Tom Did you restart Gmetad? Since it really doesn't make any sense to store constant metrics in an RRD file, all of the constant metrics are stored and reported from memory by Gmetad. I believe that constant metrics are refreshed occasionally and eventually the correct constant metric will show up. However if you want to see it immediately, you will probably have to restart Gmetad. Brad - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
[Ganglia-general] Time to produce a 3.1 beta
The list has quieted down over the last week or so since we released the 3.1 snapshot. This either means that people are busy testing the 3.1 snapshot and haven't had time to respond yet or that things are good and there just isn't much to report. The STATUS file contains one back port proposal that still needs another vote but other than that, I am thinking that it is time for us to produce Ganglia 3.1.1 RC1. Keeping in mind the "Release Early, Release Often" motto, it's time to get a release out there in the next few weeks. Brad
Re: [Ganglia-general] multiple gmetads polling single gmond
Gmond is single threaded. However Gmetad is not when it produces the XML dump. Would it be possible for you to use the Gmetad port rather than hitting Gmond directly? If you hit the Gmetad interactive port you could request data for any of your individual clusters from your script. Multi-threading Gmond output would be a nice thing to have. I guess it hasn't really been an issue in the past because most people only hit Gmond from a single client. Brad Ben Hartshorne [EMAIL PROTECTED] 04/25/08 12:35 PM Hi, I have a rather large set of machines I have ganglia watch (~6000), and am trying to build out a resilient infrastructure. I ran into an interesting problem. I am using gmond version 3.0.2.200511011714 (as reported by --version) Basic layout - each location (~2000 machines) has a pair of hosts to which they send their metrics (unicast). There are a pair of machines that connect to gmond on each of the edge collectors and centralize the data (they connect via TCP to port 8649). We also have another pair of machines that connect to each edge gmond and grab the current XML dump for integration with Nagios (the script is called parse_ganglia for future reference). This worked nicely for quite a while, until one of our edge hosts got too many reportees. There was a connection timeout in parse_ganglia of 5 seconds, so that when one of the edge hosts was down it would move on to the other edge hosts quickly rather than waiting 60s for the down host. When one of the hosts got too many reportees, it started to take ~6s to transfer all the data. At this point, one or the other of the pair of hosts running parse_ganglia started failing on the edge host that had too many reportees. Using tcpdump, I found that though gmond was accepting the connection from both of them, it would only send data to one at a time, and it complete sending data to the first before moving on to the second. 
so: * host a connects * host a starts getting data * host b connects (3-way handshake complete) but no data flows * host a finishes sending data * host b starts getting data * host b finishes getting data We solved the immediate problem by increasing the timeout from 5 to 15s., but I was a little surprised that gmond behaved in this seemingly-single-threaded manner. While it's easy for us to adjust the timeout in our python parse_ganglia, it is not so easy to poke at gmetad, and I am worried about what will happen when we have variations in network quality, more hosts requesting metrics, etc. Is it true that gmond is single threaded in its network operations? Or maybe just the listener? What other effects might this have? Would it make sense to change gmond so it passes off dumping the XML feed to a child thread so that multiple simultaneous connections can be handled? Thanks for your time, -ben -- Ben Hartshorne email: [EMAIL PROTECTED] http://ben.hartshorne.net - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
[Ganglia-general] Platform experts needed (was:Re: [Ganglia-developers] Ganglia 3.1.x stable branch has been created...)
So here is another request to all you platform experts out there. The Ganglia project will be rolling alpha tarballs of the Ganglia 3.1 version. If the tarball does not work on your platform, please fix it and submit a patch back to the project. Ganglia 3.0.x already works on a variety of platforms and we would like to see 3.1.x work on an equal or greater number. But we need platform experts to make this happen. Here is your chance to jump in and help make Ganglia 3.1 the best release ever. Brad On 4/18/2008 at 4:00 AM, in message [EMAIL PROTECTED], Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote: On Thu, Apr 17, 2008 at 03:10:37PM -0700, Bernard Li wrote: I haven't been following issues regarding building on non-Linux architectures like *BSD, Cygwin, Mac OSX, AIX, *BSD used to be able to build and most likely still does. Cygwin builds only as an static build and runs. Mac OSX (older than 10.5) and HPUX could be patched to build but won't work; the rest are unknown but most likely not able to build. Carlo - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] Fwd: [Beowulf] Performance metrics reporting
On 4/11/2008 at 1:53 PM, in message [EMAIL PROTECTED], Witham, Timothy D [EMAIL PROTECTED] wrote: So I'd like to ask the Ganglia community -- do you guys find Ganglia to be a resource hog? No. But once I had a couple hundred gmetad processes on a 2GB server. When the size of active processes and RRD files in tmpfs exceeds physical memory, the server begins swapping and can't keep up with the needed polling intervals. Buy more memory or use more gmetad servers to solve this. :-) And I don't really like that the XML is huge and mostly redundant and gets even larger in 3.1. All gmetad needs is name and value to do correct metric rollups. All units and other attributes appear to be ignored, except by the frontend. It would be cool if, for example, we could define what a load_one is in one place, instead of thousands of machines reporting the same exact text every few seconds which seems like a waste of time and bandwidth. I understand the benefits of XML, but perhaps the standard static attributes could be defined in gmetad instead of gmond. This could reduce the XML size considerably and make it more efficient. But this would require a big change; just an idea to think about... I agree that the size of the XML could be reduced in most cases, however it would be impractical to define the metrics in gmeta. The reason why is because of the new metric pluggable modules in 3.1. Since gmond can be extended by plugging in metric modules, there would be no way for gmeta to know about every metric definition that could possibly exist. With the pluggable interface there is no longer just a fixed set of metrics. Any gmond could be gathering metrics about anything. However during the developers meeting in Feb. we talked about an idea where the XML would only contain deltas rather than always sending everything. Somebody just needs to figure out how to make it work. Brad Brad - This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. 
Re: [Ganglia-general] Fwd: [Beowulf] Performance metrics reporting
On 4/11/2008 at 4:09 PM, in message [EMAIL PROTECTED], Bernard Li [EMAIL PROTECTED] wrote:

Hi Brad: On Fri, Apr 11, 2008 at 3:04 PM, Brad Nicholes [EMAIL PROTECTED] wrote: I agree that the size of the XML could be reduced in most cases, however it would be impractical to define the metrics in gmetad. The reason is the new pluggable metric modules in 3.1. Since gmond can be extended by plugging in metric modules, there would be no way for gmetad to know about every metric definition that could possibly exist. With the pluggable interface there is no longer just a fixed set of metrics. Any gmond could be gathering metrics about anything.

How about reducing the amount of XML being sent from a gmetad to an upstream gmetad like what I suggested in this mail? http://www.mail-archive.com/[EMAIL PROTECTED]/msg03941.html Cheers, Bernard

Yes, I think we need something like that for gmetad. What I was thinking is to add another filter type, something like ?filter=delta, that would just project the delta since the last dump. If both gmond and gmetad had a way to reduce the XML by just producing deltas, I think that would speed up the XML parsing a lot and also reduce the writes on the RRDs. Brad
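The delta idea can be illustrated with a few lines of Python. This is purely a sketch of the concept: no ?filter=delta exists in the discussion beyond the proposal, and the snapshot layout used here (a dict keyed by host and metric name) and the function name metric_delta are assumptions made for the example.

```python
_MISSING = object()  # sentinel for "metric was not in the previous dump"


def metric_delta(previous, current):
    """Return only what changed between two metric snapshots.

    Both snapshots map (host, metric_name) -> value.
    """
    changed = {
        key: value
        for key, value in current.items()
        if previous.get(key, _MISSING) != value
    }
    removed = sorted(key for key in previous if key not in current)
    return {"changed": changed, "removed": removed}


prev = {
    ("node1", "load_one"): "0.10",
    ("node1", "os_name"): "Linux",
    ("node2", "load_one"): "0.50",
}
cur = {
    ("node1", "load_one"): "0.25",
    ("node1", "os_name"): "Linux",
    ("node3", "load_one"): "0.01",
}
print(metric_delta(prev, cur))
```

Constant metrics such as os_name drop out of every delta after the first dump, which is where most of the redundancy mentioned earlier in the thread would be saved. The open problem is the one already noted: the receiver has to keep the previous state and both sides have to agree on when a full dump is needed.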
Re: [Ganglia-general] [Ganglia-developers] Time to create the 3.1.x stable branch...
On 3/13/2008 at 3:46 PM, in message [EMAIL PROTECTED], Brad Nicholes [EMAIL PROTECTED] wrote:

On 3/13/2008 at 2:16 PM, in message [EMAIL PROTECTED], Jesse Becker [EMAIL PROTECTED] wrote: On Thu, Mar 13, 2008 at 3:42 PM, Brad Nicholes [EMAIL PROTECTED] wrote: I think that with the removal of the srclib directory from the SVN trunk repository, we have completed everything that we thought needed to be done before creating the 3.1.x stable branch. The only other thing that I know of is testing to make sure that an older 3.0.x gmetad can consume the XML data from a newer 3.1.x gmond. Has anybody had a chance to test this?

Were we going to try and make gmond-3.1.x backwards compatible with gmond-3.0.x?

Only at the cluster level. In other words, the XML that is produced by a newer gmond should be consumable by an older gmetad and vice-versa. This would allow users to upgrade from 3.0.x to 3.1.x on a cluster by cluster basis. All nodes within a cluster would have to upgrade to gmond 3.1.x at the same time. But the whole grid would not have to upgrade. Theoretically, this should already work. We just need to test it to make sure.

So I just set up two clusters here. One that is running pure 3.0.x and another that is running pure 3.1.x with their respective web frontends reporting the cluster data. Then I crossed the two clusters by declaring new data_sources in the gmetad.conf file of each gmetad/web frontend server. I can hit the web frontend of either server and view the RRD graphs for both clusters. In other words, the 3.1.x gmetad and frontend is able to acquire data from a 3.0.x cluster and a 3.0.x gmetad and frontend is able to acquire data from a 3.1.x cluster. So the cluster by cluster migration from 3.0.x to 3.1.x should work just fine. Brad
Re: [Ganglia-general] [Ganglia-developers] Time to create the 3.1.x stable branch...
On 3/14/2008 at 1:35 AM, in message [EMAIL PROTECTED], Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote: On Thu, Mar 13, 2008 at 01:42:05PM -0600, Brad Nicholes wrote: I think that with the removal of the srclib directory from the SVN trunk repository, we have completed everything that we thought needed to be done before creating the 3.1.x stable branch. Agree (sorta), as I was expecting also as part of this removal to have all #ifdefs for compatibility with apr 0.9.x removed and which will otherwise be problematic to support without intrusive changes going forward. Right. I'll get the #ifdef's removed as well. I am proposing that we create the 3.1.x branch on Monday (3/17). Considering how intrusive the changes are it will be better to delay that for a little longer, otherwise we will found ourselves destabilizing the branch with fixes required to get the snapshots in good shape to be used for testing (as ganglia now barely builds and has known problems running in some of the popular supported architectures). OK, then let's get some snapshots tarballs going on TRUNK and get this stuff worked out. If we delay a week, will that be long enough or is there more to it than I am seeing? Once we create the branch, I would suggest that we also take a snapshot tarball and get moving on testing and stabilization. this I suggest we release as our first ever alpha for 3.1.x, but we rather be sure it builds and runs well enough for users to be able to use it and help stabilize the 3.1 release. +1 Brad - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
[Ganglia-general] Time to create the 3.1.x stable branch...
I think that with the removal of the srclib directory from the SVN trunk repository, we have completed everything that we thought needed to be done before creating the 3.1.x stable branch. The only other thing that I know of is testing to make sure that an older 3.0.x gmetad can consume the XML data from a newer 3.1.x gmond. Has anybody had a chance to test this? I am proposing that we create the 3.1.x branch on Monday (3/17). This should give people the weekend to tidy up anything that is left before we create the branch and start working towards a stable 3.1.x release. This will include all of the web frontend changes that have recently been committed to TRUNK. Are there SPEC file changes that need to go in to support the modular web frontend stuff or was that already done? Once we create the branch, I would suggest that we also take a snapshot tarball and get moving on testing and stabilization. We are going to need the help from everybody whether you are a Ganglia developer or just a Ganglia user to make sure that we have a stable 3.1.x release. Of course once we do create the branch, the branch will be under RTC rules http://ganglia.wiki.sourceforge.net/ganglia_works . Any comments, Brad - This SF.net email is sponsored by: Microsoft Defy all challenges. Microsoft(R) Visual Studio 2008. http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/ ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] additional info about fsock open error
On 2/15/2008 at 9:34 AM, in message [EMAIL PROTECTED], Mike Olson [EMAIL PROTECTED] wrote:
> Just an FYI, I have ports 8649 to 8652 forwarded on my router to my Apache web server. I have looked at line 283 of the file and I don't know which part of that line is creating the error. The line is below:
>
>     $fp = fsockopen( $ip, $port, $errno, $errstr, $timeout);
>
> Thanks,
> Mike

Those ports should be forwarded to gmond (8649) and gmetad (8651, 8652). Take a look at your gmond.conf and gmetad.conf files, where these ports are configured. If you have those ports forwarded to the Apache server, Apache won't know what to do with the requests. In fact, what is happening is that you have just looped back from the PHP code running under Apache to the Apache server itself.

Brad
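One way to check where a forwarded port really ends up is to connect to it and look at what comes back. The following is only a rough Python sketch, not part of the Ganglia frontend: the helper names `read_xml_dump` and `looks_like_ganglia` are made up for illustration, and it assumes the daemon dumps its XML (with a `GANGLIA_XML` root element) as soon as a TCP client connects, as gmond does on 8649 with a default configuration.

```python
import socket


def read_xml_dump(host, port, timeout=5.0):
    """Connect to host:port and return everything sent until the peer closes.

    Hypothetical helper for illustration only.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(4096)
            if not data:  # peer closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def looks_like_ganglia(xml_text):
    """Rough check (an assumption): a Ganglia dump has a GANGLIA_XML element."""
    return "<GANGLIA_XML" in xml_text
```

Calling `looks_like_ganglia(read_xml_dump(ip, 8649))` against the forwarded address should return True when the port reaches gmond. If the port is forwarded to Apache instead, the connection will most likely just time out or come back without any Ganglia XML.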
Re: [Ganglia-general] new property of a host
On 1/31/2008 at 4:20 PM, in message [EMAIL PROTECTED], Doug Nordwall [EMAIL PROTECTED] wrote:
> For reference, here is a current HOST line of XML:
>
>     <HOST NAME="mybox.local" IP="10.1.1.1" REPORTED="1201820930" TN="7" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1200935314">
>
> We're looking at making some patches to ganglia to do one of the following things:
>
> 1) Replace the _value_ with the value of the job_id, and modify the code so that you can update the location. This would also include the option to keep the location as it is written currently, or spoof it with a job_id (or arbitrary value).
>
> 2) Add a new field to the HOST entry called job_id and allow it to be updated without restarting ganglia.
>
> Sadly, we can't use a job_id as a metric (well, we could, but it's more of a pain in other parts of our code). For the purposes of our machine, a machine is dedicated to a single job.
>
> The ultimate idea is to be able to drop whatever we like into that host field and have it change without an entire restart of ganglia.
>
> Do any of the main ganglia dev folks have an opinion here? We'd be looking to submit this back into the main stream once we are done. As always, I'm open to being told I'm wrong or that it has been done :)

I'm not sure that I understand what you are trying to do. From the description, I am assuming that you are trying to associate a job_id with a host. The easiest way to do that would be to just create a metric module that returns a constant job_id string as a metric. Yet you mentioned that for some reason you can't do that. Could you explain why? Otherwise, just adding a job_id attribute to the HOST doesn't sound very useful except in your case. Is there some general benefit to adding a job_id attribute?

Brad
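For illustration, a metric module of the kind Brad describes might look roughly like the Python sketch below. It follows the gmond Python module convention of a `metric_init(params)` function returning a list of metric descriptors. Everything specific in it is an assumption: the file path `/var/run/job_id`, the `job_id_file` parameter, and the descriptor values are hypothetical, and reading the value from a file (rather than returning a constant) is a variation chosen so the job id can change without restarting gmond.

```python
# Hypothetical location where a scheduler would write the current job id.
JOB_ID_FILE = "/var/run/job_id"


def get_job_id(name):
    """Metric callback: return the current job id as a string, or "none"."""
    try:
        with open(JOB_ID_FILE) as f:
            return f.read().strip() or "none"
    except (IOError, OSError):
        return "none"


def metric_init(params):
    """Called once by gmond; returns the list of metric descriptors."""
    global JOB_ID_FILE
    # "job_id_file" is an assumed module parameter, not a documented one.
    JOB_ID_FILE = params.get("job_id_file", JOB_ID_FILE)
    return [{
        "name": "job_id",
        "call_back": get_job_id,
        "time_max": 90,
        "value_type": "string",
        "units": "",
        "slope": "zero",
        "format": "%s",
        "description": "Job currently assigned to this host (illustrative)",
        "groups": "job",
    }]


def metric_cleanup():
    """Called by gmond on shutdown; nothing to release here."""
    pass
```

With something like this, the job id would show up as an ordinary string metric under the host rather than as a HOST attribute, which is the trade-off being discussed in this thread.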
Re: [Ganglia-general] scaling_max_freq error
On 1/25/2008 at 7:21 PM, in message [EMAIL PROTECTED], Jesse Becker [EMAIL PROTECTED] wrote:
> On Jan 25, 2008 9:06 PM, Bernard Li [EMAIL PROTECTED] wrote:
>> Hi Jesse:
>>
>> On 1/25/08, Jesse Becker [EMAIL PROTECTED] wrote:
>>> Interesting. How about introducing a new metric: cpu_speed_current. For systems without CPU scaling, it is forced to the same value as cpu_speed, or not sent at all, since it is completely redundant. Otherwise, it reports the current CPU speed, as determined by /proc or /sys.
>>
>> I believe the CPU speed is collected once at startup -- what if the CPU speed was clocked down (or up) after gmond started and this current speed was collected -- then it is no longer current.
>
> That's my point. The current behavior is to collect the speed once and, if it is potentially variable, report the maximum speed. That's fine behavior.
>
>> The other option is to collect this metric periodically, but probably only do this if scaling is enabled.
>
> Yep, that works too. It certainly doesn't need to be collected/sent frequently. I'd think that once a minute is probably ample.
>
>> But do people really need this feature? :)
>
> Actually, I can think of a good use for it: proving reduced power consumption. If you can point to a chart that shows you are actively throttling CPUs, that's a bonus for the people with clipboards and checkbooks. :-)
>
> This is all pretty minor stuff though.

Sounds like a fairly simple Python module to write. Just check for the existence of the scaling_max_freq file, then read it and report it. If it doesn't exist, then either just return 0 or default to cpuinfo. Any takers?

Brad
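As a rough idea of what such a module could look like, here is a minimal Python sketch, not an actual Ganglia module. It assumes the Linux cpufreq sysfs file for cpu0, which reports kHz, and takes the "return 0" fallback rather than parsing cpuinfo. The function names, metric name, and descriptor values are illustrative assumptions.

```python
# Assumed sysfs location on Linux; the value in this file is in kHz.
SCALING_MAX_FREQ = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq"


def read_khz(path):
    """Return the integer kHz value stored in path, or None if unavailable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (IOError, OSError, ValueError):
        return None


def cpu_speed_max(name, path=None):
    """Metric callback: maximum scaling frequency in MHz, or 0 if unknown."""
    khz = read_khz(path or SCALING_MAX_FREQ)
    if khz is None:
        return 0  # fallback suggested in the thread; cpuinfo is the other option
    return khz // 1000


def metric_init(params):
    """Called once by gmond; returns the list of metric descriptors."""
    return [{
        "name": "cpu_speed_max",  # hypothetical metric name
        "call_back": cpu_speed_max,
        "time_max": 90,
        "value_type": "uint",
        "units": "MHz",
        "slope": "both",
        "format": "%u",
        "description": "Maximum CPU scaling frequency (illustrative)",
        "groups": "cpu",
    }]


def metric_cleanup():
    """Called by gmond on shutdown; nothing to release here."""
    pass
```

Pointing the same reader at scaling_cur_freq instead would give something close to the cpu_speed_current metric Jesse proposes, collected at whatever interval gmond is configured to use.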