Re: [Ganglia-general] Ganglia Web top-level project + versioning

2010-11-05 Thread Brad Nicholes
 On 11/4/2010 at 6:21 PM, in message
aanlkti=oxs0t1fbsf9no5og6phqxcbjxuscm1w9kt...@mail.gmail.com, Bernard Li
bern...@vanhpc.org wrote:
 Hi Brad:
 
 [I've changed the subject line to be more reflective of the current 
 discussions]
 
 On Thu, Nov 4, 2010 at 8:50 AM, Brad Nicholes bnicho...@novell.com wrote:
 
 I'm not sure that we need to physically split the web frontend from the 
 backend as far as the Ganglia project goes.  IMO, why not just follow the 
 pattern that we already have in SVN under trunk?  Right now we have 
 trunk/monitor-core which includes everything.  Could we just create a new 
 directory under trunk called web-frontend and move everything that has to do 
 with the web frontend out of monitor-core and into web-frontend?  From that 
 point on, they could both be treated as separate projects with their own 
 release cycles without physically splitting the code into different 
 repositories.  Tagging and branches would also work the same way.
 
 That's fine.
 
 How about versioning?  Or am I thinking too much?  One potential issue
 is that ganglia-core would be at 4.0 and ganglia-web will be at 3.5 --
 this might cause confusion as to what combination is supported, or
 vice versa.
 

As far as versioning goes, I think that ganglia-web would just follow its own 
version scheme.  The frontend might have to include some kind of check on the 
version of the backend to make sure that it is compatible.  I'm not sure how 
flexible the frontend could be, but since all it is doing is consuming XML, I 
am guessing that it could be fairly flexible when it comes to backward 
compatibility.  I am guessing that the most likely scenario is that a user 
would upgrade the frontend a lot more frequently than the backend.  So there 
probably wouldn't have to be much need to worry about an older frontend having 
to support a newer backend.  I think it would be a natural thing for a Ganglia 
user to automatically upgrade the frontend whenever the backend is upgraded.  
But they would probably upgrade the frontend routinely without a backend upgrade.

Anyway, yes I think you are thinking too much :-)  Documenting compatibility 
would probably be sufficient.  Of course we, as the Ganglia developers, wouldn't 
be able to test every new release of the frontend with every previous release 
of the backend.  But like I said, since the frontend is just consuming XML, it 
should be flexible enough to handle backwards compatibility.  Also, since the 
XML schema isn't expected to change, at least not drastically, within a major 
version of the backend, backward compatibility should be simple.
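
As an illustration of the kind of backend version check mentioned above, here 
is a minimal sketch in Python.  It is not code from the Ganglia frontend; it 
assumes that the root element of the XML carries a VERSION attribute (as in 
the sample string below), and the names backend_major, is_supported and 
supported_majors are made up for this example.

```python
import xml.etree.ElementTree as ET


def backend_major(xml_text):
    """Return the major version number found on the root element."""
    root = ET.fromstring(xml_text)
    version = root.get("VERSION")
    if version is None:
        raise ValueError("root element has no VERSION attribute")
    # Only the first dotted component is used, so "3.1.7" gives 3.
    return int(version.split(".")[0])


def is_supported(xml_text, supported_majors=(3,)):
    """True if the backend major version is one the frontend knows about."""
    return backend_major(xml_text) in supported_majors


# Hypothetical, heavily trimmed sample of backend XML output.
SAMPLE = '<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond"></GANGLIA_XML>'

if __name__ == "__main__":
    print(is_supported(SAMPLE))
```

Checking only the major version matches the expectation above that the XML 
schema stays stable within a major version of the backend.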

Brad 




--
The Next 800 Companies to Lead America's Growth: New Video Whitepaper
David G. Thomson, author of the best-selling book Blueprint to a 
Billion shares his insights and actions to help propel your 
business during the next growth cycle. Listen Now!
http://p.sf.net/sfu/SAP-dev2dev
___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general


Re: [Ganglia-general] Welcoming newest members to the Ganglia Team!

2010-11-04 Thread Brad Nicholes
As part of the Ganglia development team, I just wanted to add my welcome 
to all of the new committers as well.  It is always great to see so many 
community members wanting to pitch in and help move the project forward.

Brad


 On 11/3/2010 at 11:48 PM, in message
aanlktimpmvlvbl0cykepoay0uhwvkeuc5v-zldqgv...@mail.gmail.com, Bernard Li
bern...@vanhpc.org wrote:
 Dear all:
 
 Hope that you are having a great week so far!  I have some exciting
 news to announce.
 
 We have recently added a few active community members to the Ganglia
 Team to help with upcoming releases and development.  Project
 administrators routinely scout active participants of the Ganglia
 community and invite them to join our ranks to further the project.
 
 Without further ado, please welcome the newest members of our team!
 
 Ben Kero (bkero @IRC) joining us from Mozilla, will be helping out
 with Wiki documentation.
 
 Kostas Georgiou (londo @IRC) is the current Fedora packager and should
 have been added to the team long ago :)
 
 Nicolas Brousse (orieg @IRC) has written the gmond PHP DSO module
 interface and contributed to the gmond Perl DSO module interface.  He
 will act as Release Manager for the upcoming 3.0.8 release.
 
 Erik Kastner (kastner @IRC) from Etsy, and Vladimir Vuksan (vvuksan
 @IRC) from Broadcom have been spearheading the Ganglia Web Frontend
 re-write, and will hopefully be giving our frontend a much overdue
 facelift.
 
 There is always more work that needs to be done, so if you like
 Ganglia and would like to help out, please do not hesitate to contact
 us!  You can always help by testing new releases, filing bugs,
 submitting patches, forking our GitHub plugin repositories and
 answering questions on mailing-lists and web forums.
 
 For more information on how the project works, please have a look at
 our Wiki: https://sourceforge.net/apps/trac/ganglia/wiki/how_project_works 
 
 A friendly reminder that Matt and I will be holding a Ganglia BoF at
 LISA '10, details here:
 http://www.usenix.org/events/lisa10/bofs.html#ganglia 
 
 Thanks for your attention and continued support for the project!
 
 Bernard
 -- on behalf of the Ganglia Development Team
 






Re: [Ganglia-general] [Ganglia-developers] IRC chat on Ganglia Web Frontend re-write 10/13/2010 (Wed) 9-10am PDT

2010-11-04 Thread Brad Nicholes
I'm not sure that we need to physically split the web frontend from the backend 
as far as the Ganglia project goes.  IMO, why not just follow the pattern that 
we already have in SVN under trunk?  Right now we have trunk/monitor-core which 
includes everything.  Could we just create a new directory under trunk called 
web-frontend and move everything that has to do with the web frontend out of 
monitor-core and into web-frontend?  From that point on, they could both be 
treated as separate projects with their own release cycles without physically 
splitting the code into different repositories.  Tagging and branches would 
also work the same way.

The only purpose I see for splitting them into two different projects is to try 
to grow two different communities (ie. developers with rights to the web 
project who don't necessarily have rights to the monitor-core project and 
vice-versa).  Given the fact that we don't really have a large developer 
community, I'm not sure that it would be a good idea to split the community 
that we have.

Brad



 On 11/4/2010 at 1:15 AM, in message
aanlktimset-nck4h0wrktf6dyszsdf1uv_mxxslup...@mail.gmail.com, Bernard Li
bern...@vanhpc.org wrote:
 Hi all:
 
 The other day we were talking on IRC regarding how to proceed with
 this re-write effort for the frontend.  In the beginning, I was
 gung-ho on this re-write from scratch, however, recently Vladimir has
 been hacking away adding new features to the existing code in trunk.
 You can get a taste of it here:
 
 http://ec2-184-72-167-114.compute-1.amazonaws.com/ganglia-new/ 
 
 Which got me to thinking...  is a re-write from scratch the best
 approach, or should we just try to keep extending what we have?
 
 Another administrative issue that cropped up, is whether to split out
 Ganglia-Web as a separate project such that it doesn't need to follow
 the main Ganglia release cycle (since the frontend code is usually
 backward/forward compatible with Ganglia releases anyway).
 
 My idea is to create a new project for the frontend, give it a new
 name and start with a new version.  With that, we can tell users that
 after Ganglia X, we will no longer be shipping the web component, use
 Y for that.
 
 Another approach is to retain the Ganglia name, but say that after
 Ganglia 3.2, there are 2 separate projects, ganglia and ganglia-web,
 in which case ganglia-web will be on a different release cycle than
 ganglia.
 
 Sounds confusing?  Yes it is! :)
 
 I don't really care either way, as long as it causes the least
 confusion to the users -- feel free to offer Plan C.
 
 Another plan I have in mind is after we create branch-3.2 from trunk,
 we remove the web component from the code base, in which case all
 future bug fixes to ganglia-web go into that branch only, and we
 will move development to GitHub (just for the frontend).
 
 Thanks!
 
 Bernard
 
 On Thu, Oct 21, 2010 at 11:29 PM, Bernard Li bern...@vanhpc.org wrote:
 Hi all:

 Sorry for the delay in posting the log, but I have finally uploaded
 it.  Thanks Jesse for logging:

 http://therealms.org/oss/ganglia/ganglia_frontend_rewrite_irc_101310.txt 

 I have left the log as is, I just filtered out people's hostnames and
 stuff.  I chopped off at the end when we started discussing outside
 the scope of the frontend re-write.

 I will try to summarize the log in the next few days, but if anybody
 else who was there would like to take a stab at it, please feel free.

 I think Erik and Vladimir have been hard at work hacking at a Ganglia
 installation on an AWS instance.  We will try to schedule another time
 to sync up and discuss further (would a phone teleconference be better
 this time, or should we stick with IRC)?

 It doesn't look like the hackathon would happen next month.  It might
 become a virtual hackathon but I would really like to put all the
 developers in a room, but anyway, we'll see.

 Thanks again for all who showed up, and for all the great discussions.

 Cheers,

 Bernard

 On Wed, Oct 13, 2010 at 11:52 AM, Jesse Becker haw...@gmail.com wrote:
 I have a log that I will try to clean up and post later today.

 On Wed, Oct 13, 2010 at 14:46, Dave Josephsen d...@dbg.com wrote:
 Hey all,

 Did anyone take minutes?  I wasn't able to attend but am interested in 
 hearing about the chat.

 Thanks

 -dave

 - Original Message -
 From: Bernard Li bern...@vanhpc.org
 To: ganglia-develop...@lists.sourceforge.net, Ganglia 
 ganglia-general@lists.sourceforge.net
 Sent: Thursday, October 7, 2010 1:55:26 PM GMT -06:00 US/Canada Central
 Subject: [Ganglia-general] IRC chat on Ganglia Web Frontend re-write 
 10/13/2010 (Wed) 9-10am PDT

 Dear all:

 I've been talking to people on and off about doing a web frontend
 re-write, in fact I have been thinking about it since almost three
 years ago when I started the wishlist thread:

 
 http://www.mail-archive.com/ganglia-develop...@lists.sourceforge.net/msg03070.html

 I've managed to gather a group of developers and 

Re: [Ganglia-general] Help: Using Ganglia with KVM/QEMU/libvirt ? Any Python DSOs out there?

2010-10-21 Thread Brad Nicholes
 On 10/20/2010 at 8:22 PM, in message
aanlkti=ezmjzo4s6sqyp4m7bdhposhxp2oufagz_z...@mail.gmail.com, Lukas Lundell
lukaslund...@gmail.com wrote:
 Looking to use Ganglia to monitor a virtual linux environment (kvm/qemu).  I
 haven't seen any plugins or Python DSOs for something like libvirt so that
 Ganglia could get information for the virtualized  guest linux
 instances/domains.  Does anyone have any experience doing this?
 
 I would like to go about writing a Python DSO to interface with libvirt if
 there isn't one already out there...
 

I actually wrote a Python module a couple of years ago that would collect 
metrics for each of the VMs running on a host.  It would then report the 
metrics through the spoofing functionality of Ganglia so that each of the VMs 
would show up in Ganglia just as if there were a gmond agent running on them, 
even though no gmond agent existed in any of the VMs, only on the host.  
Anyway, the disappointing part of this whole exercise is that there are very 
few useful metrics that you can gather through libvirt.  Most of the metrics 
are constants, such as how much memory or disk space has been allocated to the 
VM, rather than dynamic values such as memory utilization.  I haven't 
revisited this module in a couple of years, so it might be that there is more 
useful information that can be gathered now.  At the time I was also querying 
Xen VMs.
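
To make the spoofing idea concrete, here is a rough sketch (not the module 
described above) that builds a gmetric command line reporting a value on 
behalf of a VM.  It relies on gmetric's --spoof option, which takes an 
"IP:hostname" pair; the function name, the VM address and the metric name are 
invented for the example, and how the value would be obtained from libvirt is 
deliberately left out.

```python
def spoofed_gmetric_argv(vm_ip, vm_name, metric, value,
                         value_type="uint32", units=""):
    """Build the argument list for a gmetric call made on behalf of a VM."""
    argv = [
        "gmetric",
        "--name", metric,
        "--value", str(value),
        "--type", value_type,
        # Spoofing makes the metric appear under the VM, not the host.
        "--spoof", "%s:%s" % (vm_ip, vm_name),
    ]
    if units:
        argv += ["--units", units]
    return argv


if __name__ == "__main__":
    # Hypothetical VM and metric; pass the list to subprocess.call() to send.
    print(" ".join(spoofed_gmetric_argv("10.0.0.21", "vm01",
                                        "vm_mem_allocated", 2048,
                                        units="MB")))
```

Run from the host for each guest, this would make every VM show up as its own 
host entry without a gmond agent inside the guest.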

Brad




Re: [Ganglia-general] How is the multicpu module to be used?

2010-06-16 Thread Brad Nicholes
 On 6/16/2010 at 2:10 PM, in message 20100616201041.ga9...@transpect.com, Whit
Blauvelt w...@transpect.com wrote:
 Hi,
 
 I've compiled ganglia-3.1.7 on CentOS 5.5. The main thing I'm trying to
 monitor on our cluster is load on individual CPU cores. It looks like the
 included multicpu module should do that. I can't find any complete
 description of how to set that up. What should I be looking at?
 

In the multicpu.conf file there is a note that says that additional metric 
definitions should be added to the .conf file for each discovered cpu on the 
system.  You can get all of the cpu definitions for a system by running gmond 
with a -m parameter.  The output of gmond -m will be a list of all of the 
discovered metrics.  Each metric that begins with multicpu_ will need to be 
defined as a metric in the multicpu.conf file.  Then start gmond normally and 
it should start monitoring all of the cpu metrics for each discovered cpu.
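
The step of turning the gmond -m listing into metric definitions can be 
scripted.  The sketch below is only an illustration: it assumes that each 
line of the gmond -m output starts with the metric name followed by 
whitespace, and the sample text is made up rather than captured from a real 
run.  The generated blocks would still have to be pasted into the 
collection_group section of multicpu.conf by hand.

```python
def multicpu_metric_blocks(gmond_m_output, prefix="multicpu_"):
    """Return metric { } blocks for every listed metric with the prefix."""
    names = []
    for line in gmond_m_output.splitlines():
        fields = line.split()
        # Assumption: the first whitespace-separated field is the name.
        if fields and fields[0].startswith(prefix):
            names.append(fields[0])
    blocks = ['  metric {\n    name = "%s"\n  }' % name for name in names]
    return "\n".join(blocks)


# Hypothetical excerpt standing in for real `gmond -m` output.
SAMPLE_OUTPUT = """\
multicpu_user0   Percentage of CPU utilization at user level
multicpu_user1   Percentage of CPU utilization at user level
load_one         One minute load average
"""

if __name__ == "__main__":
    print(multicpu_metric_blocks(SAMPLE_OUTPUT))
```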

NOTE: the next question is usually, if gmond -m can discover the metrics, why 
do we have to specifically define them as well in the .conf file?  The answer 
to this question has to do with how metrics are enabled through gmond rather 
than just discovered.  There have been previous list discussions about this 
topic.

Brad




Re: [Ganglia-general] sending float from python module ends up as integer (and very large)

2010-05-07 Thread Brad Nicholes
 On 5/6/2010 at 5:57 PM, in message
l2qd4c731da1005061657xf03acf27x1f1b19b4a7909...@mail.gmail.com, Bernard Li
bern...@vanhpc.org wrote:
 Hi David:
 
 On Thu, May 6, 2010 at 3:39 PM, David Birdsong david.birds...@gmail.com 
 wrote:
 
 i've since just converted 1.xx seconds to milliseconds and now i'm
 pretty happy with int as a precise enough data type. this doesn't
 explain why the values increment endlessly though when represented as
 floats and simply converting to ints solves the problem.
 
 'value_type' and 'format' need to match up in the descriptor.  If you
 wanted to do conversion, do it in your handler, it should work that
 way.
 
 However, I'm not sure why if there is a mismatch between 'value_type'
 and 'format' it would generate an ambiguous value -- any ideas, Brad?
 
 Another thing is, instead of letting the user set the 'format',
 shouldn't we just hardcode what they should be?
 

The primary place in the code where the value type and format come together is 
at the point where the value is converted to a string and formatted into the 
XML tag.  In this case, allowing the module to define the format also allows it 
to specify the precision that will ultimately show up in the XML output from 
gmond.  I don't know if that is really a valuable feature or a good enough 
reason to not hardcode the format string for a given value type, but that is 
how it works now.

Another place in the code where value type is very important is when the value 
is pushed through XDR.  This is the process which packages a metric into a very 
small packet which can be passed between systems safely.  In order for XDR to 
create the packet correctly, it has to know exactly what type of data it is 
dealing with.  Otherwise the data will be packaged and unpackaged by XDR using 
the wrong types and who knows what you will end up with after that.
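
The effect of a type mismatch can be shown outside of Ganglia.  The sketch 
below is not gmond's code path; it only uses Python's struct module to pack a 
value as a 4-byte big-endian float (the wire layout XDR uses for floats) and 
then read the same bytes back as an unsigned integer, which is the kind of 
mix-up described above.

```python
import struct


def misread_float_as_uint(value):
    """Pack value as a big-endian float, then decode the bytes as a uint."""
    packed = struct.pack(">f", value)          # 4 bytes, IEEE 754 single
    return struct.unpack(">I", packed)[0]      # same 4 bytes, wrong type


if __name__ == "__main__":
    # 1.5 is 0x3FC00000 as a single-precision float.
    print(misread_float_as_uint(1.5))          # prints 1069547520
```

A small float turning into a very large integer is consistent with the 
symptom in the subject line of this thread.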

Brad







Re: [Ganglia-general] sending float from python module ends up as integer (and very large)

2010-05-07 Thread Brad Nicholes
 On 5/7/2010 at 12:48 PM, in message
r2wd4c731da1005071148t4107614fj661b0e3b5a27a...@mail.gmail.com, Bernard Li
bern...@vanhpc.org wrote:
 Hi Brad:
 
 On Fri, May 7, 2010 at 7:37 AM, Brad Nicholes bnicho...@novell.com wrote:
 
 The primary place in the code where the value type and format come together 
 is at the point where the value is converted to a string and formatted into 
 the XML tag.  In this case, allowing the module to define the format also 
 allows it to specify the precision that will ultimately show up in the XML 
 output from gmond.  I don't know if that is really a valuable feature or a 
 good enough reason to not hardcode the format string for a given value type, 
 but that is how it works now.

 Another place in the code where value type is very important is when the 
 value is pushed through XDR.  This is the process which packages a metric 
 into a very small packet which can be passed between systems safely.  In 
 order for XDR to create the packet correctly, it has to know exactly what 
 type of data it is dealing with.  Otherwise the data will be packaged and 
 unpackaged by XDR using the wrong types and who knows what you will end up 
 with after that.
 
 I have updated the Wiki page with additional information regarding the
 format string for the metric.  Currently I am referencing the Python
 format string format, however, it looks like I should be referencing
 apr_snprintf()'s...?
 
 http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_gmond_python_modules 
 

Right, apr_snprintf() is used to format the string that is used in the XML tag.  
Basically, if the format string follows the printf() conventions, it is 
fine.
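
For reference, a descriptor in which 'value_type' and 'format' agree might 
look like the following sketch.  The metric name, handler and values are 
invented for the example; the dictionary keys are the ones used by the gmond 
Python module interface.  The last lines use Python's own % operator only to 
preview the printf-style format; gmond itself formats the value with 
apr_snprintf().

```python
def response_time_handler(name):
    """Placeholder callback; a real module would measure something here."""
    return 1.25


def metric_init(params):
    """Return the list of metric descriptors for this module."""
    descriptor = {
        "name": "response_time",
        "call_back": response_time_handler,
        "time_max": 90,
        "value_type": "float",    # the callback returns a float ...
        "units": "seconds",
        "slope": "both",
        "format": "%.3f",         # ... so a float conversion is used
        "description": "Example response time",
        "groups": "example",
    }
    return [descriptor]


def metric_cleanup():
    """Nothing to release in this sketch."""
    pass


if __name__ == "__main__":
    d = metric_init({})[0]
    print(d["format"] % d["call_back"](d["name"]))   # prints 1.250
```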

Brad




Re: [Ganglia-general] Problem to include Plugin

2010-04-27 Thread Brad Nicholes
 On 4/27/2010 at 3:20 AM, in message 1272360012.4619.9.ca...@station3.hq,
Patrick Datko patrick.da...@ymc.ch wrote:
 Hey People,
 I'm using Ganglia 3.1.2, installed with aptitude, to observe my cluster
 and it works without any problem. I wanted to integrate a metric which
 monitors the traffic of the individual nodes, so I built a small Python
 module to check an XML file which contains the traffic amount. I used the
 SourceForge wiki to build it. I included in my Python script the 3
 functions (traffic_handler, metric_init, metric_cleanup) which are
 required by Ganglia and added the following lines
 to /etc/ganglia/gmond.conf
 
 module {
   name = traffic
   language = python
   path = /usr/lib/ganglia/traffic.py
 }
 
 collection_group {
   collect_every = 10
   time_threshold = 50
   metric {
     name = traffic
     title = Traffic
     value_threshold = 70
   }
 }
 
 But if I restart gmond and gmetad, the metric still does not appear in the
 web interface of Ganglia. Does anyone have a clue where the problem is, or
 has anyone had the same problem?
 


Have you run your module independent of gmond to make sure that it is 
functioning correctly?  Have you tried starting gmond with a -d 10 command line 
parameter to force the debug output to the screen?  This will usually show you 
if there is a problem loading the module.
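
Running a module independently of gmond usually just means giving it a small 
main section.  The sketch below is a stand-in for the traffic module from the 
question, not its actual code: the handler returns a fixed placeholder 
instead of reading the XML file, and run_standalone is a helper name made up 
for this example.

```python
def traffic_handler(name):
    """Placeholder; the real module would read the traffic XML file."""
    return 42


def metric_init(params):
    """Descriptor list, as gmond would receive it."""
    return [{
        "name": "traffic",
        "call_back": traffic_handler,
        "time_max": 90,
        "value_type": "uint",
        "units": "bytes",
        "slope": "both",
        "format": "%u",
        "description": "Traffic amount",
        "groups": "network",
    }]


def metric_cleanup():
    pass


def run_standalone():
    """Call every callback once, the way a manual test run would."""
    lines = []
    for d in metric_init({}):
        lines.append("value for %s is %s" % (d["name"],
                                             d["call_back"](d["name"])))
    return lines


if __name__ == "__main__":
    for line in run_standalone():
        print(line)
    metric_cleanup()
```

If this prints a sensible value when run with the python interpreter, the 
remaining suspects are the gmond configuration and module loading, which is 
what the -d 10 debug output helps with.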

Brad




Re: [Ganglia-general] Problem with custom metrics

2010-04-12 Thread Brad Nicholes
Actually Bernard is the guru here.

thanks Bernard  :)

 On 4/12/2010 at 12:20 PM, in message c7e8dca8.7895%hugo.hernan...@nih.gov,
Hernandez, Hugo (NIH/NIAID) [C] hugo.hernan...@nih.gov wrote:
 Brad,
 Those changes did the trick.
 Thanks a lot!   Now, I can explore my new metrics to be added.
 -Hugo
 
 
 On 4/12/10 2:14 PM, Bernard Li bern...@vanhpc.org wrote:
 
 Hi Hugo:
 
 On Mon, Apr 12, 2010 at 9:51 AM, Hernandez, Hugo (NIH/NIAID) [C]
 hugo.hernan...@nih.gov wrote:
 
 [r...@rocks ~]# python /opt/ganglia/lib64/ganglia/python_modules/hostTemp.py
 value for tempHost is 8
 
 Try renaming your Python file tempHost.py.  That's the name given to
 your module.
 
 You might also want to rename your pyconf tempHost.pyconf for the sake
 of consistency.
 
 Cheers,
 
 Bernard
 





Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.7 ready for testing

2010-03-02 Thread Brad Nicholes
 On 3/2/2010 at 4:23 AM, in message 4b8cf534.7090...@pocock.com.au, Daniel
Pocock dan...@pocock.com.au wrote:

 Thanks to those who provided feedback - any objections to making 3.1.7
 generally available?  I would like to make it GA within the next 1-2
 days now.
 
 

+1


 Michael Perzl wrote:
 I have successfully compiled and tested 3.1.7 on
 - AIX 5.1 ML04
 - AIX 5.3 ML00
 - AIX 5.3 TL07
 - AIX 6.1 TL03

 Regards,
 Michael

 On 02/22/2010 12:15 PM, Daniel Pocock wrote:
   
 Just a reminder - any feedback is welcome, or feel free to discuss 3.1.7
 on IRC

 It would be good to have positive confirmation of which platforms this
 has been tested on.  So far, I have tested:
 - Debian lenny
 - RHEL3/4/5
 - CentOS 5
 - Solaris 8
 - Cygwin

 and Brad has done some testing on SLES10

 Regards,

 Daniel

 Daniel Pocock wrote:

 
 I've tagged 3.1.7 and built a tarball:

  http://ganglia.info/testing/ganglia-3.1.7.tar.gz 

 The md5sum for 3.1.7 is: 6aa5e2109c2cc8007a6def0799cf1b4c

 Since 3.1.6, only two things have changed and may need to be tested
 again by those who tested 3.1.6:
   - the build system (support for commas in CFLAGS)
   - the multicpu module - percentages reported differently

 This is not confirmation that the release is in GA status - a further
 notification will be sent when the testing period has elapsed without
 any serious defect.  Users are invited to test the tarball and submit
 feedback.

 Please do not commit on branches/monitor-core-3.1 until after 3.1.7
 goes GA, in case further tweaks are needed to facilitate a successful
 release.

 Below are the release notes from the STATUS file.  Other documentation
 has also changed since 3.1.2 and should be reviewed:

 GANGLIA 3.1 STATUS:   -*-text-*-
 Last modified at [$Date: 2010-02-17 11:01:08 + (Wed, 17 Feb 2010) $]

 The current version of this file can be found at:

*
 
 http://ganglia.svn.sourceforge.net/svnroot/ganglia/branches/monitor-core-3.1/STATUS

 Release history:

  3.1.7           : Tagged: Feb 17, 2010
  3.1.6           : Tagged: Feb  4, 2010 (not released for GA)
  3.1.5(hargrave) : Tagged: Nov 24, 2009 (not released for GA)
  3.1.4(hargrave) : Tagged: Oct 26, 2009 (not released for GA)
  3.1.3(avenger)  : Tagged: Sep 19, 2009 (not released for GA)
  3.1.2(langley)  : Released: Feb 17, 2009
  3.1.1(wien)     : Released: Sep 10, 2008
  3.1.0(amelia)   : Released: Jul 30, 2008

 Contributors looking for a mission:

* Just do an egrep on TODO, XXX or FIXME in the source.
* Review the bug database at: http://bugzilla.ganglia.info/ 
* Open bugs in the bug database.
* Implement a feature from the wishlist at:
 http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_wish-list 

 CURRENT RELEASE NOTES:
(Please update this area with a brief description of bug fixes and
 enhancements that have been backported for the current release)

Note: 3.1.3, 3.1.4, 3.1.5 and 3.1.6 never became GA, therefore,
the release notes for all of them are combined below.

3.1.7:

* Fix build support for RHEL5/issue with commas in CFLAGS
* multicpu module: show CPU utilization as a value between 0-100% for
  each core

3.1.6:

* Merge commit 1966 from trunk to fix contrib/removespikes.pl
* Bootstrapping with Debian 5.0 (lenny) versions of autotools for
  this and future releases.

 
 http://www.mail-archive.com/ganglia-develop...@lists.sourceforge.net/msg05352.html

 
 http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg04688.html
  
 
* Require user to explicitly specify sysconfdir when building from source,
  due to the fact that the old behavior was not consistent with the
  documented behavior.
* Configuration files and scripts are now created during the install phase
  rather than during configure.   This allows values such as @sysconfdir@
  to be used in the template configuration files.
* Abolish the use of release names - only release numbers will be used
  to distinguish versions in future
* libmetrics: workaround system header conflict in DFBSD= 2.4 (BUG245)
* Use PCRE regex matching to configure metrics using the name_match directive
* rrdcached support
* gmetad now uses apr and the sleep intervals between polls are randomized
  in a way that supports shorter polling intervals
* FreeBSD support: fixes for crashes and disk statistics (BUG153)
* Further tweaks to Solaris build support (remove C99 hack)
* Eliminate conflict with ncpus symbol name on older Solaris
* AIX support: determine if the host is a virtual server (BUG226)
* AIX support: setting linker flags (BUG227), add -lm
* AIX support: tweaks for AIX= v6.1
* AIX support: revised init scripts for gmond and gmetad
* Check for Python.h explicitly
* Include the necessary Python files in the 

Re: [Ganglia-general] gmetad and RDD problem

2010-02-10 Thread Brad Nicholes
 On 2/10/2010 at 1:36 AM, in message
70933b58740d5049a7ab96254a66683301a0e...@yaca.intra.cea.fr, GOGUEY-MUETHON
Nicolas OSIATIS nicolas.goguey-muet...@cea.fr wrote:
 Hello ,
 
  
 
 I have lot of log with error like this:
 
  
 
 Feb 10 09:29:52 SERVEUR /usr/sbin/gmetad[22332]: RRD_update 
 (/var/lib/ganglia/rrds/NOEUDS/__SummaryInfo__/part_max_used.rrd): illegal 
 attempt to update using time 1265790592 when last update time is 1265790592 
 (minimum one second step)
 
  

When I have seen this problem, it was because the system time on one of the 
monitored nodes was ahead of the system time of the machine that is running 
gmetad.  Make sure that the system time on all of your machines is in sync.
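
A rough way to spot such a node is to compare the timestamps in the gmond XML 
with the clock on the gmetad machine.  The sketch below is only a diagnostic 
idea, not part of Ganglia: it assumes that each HOST element carries a 
REPORTED attribute holding a Unix timestamp, and the sample XML is made up.  
The caller passes the current time of the gmetad machine as `now`.

```python
import xml.etree.ElementTree as ET


def hosts_ahead_of(xml_text, now):
    """Return (host name, seconds ahead) for hosts reporting a future time."""
    root = ET.fromstring(xml_text)
    ahead = []
    for host in root.iter("HOST"):
        reported = int(host.get("REPORTED", "0"))
        if reported > now:
            ahead.append((host.get("NAME"), reported - now))
    return ahead


# Hypothetical, trimmed sample; real output has many more attributes.
SAMPLE = """<GANGLIA_XML VERSION="3.1.7" SOURCE="gmond">
<CLUSTER NAME="NOEUDS" LOCALTIME="1265790592">
<HOST NAME="node1" REPORTED="1265790590"/>
<HOST NAME="node2" REPORTED="1265790650"/>
</CLUSTER>
</GANGLIA_XML>"""

if __name__ == "__main__":
    print(hosts_ahead_of(SAMPLE, 1265790592))
```

In practice `now` would come from time.time() on the gmetad machine, and any 
host listed here is a candidate for a clock that runs ahead.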

Brad




Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing

2009-12-02 Thread Brad Nicholes
 On 12/2/2009 at 7:21 AM, in message 4b1677e4.8000...@pocock.com.au, Daniel
Pocock dan...@pocock.com.au wrote:
 I would like gmond to return a non-zero return code if it fails to 
 initialise, e.g. if it is unable to bind or if it is unable to resolve a 
 hostname mentioned in gmond.conf
 
 Otherwise, the init-script always says that it started '[OK]' even if 
 the daemon process has died on startup.
 
 That is why this change was made.  However, I see a few solutions going 
 forward:
 
 - we can discard the patch completely
 
 - we can discard the patch, and I could write another patch that does 
 some tests (e.g. resolving host names) before daemonizing
 
 - we can #ifdef the patch so that on BSD systems, it daemonizes earlier, 
 and on other systems it does so later
 
 - we can modify the init script to sleep and then call `ps -C gmond' and 
 determine if it kept running
 
 - post the problem on the apr dev list and discuss it there before 
 making any decision
 
 

I'm not sure that I have anything to add as far as the discussion of this issue 
goes, but I have commit rights on the APR project.  If you go with the last 
option and take this discussion to the APR-dev list, I can certainly get 
whatever patch is agreed upon committed and backported in APR.  The downside to 
that option is that we would have to bundle the latest APR RPMs or tarball with 
Ganglia rather than using the distro version.  So even if we do find a solution 
in APR, we will probably still have to build in a workaround in gmond.
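
One of the workarounds listed above, testing hostname resolution before 
daemonizing, can be sketched as follows.  gmond is written in C, so this 
Python version only illustrates the control flow; the function names are made 
up, and the resolver is passed in as a parameter so the check can be 
exercised without a network.

```python
import socket


def unresolvable_hosts(hostnames, resolve=socket.getaddrinfo):
    """Return the hostnames that fail to resolve."""
    failed = []
    for name in hostnames:
        try:
            resolve(name, None)
        except socket.gaierror:
            failed.append(name)
    return failed


def exit_code_for(hostnames, resolve=socket.getaddrinfo):
    """Non-zero if any configured host cannot be resolved."""
    return 1 if unresolvable_hosts(hostnames, resolve) else 0


if __name__ == "__main__":
    # Run before detaching, so the init script sees the failure.
    print(exit_code_for(["localhost"]))
```

Because the check runs before the process detaches, a non-zero result can be 
handed straight back to the init script instead of '[OK]'.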

Brad




Re: [Ganglia-general] [Ganglia-developers] 3.1.4 to go GA?

2009-11-20 Thread Brad Nicholes
 On 11/20/2009 at 8:07 AM, in message 4b06b0af.1050...@pocock.com.au, Daniel
Pocock dan...@pocock.com.au wrote:
 Brad Nicholes wrote:
 I've been running it on a very small set of machines.  It all looks good to 
 me.
   
 
 No complaints from anyone... is that sufficient to go live?  I'm not 
 sure if I have the access level to put the release on the SF site though.

You are the release manager.  The decision to go live is your call.  :)

Brad




Re: [Ganglia-general] [Ganglia-developers] 3.1.4 to go GA?

2009-11-18 Thread Brad Nicholes
I've been running it on a very small set of machines.  It all looks good to me.

Brad

 On 11/18/2009 at  9:42 AM, in message
d4c731da0911180842x74ecc2c3p2f440e9c521d7...@mail.gmail.com, Bernard Li
bern...@vanhpc.org wrote: 
 I haven't had a chance to test it out yet -- has anybody else been
 able to give it a spin?
 
 Cheers,
 
 Bernard
 
 On Wed, Nov 18, 2009 at 7:22 AM, Daniel Pocock dan...@pocock.com.au wrote:


 How do people feel about making 3.1.4 GA?




Re: [Ganglia-general] Ganglia cannot find a data source.

2009-11-17 Thread Brad Nicholes
 On 11/17/2009 at 10:04 AM, in message
b1eec58d0911170904r2f2613ads9244341a82b85...@mail.gmail.com, Ryan Robertson
89esp...@gmail.com wrote:
 I too have been bangin my head on this for a few weeks.  After much googling
 i cannot seem to find the answer, so i hope someone (developer maybe) can
 help.
 
 
 I was successfully using ganglia 2.5 and 3.0.x.  At some point i upgraded to
 3.1.x and things went sour.  I've even tried to revert back to a known
 working condition to no avail.  So here's my current setup.
 
 GMETAD 3.1.4 running under suse 11.1 ppc.  Using a basic gmetad.conf file
 monitoring itself (localhost) for troubleshooting purposes:
 ---snip from /etc/gmetad.conf ---
 data_source "my cluster" localhost gpipnim01
 data_source "sap_app" gpiptcpeap02
 ---snip-
 XML on localhost seems fine.  I can telnet to localhost 8649 and get proper
 results.  FWIW :
  <GANGLIA_XML VERSION="3.1.4" SOURCE="gmond">
 
 RRD's are updating properly in /var/lib/ganglia/rrds/
 
 gmond (on localhost) in debug mode is sending updates (obviously since RRD's
 are being created).  gmond -m shows modules are loaded.
 
 Web frontend:
 When I hit the webpage i get 
 Ganglia cannot find a data source. Is gmond running?
 

When you telnet to 8652 what do you get?  Localhost 8649 is the output from 
gmond on localhost.  Localhost 8652 is the interactive port from gmetad which 
is the port that the web frontend uses to get the metric data.
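
The same comparison can be scripted instead of typed into telnet.  This is 
only a sketch, assuming the default ports mentioned above (8649 for gmond, 
8652 for gmetad's interactive port) and the /?filter=summary request quoted 
later in this thread; read_port and xml_source are made-up helper names.

```python
import re
import socket


def read_port(host, port, request=None, timeout=5.0):
    """Return everything the daemon sends before it closes the connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        if request is not None:
            # The interactive port waits for a query line before answering.
            sock.sendall(request.encode("ascii"))
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    # latin-1 never fails to decode, which is enough for a quick look.
    return b"".join(chunks).decode("latin-1")


def xml_source(xml_text):
    """Return the SOURCE attribute of the GANGLIA_XML root, or None."""
    match = re.search(r'<GANGLIA_XML[^>]*\bSOURCE="([^"]*)"', xml_text)
    return match.group(1) if match else None


# Example use against a live host (not run here):
#   xml_source(read_port("localhost", 8649))
#   xml_source(read_port("localhost", 8652, "/?filter=summary\n"))
```

If the second call fails or returns None while the first one works, the 
problem is between the frontend and gmetad rather than in gmond.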

Brad




Re: [Ganglia-general] Ganglia cannot find a data source.

2009-11-17 Thread Brad Nicholes
Sounds to me like it could be a file permissions problem then.  Is your Apache 
server able to access the rrd files and/or port 8652?
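
One rough way to check the file side is to walk the rrd tree while running 
as the web server's user.  A minimal sketch, assuming the rrds live under 
/var/lib/ganglia/rrds/ as in the original report; unreadable_rrd_paths is a 
made-up name, and os.access only answers for the user running the script.

```python
import os


def unreadable_rrd_paths(rrd_root):
    """List directories and .rrd files the current user cannot read."""
    if not os.access(rrd_root, os.R_OK | os.X_OK):
        return [rrd_root]
    bad = []
    for dirpath, dirnames, filenames in os.walk(rrd_root):
        # os.walk silently skips directories it cannot list, so test
        # each subdirectory from its parent.
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if not os.access(path, os.R_OK | os.X_OK):
                bad.append(path)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if name.endswith(".rrd") and not os.access(path, os.R_OK):
                bad.append(path)
    return bad


# Example use, run as the Apache user (not run here):
#   print(unreadable_rrd_paths("/var/lib/ganglia/rrds"))
```

An empty list means the permissions on the rrd files are probably not the 
culprit.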



 On 11/17/2009 at  1:00 PM, in message
0016e64c2536e598710478969...@google.com, 89esp...@gmail.com wrote: 
 Ahh yes, i knew there was one other telnet snippet question. I am able to  
 telnet to localhost 8652 and feed it
 /?filter=summary
 
 I get output... the output scrolled off the screen, but you get the idea
 that it's returning...
 
 --snip-
 </METRICS>
 <METRICS NAME="swap_total" SUM="2019320" NUM="1" TYPE="double" UNITS="KB"
 SLOPE="zero" SOURCE="gmond">
 <EXTRA_DATA>
 <EXTRA_ELEMENT NAME="GROUP" VAL="memory"/>
 <EXTRA_ELEMENT NAME="DESC" VAL="Total amount of swap space displayed in KBs"/>
 <EXTRA_ELEMENT NAME="TITLE" VAL="Swap Space Total"/>
 </EXTRA_DATA>
 </METRICS>
 <METRICS NAME="part_max_used" SUM="40.2" NUM="1" TYPE="double" UNITS="%"
 SLOPE="both" SOURCE="gmond">
 <EXTRA_DATA>
 <EXTRA_ELEMENT NAME="GROUP" VAL="disk"/>
 <EXTRA_ELEMENT NAME="DESC" VAL="Maximum percent used for all partitions"/>
 <EXTRA_ELEMENT NAME="TITLE" VAL="Maximum Disk Space Used"/>
 </EXTRA_DATA>
 </METRICS>
 </CLUSTER>
 </GRID>
 </GANGLIA_XML>
 snip






Re: [Ganglia-general] Ganglia install instructions wiki link broken

2009-11-16 Thread Brad Nicholes
 On 11/12/2009 at 8:57 AM, in message 4AFC3066.521 : 172 : 26400, Brad
Nicholes wrote:
  On 11/12/2009 at 6:12 AM, in message
 f7b2d28a-290a-4142-8f13-6034d55c2...@beforedawnsolutions.com, John Martyniak
 j...@beforedawnsolutions.com wrote:
  First off is that the best way to install Ganglia?  Is there a better  
  way? Better instructions?  I checked the documentation on ganglia.org  
  and the wiki links are broken, so couldn't get any install  
  instructions there.
  
 
 I ran into this myself the other day.  The document links on our wiki page 
 (http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_gmond_configuration) 
 are tied to the SourceForge docman, which I think went away.  Therefore the 
 links to the install docs are now broken.  At one point I had put the latest 
 ganglia and gmond installation and configuration .html pages in docman so 
 that we could reference them from the wiki.  Since that is now broken, is 
 there somewhere else where we can put the install and configuration pages?  
 These documents could possibly change with each release of Ganglia.  Since 
 they are generated docs that are included in the distro tarball, they would 
 need to be uploaded to the wiki document reference location after each 
 release.
 
 Brad
 

I think I have all of the wiki page links fixed up, especially on the 
installation and configuration page.  I also fixed up some links to the misc. 
documents about Ganglia and monitoring.  If anyone discovers any other broken 
links on the wiki, please let me know.

Brad




Re: [Ganglia-general] XML errors: XML_ParseBuffer() error at line 272: not well-formed(invalid token)

2009-11-13 Thread Brad Nicholes
 On 11/12/2009 at 8:11 PM, in message
b791204d0911121911t5628f609s88f339567d104...@mail.gmail.com, chifeng
chif...@gmail.com wrote:
 Hi folks,
 
 I got XML errors in Ganglia v3.1.2. It looks like this ticket:
 http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg05054.html
 
 Here is my errors information:
 
 Nov 12 13:43:18 labmonitor /usr/local/ganglia/sbin/gmetad[14362]: Process
 XML (BJQA1): XML_ParseBuffer() error at line 499: not well-formed (invalid
 token)

What would be nice is to change the error message in ganglia.php so that it 
produced the actual XML line or better yet, dumped the whole parse buffer to a 
file.  I have heard of this problem happening with later versions of apache 
mod_php.  Since it seems to happen sporadically and everything is being 
processed in memory, it would be nice to see exactly what the xml parser is 
complaining about or if this is a bug in mod_php.
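
Until the error message itself is improved, the same information can be 
pulled out by hand.  The sketch below is not the ganglia.php or gmetad code; 
it just feeds a captured XML dump to Python's expat bindings and, on failure, 
returns the offending line with some context and optionally saves the whole 
buffer.  find_xml_error and dump_path are made-up names.

```python
import xml.parsers.expat


def find_xml_error(xml_bytes, dump_path=None, context=1):
    """Return None if xml_bytes parses, else (lineno, message, lines)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_bytes, True)
    except xml.parsers.expat.ExpatError as err:
        if dump_path is not None:
            # Keep the whole buffer so it can be inspected later.
            with open(dump_path, "wb") as handle:
                handle.write(xml_bytes)
        lines = xml_bytes.splitlines()
        # err.lineno is 1-based; slice out the line plus its neighbours.
        first = max(err.lineno - 1 - context, 0)
        last = min(err.lineno + context, len(lines))
        return err.lineno, str(err), lines[first:last]
    return None
```

Running this over the output captured from the gmond or gmetad port at the 
moment the error shows up should make it clear whether the XML itself is bad 
or whether the parser on the receiving side is at fault.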

Brad 




Re: [Ganglia-general] special metric names in diskusage.pyconf file

2009-10-22 Thread Brad Nicholes
No, there is no solution for this yet.  The problem is that gmond does not yet 
provide a way for metrics to be dynamically defined.  Every metric has to be 
defined through the configuration file first.  I'm sure that there are many 
solutions to this problem, but nobody has stepped up to tackle this one yet.  I 
don't believe that this one will be a quick fix.  Everything is designed around 
full configuration through the config file.  Solving this problem will require 
some architectural redesign.

Brad

 On 10/21/2009 at 9:54 AM, in message
68fea9390910210854r3b116748j9c33a7f4c5ac4...@mail.gmail.com, Matt
mattmora...@gmail.com wrote:
 Is there any solution to this?
 
 It would be really beneficial to work out the metrics we want to
 publish in the python code rather than supplying them up front in the
 pyconf file.
 
 2009/7/15 Brad Nicholes bnicho...@novell.com:
 On 7/14/2009 at 4:36 PM, in message
 4120cbd6bbd82647b89d6a70694510bed1c...@exchange02.presidio.alexa.com, 
 Guolin
 Cheng guo...@alexa.com wrote:
 Hi,



  Any one knows what the metric name disk_used-metric-name stands
 for? The stanza is from diskusage.pyconf file, ganglia version 3.1.1/2.



 collection_group {

 ...

   metric {

 name = disk_used-metric-name

 ...

   }

 ...



  It looks like that the name stands for a series of metrics output from
 associated python module, but not sure what is the playing rule behind.
 Any one can shed a light into this? Thanks a lot.




 It is a place holder for the actual metric name that isn't determined until 
 you run gmond.  Seems like a chicken and egg thing, but what you have to do 
 is run gmond with the -m parameter first.  This will give you a list of all 
 of 
 the possible metrics including those that come from diskusage.py.  Then 
 extract the actual diskusage metric names from the list and plug them into 
 the .conf file.  In most cases you will have to create additional metric 
 blocks in the .conf file for each diskusage metric.  Then start gmond 
 normally and the individual disk usage metrics will be collected as expected.

 Brad




Re: [Ganglia-general] Ganglia 3.1.3 beta ready for testing

2009-10-01 Thread Brad Nicholes
 On 10/1/2009 at 4:33 PM, in message
d4c731da0910011533p2d337d0ajc80ea158d2a7...@mail.gmail.com, Bernard Li
bern...@vanhpc.org wrote:
 So has anybody else given 3.1.3 a test run?
 
 I have found some minor issues.
 
 It looks like there are new configure options added in regards to
 setuid and setgid:
 
   --enable-debug  turn on debugging output and compile options
   --enable-gexec  turn on gexec support (platform-specific)
   --enable-setuid=USER  turn on setuid support (default setuid=nobody)
   --enable-setgid=GROUP  turn on setgid support (default setgid=daemon)
 
 There are 2 issues:
 
 - extra quotation marks in the text
 - --enable-setuid is OFF by default.  This is the opposite behaviour
 from previous released versions
 
 On top of that, our spec file has not been updated with this new
 configure option and therefore the RPMs I posted do *not* setuid.
 
 I'm not sure if we should consider this as show stopper, but a simple
 fix would simply be to change the default configure option so that it
 reflects the previous behaviour.
 
 Please let me know what you guys think.
 

If this is just a simple fix, then I would vote for scrapping 3.1.3, rolling 
3.1.4 with the fix and resetting the test period.  The other option, since this 
isn't a regression, would be to release 3.1.3 as is with the defect noted in 
the release notes.  Then release 3.1.4 next month with the fixes.  I would vote 
for the first option, but I'm OK with the second if that is the way everybody 
else wants to go.

Brad




Re: [Ganglia-general] How to remove a gmetric-added metric?

2009-09-02 Thread Brad Nicholes
No, but you should be able to get the same results by setting host_dmax
in the gmond.conf file.
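
For metrics that are injected with gmetric rather than a python module, the 
expiry discussed in the quoted thread below can be set per metric.  A rough 
sketch, assuming your gmetric accepts --tmax and --dmax (check gmetric --help 
on your version); the 60s/600s defaults are simply the values Rick mentions, 
and gmetric_command is a made-up helper that only builds the command line.

```python
def gmetric_command(name, value, metric_type, tmax=60, dmax=600):
    """Build a gmetric argument list with explicit tmax/dmax in seconds."""
    return [
        "gmetric",
        "--name", name,
        "--value", str(value),
        "--type", metric_type,
        "--tmax", str(tmax),
        "--dmax", str(dmax),  # 0 would mean the metric never expires
    ]


# Example use (not run here):
#   import subprocess
#   subprocess.check_call(gmetric_command("queue_depth", 12, "uint32"))
```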

Brad

 On 9/2/2009 at 1:37 AM, in message
68fea9390909020037y2094d15es42bd13da3ea0...@mail.gmail.com, Matt
mattmora...@gmail.com wrote:
 Is there such a thing as dmax in the python interface?
 
 2009/9/2 Rick Cobb rc...@quantcast.com:
 And just to keep the ganglia-general list hopping: is there any reason the
 default for gmetric is to send with a dmax of "infinity" (0 is the syntax
 for that)?  We've patched ours internally to default to 60s tmax, 600s dmax
 for exactly the reason that people often have bugs in their gmetric scripts,
 and trying to work around dead metrics on lots of machines is painful.

 -- ReC


 On 9/1/09 11:47 PM, Rick Cobb rc...@quantcast.com wrote:

 That sentence should have read "once the dmax passes, gmond will stop
 sending the metric." Eventually, gmetad also considers the metric to have
 expired and stops reporting it to the GUI, so the chart goes away. Then all
 you need to do is remove the rrdtool file.

 Sorry for the premature "send" --
 -- ReC


 On 9/1/09 6:34 PM, Rick Cobb rc...@quantcast.com wrote:

 If you're just trying to get it to expire from your gmond/gmetad XML, just
 send it again with a non-zero tmax & dmax.  Once the dmax passes, the

 Once the graph disappears remove the rrdtool file using "rm".

 -- ReC


 On 9/1/09 8:10 AM, Raimund Eimann raim...@local.ch wrote:

 Hi,

 is there an easy way to get rid of a metric that was added with
gmetric in
 the first place?

 Cheers,
 Raimund







Re: [Ganglia-general] python metric modules

2009-09-02 Thread Brad Nicholes
  Why can't you just do the following:

# webserver.pyconf
modules {
  module {
    name = "lsof"
    language = "python"
    param httpd {
      value = "doesnt-matter"
    }
    param crawler {
      value = "doesnt-matter"
    }
  }
}

collection_group {
  collect_every = 30
  time_threshold = 60

  metric {
    name = "lsof-httpd"
    title = "Web Server open files"
    value_threshold = 1.0
  }

  metric {
    name = "lsof-crawler"
    title = "crawler open files"
    value_threshold = 1.0
  }
}

Then just iterate through all of the params that your module receives and 
construct a unique metric name by appending the param name to the module name.  
Then create a separate descriptor for each unique metric name and pass that 
back.  Gmond will generate a metric for each of the descriptors and call your 
module to gather each one individually.  This way you have one python module 
that is loaded once but generates metrics for each process that you specify in 
the config file.

Brad

 On 9/2/2009 at 2:15 AM, in message
68fea9390909020115y1485f0b2j8056ae00b36be...@mail.gmail.com, Matt
mattmora...@gmail.com wrote:
 Hi all,
 
 I'm trying to create my python metric modules as versatile as
 possible, but i'm not sure exactly how they are supposed to be used.
 For instance, i've got a python script that grabs open files for a
 process.  Can I invoke this module multiple times with different
 pyconf files? and if so, how do I change the metric name? I don't
 really want to use multiple descriptors, as I don't want to check for
 processes I know that are not on the server.  example:
 
 The lsof.py script has a descriptor for lsof, can I use the same
 lsof.py for two pyconf files? won't the metrics both show up as lsof?
 Or do I need a different lsof.py for every process/metric I want to
 capture?
 
 # webserver.pyconf
 modules {
   module {
 name = lsof
 language = python
 param name {
   value = httpd
 }
   }
 }
 
 collection_group {
   collect_every = 30
   time_threshold = 60
   metric {
 name = lsof
 title = Web Server open files
 value_threshold = 1.0
   }
 }
 
 #crawler.conf
 modules {
   module {
 name = lsof
 language = python
 param processname {
   value = crawler
 }
   }
 }
 
 collection_group {
   collect_every = 30
   time_threshold = 60
   metric {
 name = lsof
 title = crawler open files
 value_threshold = 1.0
   }
 }
 
 Thanks,
 
 Matt
 


Re: [Ganglia-general] issues with upgrade to 3.1

2009-08-27 Thread Brad Nicholes
As Bernard mentioned, take a look at the upgrade release notes:

 Please see the section Upgrading from 3.0 in the 3.1.x release notes:

 http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_release_notes 

You can't mix gmond 3.0.x and gmond 3.1.x in the same cluster.  All of the 
gmond nodes within a cluster have to be upgraded at the same time.

Brad


 On 8/27/2009 at 2:33 PM, in message 4a96edad.5070...@phys.ufl.edu, Yu Fu
y...@phys.ufl.edu wrote:
 I have just upgraded the frontend with gmetad to 3.1.2, but the Ganglia 
 web display is still the same. That is, the data of nodes with Ganglia 
 3.1.2 never got changed/updated with constant straight lines displayed 
 in the plots while nodes with Ganglia 3.0.7 are just fine in the plots. 
 What is wrong?
 
 Thanks,
 
 Yu
 
 
 Yu Fu wrote:
 No, I only upgraded the compute nodes. The machine running gmetad is 
 still old 3.0.7.

 Yu

 Bernard Li wrote:
   
 Hi Yu:

 On Thu, Aug 27, 2009 at 9:23 AM, Yu Fuy...@phys.ufl.edu wrote:

   
 
 I upgraded Ganglia from 3.0.7 to 3.1.2 on some compute nodes while the
 ones on other nodes keep unchanged. I know the gmond.conf has changed in
 3.1 so I created new gmond.conf for those upgraded machines from a
 template obtained from gmond -t. Things looked fine until I noticed
 that the upgraded nodes' data never changed/updated in the Ganglia plots
 even after clicks on Get Fresh Data button. As a result, all plots of
 upgraded nodes just show a flat straight line with initial values. Is
 this expected or something wrong? Do I have to upgrade all nodes at the
 same time?
 
   
 Have you upgraded your gmetad to 3.1.2 as well?

 Please see the section Upgrading from 3.0 in the 3.1.x release notes:

 http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_release_notes 

 Cheers,

 Bernard
   
 



Re: [Ganglia-general] ganglia returns wrong value for python module

2009-08-20 Thread Brad Nicholes
 On 8/20/2009 at 9:01 AM, in message
68fea9390908200801m3e1f43ecy2c33e743ccc0d...@mail.gmail.com, Matt
mattmora...@gmail.com wrote:
 Hi all,
 
 I'm getting inconsistent results when gmond is running my python module
 
 # gmond --version
 gmond 3.1.2
 
 # ./lsof.py
 (0, '2699')
 2699
 2700
 1000
 
 # gmond -d1
 (0, '535')
 551
 552
 1000
 
 --- python code ---
 #!/usr/bin/python
 
 import commands
 
 def lsof_handler(name):
     cmd = 'lsof -p 3123 | wc -l'
     lsof = commands.getstatusoutput(cmd)[1]
     print commands.getstatusoutput(cmd)
     print lsof
     print int(lsof) + 1
     lsof = 1000
     print lsof
 
 Any ideas? i've got a similar function that uses
 commands.getstatusoutput() to grab the pid of a process which works
 fine.
 

I would suggest adding something like the following to your python module and 
then running it as a standalone script outside of gmond to see what your 
module is returning.  I haven't seen any problems like this with any other 
python module.


# This code is for debugging and unit testing
if __name__ == '__main__':

    # If your module expects configuration parameters, adjust
    # this line to match the expected parameters
    params = {'RandomMax': '500',
              'ConstantValue': '322'}
    # Manually call the metric_init function
    metric_init(params)

    # Manually call the callbacks that were defined in the
    # metric descriptors.  Then print the return values.
    for d in descriptors:
        v = d['call_back'](d['name'])
        print 'value for %s is %u' % (d['name'],  v)






Re: [Ganglia-general] special metric names in diskusage.pyconf file

2009-07-14 Thread Brad Nicholes
 On 7/14/2009 at 4:36 PM, in message
4120cbd6bbd82647b89d6a70694510bed1c...@exchange02.presidio.alexa.com, Guolin
Cheng guo...@alexa.com wrote:
 Hi,
 
  
 
  Any one knows what the metric name disk_used-metric-name stands
 for? The stanza is from diskusage.pyconf file, ganglia version 3.1.1/2.
 
  
 
 collection_group {
 
 ...
 
   metric {
 
 name = disk_used-metric-name
 
 ...
 
   }
 
 ...
 
  
 
  It looks like that the name stands for a series of metrics output from
 associated python module, but not sure what is the playing rule behind.
 Any one can shed a light into this? Thanks a lot.
 
  


It is a placeholder for the actual metric name, which isn't determined until 
you run gmond.  It seems like a chicken and egg thing, but what you have to do 
is run gmond with the -m parameter first.  This will give you a list of all of 
the possible metrics, including those that come from diskusage.py.  Then 
extract the actual diskusage metric names from the list and plug them into the 
.conf file.  In most cases you will have to create additional metric blocks in 
the .conf file for each diskusage metric.  Then start gmond normally and the 
individual disk usage metrics will be collected as expected.
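
The extract-and-plug-in step can be scripted.  This is only a sketch: it 
assumes each line of the gmond -m listing starts with the metric name followed 
by whitespace, and both the prefix argument and metric_blocks are made up for 
illustration, so check the names against what your gmond actually prints.

```python
def metric_blocks(gmond_m_output, prefix):
    """Build one metric block per listed metric whose name starts with prefix."""
    names = []
    for line in gmond_m_output.splitlines():
        fields = line.split()
        # Assumption: the first whitespace-separated field is the metric name.
        if fields and fields[0].startswith(prefix):
            names.append(fields[0])
    return "\n".join(
        '  metric {\n    name = "%s"\n  }' % name for name in names
    )


# Example use (not run here): save the listing with
#   gmond -m > metrics.txt
# then paste the result of
#   metric_blocks(open("metrics.txt").read(), "disk_used")
# into the collection_group of diskusage.pyconf.
```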

Brad




Re: [Ganglia-general] gmond returning XML with large negative TNvalues(ganglia 3.1.2, linux x86_64)

2009-07-13 Thread Brad Nicholes
 On 7/13/2009 at 1:06 AM, in message
d9c3f61a0907130006q5cdf7d8fg85ed8ea7f7ea3...@mail.gmail.com, Pavel Shevaev
pacha.shev...@gmail.com wrote:
 Hi folks,
 
 Looks like gmetad ignores reports from gmond returning records with
 large negative TN values.
 gmond started to behave like that after the computer was restarted.
 
 Here's a sample of gmond's output acquired with nc localhost 8649:
 
 <GANGLIA_XML VERSION="3.1.2" SOURCE="gmond">
 <CLUSTER NAME="host1" LOCALTIME="1247467796" OWNER="BIT"
 LATLONG="unspecified" URL="unspecified">
 <HOST NAME="localhost" IP="127.0.0.1" REPORTED="1247478928"
 TN="-11132" TMAX="20" DMAX="0" LOCATION="unspecified"
 GMOND_STARTED="1247478927">
 <METRIC NAME="tcp_closed" VAL="0" TYPE="uint32" UNITS="Sockets"
 TN="-11143" TMAX="20" DMAX="0" SLOPE="both">
 ...
 </METRIC>
 
 I believe these large negative TN values somehow make gmetad, gstat,
 etc  think the host is down. Here's what gstat says:
 
 CLUSTER INFORMATION
   Name: host1
   Hosts: 0
 Gexec Hosts: 0
  Dead Hosts: 1
   Localtime: Mon Jul 13 10:55:02 2009
 
 But gmond is definitely alive, here's some output from strace:
 
  $ sudo strace -p 15911
 Process 15911 attached - interrupt to quit
 epoll_wait(3, {{EPOLLIN, {u32=7117640, u64=7117640}}}, 2, 10627587) = 1
 accept(5, {sa_family=AF_INET, sin_port=htons(42589),
 sin_addr=inet_addr(192.168.4.10)}, [140733193388048]) = 7
 write(7, ?xml version=\1.0\ encoding=\ISO..., 2489) = 2489
 write(7, GANGLIA_XML VERSION=\3.1.2\ SOUR..., 45) = 45
 
 After restarting gmond everything becomes fine.
 
 Any ideas on what can be the reason of such a strange behavior?


The only thing that I know of that would cause this behavior is if the system 
clocks on your various nodes are out of sync.  TN reports the time stamp offset 
between the time that the metric was actually gathered and the time that it is 
being reported to gmetad.  If the system clock on the node that is gathering 
the metric is ahead of the system clock on the node that is reporting the 
metrics to gmetad, the calculation that determines the TN can go negative.  
Check to make sure that all of the system clocks on the nodes running gmond are 
in sync.
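
The numbers in the quoted XML fit that explanation.  Assuming TN is simply 
the answering node's LOCALTIME minus the host's REPORTED time stamp (which 
matches the values shown, though I have not checked the source for this), the 
arithmetic looks like this; the function names are made up.

```python
def tn_seconds(localtime, reported):
    """Seconds since the host last reported, as seen by the answering gmond."""
    return localtime - reported


def clock_skew_suspected(tn):
    """A report cannot come from the future, so a negative TN points at clocks."""
    return tn < 0


# Values taken from the CLUSTER and HOST elements quoted above.
tn = tn_seconds(localtime=1247467796, reported=1247478928)
print(tn, clock_skew_suspected(tn))
```

Here REPORTED is about three hours ahead of LOCALTIME, which is the kind of 
jump you would see if the clock was stepped after gmond had started.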

Brad




Re: [Ganglia-general] No rrd file being created for metrics

2009-06-30 Thread Brad Nicholes
You mentioned the udp_send_channel configuration but did you set up the 
udp_recv_channel?  Gmond has to be able to listen to itself as well as 
everybody else in order to collect the metrics that will be reported to gmetad.
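
A quick way to spot a missing block is to scan the gmond.conf on the 
collecting node.  A minimal sketch, assuming the usual block names 
(udp_send_channel, udp_recv_channel, tcp_accept_channel) and only '#' style 
comments; missing_channels is a made-up helper and it does not validate what 
is inside the blocks.

```python
import re


def missing_channels(conf_text):
    """Return the channel block names that never open in conf_text."""
    # Drop '#' comments so a commented-out block does not count.
    stripped = re.sub(r"#.*", "", conf_text)
    wanted = ("udp_send_channel", "udp_recv_channel", "tcp_accept_channel")
    return [name for name in wanted
            if not re.search(r"\b%s\s*\{" % name, stripped)]


# Example use (not run here):
#   print(missing_channels(open("/etc/gmond.conf").read()))
```

Run against the BoxA snippet quoted below, which only shows a 
udp_send_channel, this would flag udp_recv_channel as missing.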

Brad

 On 6/30/2009 at 12:46 AM, in message
93f24cb40b05984b8c87edfe5773374d3caa95b...@lonmc01010.rbsres07.net,
wayne.pas...@rbs.com wrote:
 All,
 
 Sorry to bump this, but does anyone have any ideas on a solution for this ? 
 
 Regards,
 
 Wayne Pascoe
 RBS Global Banking  Markets
 Office: +44 20 3361 9571   |  Mobile: +44 7799 707450 
  
 
 -Original Message-
 From: PASCOE, Wayne, GBM 
 Sent: 26 June 2009 11:24
 To: ganglia-general@lists.sourceforge.net 
 Subject: Re: [Ganglia-general] No rrd file being created for metrics
  
 Thanks for your reply - hopefully this will clarify the 
 missing bits of information.
  
  -Original Message-
 
  Your email is missing some relevant information.  gmond doesn't store 
  metrics into RRD files, it only makes the metric information available 
  via XML.  gmetad is the process that gets these metrics (from gmond
  nodes) and stores them in RRD files on disk.  You seem to be debugging 
  under the assumption that your problem is that the collector gmond 
  isn't getting the metrics from the other gmond, but you haven't given 
  us information to indicate that you've verified that.
  
 To clarify, I will call our regular collector 'BoxA' and the 
 host with the disk metrics 'WinBox'. BoxA is a Linux host 
 running Ganglia 3.0.7 and WinBox is a Windows host running 
 Ganglia 3.0.7.
  
 When I set up WinBox as my cluster collector node in gmetad, 
 all of the metrics that I expect to be present are there and 
 RRD files are created for them. 
 
 They are only missing if I configure WinBox to send to BoxA 
 and set up BoxA as my collector in gmetad.conf.
  
 Working config:
  
 WinBox gmond.conf
 udp_send_channel {
   host = WinBox
   port = 8649
 } 
  
 gmetad.conf
 data_source "My Cluster" WinBox.mydomain.com
 
 
 Failing config:
 
 WinBox gmond.conf
 udp_send_channel {
   host = BoxA
   port = 8649
 }
  
 BoxA gmond.conf
 udp_send_channel {
   host = BoxA
   port = 8649
 }
  
 gmetad.conf
 data_source "My Cluster" BoxA.mydomain.com
  
   
  So:
   - Can you look at the XML from your collector node (tcp port 8649)
 and check if it's actually missing the metrics?
 
 The metrics are there, and appear in gmetad when using WinBox 
 as my collector. When WinBox is configured to send to itself 
 (see working configuration above), telnetting to 8649 on the 
 server shows those metrics are present. 
 
 When I set BoxA to be the collector and configure WinBox to 
 send to that, the metrics are NOT present when I telnet to BoxA 8649. 
  
   - Is your gmetad polling the correct collector nodes?
 
 Yes, it is - in both scenarios, WinBox appears in the 
 collection in gmetad.
 
 At this point, it looks as if the data is missing when it is 
 sent to BoxA, the Linux collector. At this point, I cannot 
 see the metrics in the XML output any more, so it makes sense 
 that it never arrives at the gmetad host. 
  
 On this basis, where do I go to investigate this further ? 
  
 On BoxA, sbin/gmond -m does not include any of the metrics 
 that I am collecting on my Windows box and wish to appear in 
 my cluster - is this relevant ? Do these metrics have to be 
 supported by this gmond to be collected? If so, how can I 
 configure them, given that my earlier attempt to include them 
 caused gmond to fail with the following message:
  
 Unable to collect metric 'phys_disk_time' on this platform. Exiting.
  
 Thanks in advance for any assistance that anyone can give :) 
 
 --
 Wayne Pascoe
 RBS Global Banking & Markets
 Office: +44 20 3361 9571   |  Mobile: +44 7799 707450 
 

Re: [Ganglia-general] Gmond strange metric TN too large issue or multicast metric lost?

2009-06-24 Thread Brad Nicholes
The TN value simply indicates how much time has passed, relative to the
reported timestamp, since the metric was last received from the managed
node.  In other words, it is the age of the metric.  A large number would
indicate that the metric value has not been updated for a long period of
time.  This might be because the reporting interval has been set to a very
large time period, or it may have something to do with multicast packets
being lost in your network.  I would suggest that you run gmond in debug
mode (-d 10) on both the managed node and the collecting node and then
try to correlate when the reporting node sent one of your custom metrics
with when, or if, the collecting node received it.  The most obvious thing
that this would tell you is whether your multicast packets are being lost
or blocked by a router in your network.  If the collecting node is actually
receiving the packet in a timely manner but the TN value is still large,
then we would have to look at a possible bug in gmond.
   The fact that you seem to be losing only the python metrics seems to
indicate that this might be either a configuration error or a problem
with the metric definition of the custom python metric.  Do you have the
same problem with any of the standard shipping python metrics?
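
As a rough aid for spotting aged metrics, the sketch below parses gmond-style XML and lists every metric whose TN is larger than its TMAX.  Treating "TN greater than TMAX" as stale is a rule of thumb chosen for this example, not something gmond defines; the sample is trimmed from the XML posted in this thread, and the enclosing GANGLIA_XML and CLUSTER elements and the cluster name are filled in as assumptions.

```python
import xml.etree.ElementTree as ET

# Trimmed sample modelled on the XML posted in this thread.
SAMPLE_XML = """
<GANGLIA_XML VERSION="3.1.1" SOURCE="gmond">
<CLUSTER NAME="example">
<HOST NAME="B-002" IP="X.X.X.119" REPORTED="1245822864" TN="9" TMAX="20" DMAX="0">
<METRIC NAME="proc_run" VAL="0" TYPE="uint32" TN="45" TMAX="950" DMAX="0"/>
<METRIC NAME="load_five" VAL="1.13" TYPE="float" TN="7" TMAX="325" DMAX="0"/>
<METRIC NAME="WritesPerSec" VAL="0.00" TYPE="float" TN="112225" TMAX="60" DMAX="0"/>
<METRIC NAME="db_used" VAL="20233" TYPE="uint32" TN="112225" TMAX="60" DMAX="0"/>
</HOST>
</CLUSTER>
</GANGLIA_XML>
"""


def stale_metrics(xml_text):
    """Return (host, metric, tn, tmax) for every metric whose TN exceeds TMAX."""
    root = ET.fromstring(xml_text.strip())
    stale = []
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            tn = int(metric.get("TN"))
            tmax = int(metric.get("TMAX"))
            if tn > tmax:
                stale.append((host.get("NAME"), metric.get("NAME"), tn, tmax))
    return stale


for host, name, tn, tmax in stale_metrics(SAMPLE_XML):
    print("%s %s: TN=%d exceeds TMAX=%d" % (host, name, tn, tmax))
```

Against a live gmond, the XML text would come from the TCP port that the poster reached with telnet (8649 in this thread) instead of the embedded sample.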

Brad

 On 6/24/2009 at 9:39 AM, in message
bay140-w76585dd65f08f17a35d1bb3...@phx.gbl, liangfan
xfanli...@hotmail.com
wrote:

 I'm trying to figure out a very puzzling issue in our ganglia system.
 We are using ganglia 3.1.1.  The strange issue is that some metrics
 always have too large a TN.
  
 The system is configured as:
 -We have gmond deployed on 16 nodes (A-001--A008, B001--B008).
 -Gmond is configured to use multicast mode; each node has all metrics.
  
 The issues are:
 -TN on some nodes is ok, while others have errors.
 -The TN of some metrics on a host is too large, while other metrics of
  the same node are ok.
 We guess the kernel may drop these packets. You can see the detailed
 analysis at the end.
  
 I found a thread on the mailing list that may relate to this:
 http://www.mail-archive.com/ganglia-develop...@lists.sourceforge.net/msg02942.html
  
 MonAMI also has a page that might relate to this:
 http://monami.sourceforge.net/tutorial/ar01s06.html 
  
 In the section on preventing metric-update loss, it says that:
 The current Ganglia architecture requires each metric update be sent as an
 individual metric-update message. On a moderate-to-heavily loaded machine,
 there is a chance that gmond may not be scheduled to run as the messages
 arrive. If this happens, the incoming messages will be placed within the
 network buffer. Once the buffer is full, any subsequent metric-update
 messages will be lost. This places a limit on how many metric-update
 messages can be sent in one go. For 2.4-series Linux kernels the limit is
 around 220 metric-update messages; for 2.6-series kernels, the limit is
 around 400.
  
 However, we are still confused about the symptoms:
 -We do not see much buffered in the REV_Q of port 8649, and our nodes
  are not heavily loaded.
 -Why are all of the core metrics received and up to date, while almost
  all of the custom python metrics are lost and their TN gets too large?
 -Why do some nodes always get outdated custom python metrics, while
  other nodes are ok?
  
 I've been scratching my head over this for almost a week now; I’ve
 searched the ganglia mailing list archives, but cannot get more info.
  
 Any help/suggestions/advice would be very much appreciated -- it's
 really very frustrating!
  
 Below is the detailed analysis.
 -Detailed analysis
 Here is part of the XML output from A-001.
 telnet localhost 8649:
  
 <HOST NAME="B-002" IP="X.X.X.119" REPORTED="1245822864" TN="9" TMAX="20"
  DMAX="0" LOCATION="" GMOND_STARTED="1245710345">
 <METRIC NAME="proc_run" VAL="0" TYPE="uint32" UNITS=" " TN="45" TMAX="950"
  DMAX="0" SLOPE="both">
 <EXTRA_DATA>
 <EXTRA_ELEMENT NAME="GROUP" VAL="process"/>
 </EXTRA_DATA>
 </METRIC>
 <METRIC NAME="load_five" VAL="1.13" TYPE="float" UNITS=" " TN="7" TMAX="325"
  DMAX="0" SLOPE="both">
 <EXTRA_DATA>
 <EXTRA_ELEMENT NAME="GROUP" VAL="load"/>
 </EXTRA_DATA>
 </METRIC>
 
 <METRIC NAME="WritesPerSec" VAL="0.00" TYPE="float" UNITS="" TN="112225"
  TMAX="60" DMAX="0" SLOPE="both">
 <EXTRA_DATA>
 <EXTRA_ELEMENT NAME="GROUP" VAL="Status"/>
 </EXTRA_DATA>
 </METRIC>
 <METRIC NAME="db_used" VAL="20233" TYPE="uint32" UNITS="" TN="112225"
  TMAX="60" DMAX="0" SLOPE="both">
 <EXTRA_DATA>
 <EXTRA_ELEMENT NAME="GROUP" VAL="Status"/>
 </EXTRA_DATA>
 </METRIC>
 
 From the XML, we can see that gmond gets heartbeat info from B-002. The TN
 of all the core metrics collected by gmond (e.g. proc_run, load_five) is ok,
 while the TN of most metrics collected by our python module extension
 (e.g. WritesPerSec, db_used) is large (TN=112225).
  
 We used tcpdump on B-002 and found that B-002 sends out all the metrics to
 the multicast address (X.X.X.119--239.X.X.110:8649).
 On A-001, we found that A-001 receives all the multicast messages
 accordingly (we get the same X.X.X.119--239.X.X.110:8649 messages in
 tcpdump).
 This means the multicast messages reach A-001.
  
 Then we used strace to trace gmond and found:
 On B-002 gmond send out all 

Re: [Ganglia-general] metric_cleanup not being called in my python module

2009-05-27 Thread Brad Nicholes
 On 5/26/2009 at 7:22 PM, in message
dcccdf790905261822w63e3447crf21bef5390b...@mail.gmail.com, David Birdsong
david.birds...@gmail.com wrote:
 I've been searching around for awhile now, any suggestions on where I
 can get apr-debug ...or is it apr-util-debug?   I'm on Fedora 8, but
 source is fine.
 

I'm not sure where to find a built RPM.  I have always just built it myself.  
You want a debug version of APR.  Gmond doesn't use apr-util.

Brad



 On Tue, May 26, 2009 at 8:35 AM, Brad Nicholes bnicho...@novell.com wrote:
 On 5/24/2009 at 12:43 AM, in message
 dcccdf790905232343y76481e5dw6c1df62bc732c...@mail.gmail.com, David Birdsong
 david.birds...@gmail.com wrote:
 I have a python module that spawns a separate thread that collects
 data off of a pipe.

 Everything runs fine, but I'm finding that metric_cleanup is never
 called.  When I strace the PID of the worker thread (in Linux so it
 gets its own PID), I see it gets a SIGTERM when I stop gmond instead
 of exiting under its own power.  All of the gmond processes exit, but
 a subprocess of my worker thread just ends up being reparented to
 init instead of being cleaned up by my metric_cleanup logic.

 The worker thread reads from an endless pipe using a select.poll with
 a timeout, so the pipe shouldn't block.  I need to know to kill the
 process on the other end of the pipe which is what metric_cleanup
 should be providing.

 I even removed all cleanup code from metric_cleanup() and just put an
 open('/tmp/ganglia_kill', 'w'),...but no file is created.  What can I
 investigate to understand why it's being ignored?


 Gmond depends on the APR memory pools for invoking the cleanups.  Basically 
 the way it works is that when gmond starts up, it creates an APR memory pool. 
 This memory pool is used to allocate and manage memory for everything in 
 gmond that deals with APR.  One of the features of APR memory pools is that 
 you can tie functions to a pool that get invoked when the memory pool is 
 cleaned up.  In this case, the function setup_metric_callbacks() in gmond.c 
 ties all of the module cleanup functions to the main global memory pool.  
 When gmond exits, the last thing that happens is that the memory pools that 
 were created are destroyed.  This should trigger all of the cleanup 
 routines.  To debug this, you will need a debug version of APR and a 
 break point set in apr_terminate().  Also set a break point in 
 apr_pool_destroy().  These functions should be getting called automatically 
 when the gmond process shuts down.  A quick workaround would be to 
 explicitly call apr_pool_destroy (global_context) as the last statement in 
 main.c.  This will force the destruction of the global memory pool, which 
 should also cause the cleanup routines to be called.

 Brad








Re: [Ganglia-general] metric_cleanup not being called in my python module

2009-05-26 Thread Brad Nicholes
 On 5/24/2009 at 12:43 AM, in message
dcccdf790905232343y76481e5dw6c1df62bc732c...@mail.gmail.com, David Birdsong
david.birds...@gmail.com wrote:
 I have a python module that spawns a separate thread that collects
 data off of a pipe.
 
 Everything runs fine, but I'm finding that metric_cleanup is never
 called.  When I strace the PID of the worker thread (in Linux so it
 gets its own PID), I see it gets a SIGTERM when I stop gmond instead
 of exiting under its own power.  All of the gmond processes exit, but
 a subprocess of my worker thread just ends up being reparented to
 init instead of being cleaned up by my metric_cleanup logic.
 
 The worker thread reads from an endless pipe using a select.poll with
 a timeout, so the pipe shouldn't block.  I need to know to kill the
 process on the other end of the pipe which is what metric_cleanup
 should be providing.
 
 I even removed all cleanup code from metric_cleanup() and just put an
 open('/tmp/ganglia_kill', 'w'),...but no file is created.  What can I
 investigate to understand why it's being ignored?
 

Gmond depends on the APR memory pools for invoking the cleanups.  Basically the 
way it works is that when gmond starts up, it creates an APR memory pool.  This 
memory pool is used to allocate and manage memory for everything in gmond that 
deals with APR.  One of the features of APR memory pools is that you can tie 
functions to a pool that get invoked when the memory pool is cleaned up.  In 
this case, the function setup_metric_callbacks() in gmond.c ties all of the 
module cleanup functions to the main global memory pool.  When gmond exits, 
the last thing that happens is that the memory pools that were created are 
destroyed.  This should trigger all of the cleanup routines.  To debug this, 
you will need a debug version of APR and a break point set in 
apr_terminate().  Also set a break point in apr_pool_destroy().  These 
functions should be getting called automatically when the gmond process shuts 
down.  A quick workaround would be to explicitly call apr_pool_destroy 
(global_context) as the last statement in main.c.  This will force the 
destruction of the global memory pool, which should also cause the cleanup 
routines to be called.
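
The pool-cleanup mechanism described above can be pictured with a small Python model.  This is only an analogy: Pool, register_cleanup and destroy are made-up names for this sketch, not the APR API, and the callbacks stand in for the module cleanup functions.

```python
class Pool(object):
    """Toy stand-in for a memory pool that owns a list of cleanup callbacks."""

    def __init__(self):
        self._cleanups = []

    def register_cleanup(self, func):
        # Tie a function to the pool; it only runs when destroy() is called.
        self._cleanups.append(func)

    def destroy(self):
        # Run the registered cleanups, most recently registered first
        # (an ordering chosen for this toy model).
        while self._cleanups:
            self._cleanups.pop()()


calls = []
global_pool = Pool()

# Analogous to tying each module's metric_cleanup to the global pool.
global_pool.register_cleanup(lambda: calls.append("module_a cleanup"))
global_pool.register_cleanup(lambda: calls.append("module_b cleanup"))

# Registering alone runs nothing.  If the process dies before destroy()
# is reached, the cleanups are never invoked.
print(calls)
global_pool.destroy()
print(calls)
```

The point of the model is that registration and invocation are separate steps: a cleanup that is never seen running suggests the pool destruction path is not being reached at shutdown.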

Brad 




Re: [Ganglia-general] CVE-2009-0241

2009-03-10 Thread Brad Nicholes
 On 3/10/2009 at 1:14 PM, in message m3fxhlchqn@unna.nsc.liu.se, Leif
Nixon ni...@nsc.liu.se wrote:
 Linkoping University

The issue has been there for a while.  See the associated bug report.  Also, 
since it is an issue with the interactive port, the attacker would have to have 
access to that port.  In most cases the port should have been protected by 
either the trusted_hosts configuration or the fact that gmetad should be 
running behind a firewall which would prevent external access.  Since only the 
web front end (i.e. apache) and/or a downstream gmetad need access to the 
interactive port, there should really be no reason for this port to be exposed 
to anything else, either inside or outside a firewall.

Brad






[Ganglia-general] [ANNOUNCEMENT] - Release Ganglia 3.1.2

2009-02-17 Thread Brad Nicholes
The Ganglia Project (http://ganglia.info) is pleased to announce the
official release of Ganglia 3.1.2.  The official tarball is available for 
immediate download at:

http://sourceforge.net/project/showfiles.php?group_id=43021&package_id=35280&release_id=661845
 

For a full description of the bug fixes and enhancements that are included
in the 3.1.2 release as well as upgrade information, please see the current 
release notes at:

http://ganglia.wiki.sourceforge.net/ganglia_release_notes 


Supported platforms:

  * Linux (Fedora/RedHat/CentOS, Debian, Gentoo, SuSE/OpenSuSE)
  * [Open]Solaris
  * FreeBSD
  * NetBSD
  * OpenBSD
  * DragonflyBSD
  * Cygwin (no support for DSO yet)
  * AIX (no support for DSO yet)

Please read all the README, INSTALL and other available documentation 
(http://ganglia.wiki.sourceforge.net) as a lot of things have changed since 
version 3.0. Use good deployment practices when upgrading from 3.0.x to make 
sure that you do not mix gmond 3.0 and 3.1 nodes in the same cluster (as
defined by a multicast address or unicast collector node).  The protocol that
allows gmond nodes to communicate within the same cluster has changed.
However, the XML packets that are passed between gmond and gmetad have
remained compatible from 3.0.x to 3.1.x, allowing a 3.1.x gmetad to continue
to pull data from an older 3.0.x gmond cluster.
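
As a planning aid for the "do not mix" rule above, the following sketch flags clusters whose gmond nodes span more than one major.minor series.  The inventory structure, cluster names and host names are hypothetical; how the versions are gathered is left to the site.

```python
def mixed_clusters(inventory):
    """Return clusters whose gmond nodes span more than one major.minor series.

    `inventory` maps cluster name -> {host name: gmond version string}.
    """
    mixed = {}
    for cluster, hosts in inventory.items():
        series = set(".".join(v.split(".")[:2]) for v in hosts.values())
        if len(series) > 1:
            mixed[cluster] = sorted(series)
    return mixed


# Hypothetical inventory for illustration only.
inventory = {
    "web": {"web01": "3.0.7", "web02": "3.1.2"},  # mixes 3.0 and 3.1
    "db": {"db01": "3.1.2", "db02": "3.1.1"},     # all in the 3.1 series
}

print(mixed_clusters(inventory))
```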

Ganglia Development Team







Re: [Ganglia-general] [ANNOUNCEMENT] Ganglia 3.1.2 testing tarball...

2009-02-09 Thread Brad Nicholes
We are halfway through the testing period and I haven't heard any feedback on 
the list about how testing is going with 3.1.2.  So the natural assumptions are 
that either nobody is testing the latest version or that testing is going so 
well that there is really nothing to report.  I am hoping that it is the 
latter. :)  If anybody has anything to report (good or bad), please send a 
quick email to the list.

thanks,
Brad 

 On 1/30/2009 at 8:18 AM, in message
4982b7ef02ac0003a...@lucius.provo.novell.com, Brad Nicholes
bnicho...@novell.com wrote:
 In an effort to continue improving the Ganglia software, the Ganglia Project
 has released an official testing release of Ganglia 3.1.2.  The testing
 tarball is available for immediate download at:
 
 http://www.ganglia.info/testing/ 
 
 The intent of this testing release of Ganglia 3.1.2 is to validate that 
 the source code is stable and that the bug fixes and enhancements that have
 been added since the previous release of the software are ready for general
 release.  The release procedure from this point has been documented on the 
 Ganglia wiki site at http://ganglia.wiki.sourceforge.net/ganglia_works under
 the heading "Generating a Release Candidate and GA Release".
 
 Basically the Ganglia 3.1.2 testing tarball has been rolled and made
 available for testing by the Ganglia community.  All bugs found in this
 testing release should be immediately reported through bugzilla
 (http://bugzilla.ganglia.info) and can be posted to the
 ganglia-develop...@lists.sourceforge.net mailing list as well.  If the bug
 report is also accompanied by a bug fix patch, this will help avoid delays
 in producing new testing tarballs and ultimately an official general
 release of the software.  If any critical level bugs are discovered, the
 current testing release tarball will be thrown away and a new tarball will
 be rolled and made available for further testing.  Once a testing release
 tarball has been validated by the Ganglia community to be stable and ready
 for general availability, that tarball will become the official Ganglia
 3.1.x release.  So basically the sooner we are able to test and validate
 the Ganglia 3.1 source code, the sooner the project will be able to create
 an official release.  But we need your help to get this done.  Any and all
 testing and feedback, positive or negative, will be greatly appreciated.
 
 There will be a two week testing period for this 3.1.2 tarball which begins
 from the date of this announcement.  So please help us to make sure that
 the tarball is valid and stable by building and installing it on any size
 of testing environment.
 
 Known issues with this testing release will be addressed on the Ganglia wiki
 site at:
 
 http://ganglia.wiki.sourceforge.net/Testing_Release_Notes 
 
 
 For those who are interested in upgrading from a current 3.0.x installation,
 please see the current release notes at:
 
 http://ganglia.wiki.sourceforge.net/ganglia_release_notes 
 
 Supported platforms (additional testing requested):
 
   * Linux (Fedora/RedHat/CentOS, Debian, Gentoo, SuSE/OpenSuSE)
   * [Open]Solaris
   * FreeBSD
   * NetBSD
   * OpenBSD
   * DragonflyBSD
   * Cygwin (no support for DSO yet)
   * AIX (no support for DSO yet)
 
 Please read all the README, INSTALL and other available documentation 
 (http://ganglia.wiki.sourceforge.net) as a lot of things have changed since 
 3.0.7. Use good deployment practices when upgrading from 3.0.x to make sure 
 that you do not mix gmond 3.0 and 3.1 nodes in the same cluster (as defined
 by a multicast address or unicast collector node).  The protocol that allows
 gmond nodes to communicate within the same cluster has changed.  However,
 the XML packets that are passed between gmond and gmetad have remained
 compatible from 3.0.x to 3.1.x, allowing a 3.0.x gmetad to continue to pull
 data from a newer 3.1.x gmond cluster.
 
 happy testing
 
 
 
 
 




Re: [Ganglia-general] ganglia upgrade

2009-01-20 Thread Brad Nicholes
 On 1/20/2009 at 10:48 AM, in message
32128a4489900844a3dba3e8273be22e1348965...@in01wxmbx1.internal.synopsys.com,
Hardik Shah hardik.s...@synopsys.com wrote:
 Hi,
 
 Does anyone have any information on upgrading a ganglia cluster?
 I have configured around 200 machines with ganglia 3.0.7, but now I want to 
 upgrade the cluster to version 3.1, which has more features.
 
 Any suggestion would be appreciated.
 
 -Hardik

Check out the release notes at:

http://ganglia.wiki.sourceforge.net/ganglia_release_notes




Re: [Ganglia-general] custom metric's value doesn't update --custom python metric modules on 3.1.1

2008-12-12 Thread Brad Nicholes
  I haven't tried to actually run your module yet, but could this be a 
permissions problem?  What user are you running gmond as?  Does that user have 
permissions to run rndc and access named.stats?  All modules run by gmond will 
be run as the same user as gmond.  Therefore you have to make sure that the 
user that gmond is running as has sufficient permissions to perform any work 
being done by the modules.
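
One way to test this is to run a small check as the same user that gmond runs as.  The sketch below is a generic access check, not part of Ganglia; the helper name is made up, and the two paths are the ones used by the module quoted below.

```python
import os


def check_access(executables=(), readable=()):
    """List the paths the current user cannot execute or read."""
    problems = []
    for path in executables:
        if not os.access(path, os.X_OK):
            problems.append("cannot execute %s" % path)
    for path in readable:
        if not os.access(path, os.R_OK):
            problems.append("cannot read %s" % path)
    return problems


# Paths taken from the module quoted below; run this as the gmond user.
for problem in check_access(["/usr/sbin/rndc"], ["/var/named/named.stats"]):
    print(problem)
```

No output means both checks passed for the user running the script.  Note that rndc itself may still refuse the request (for example if its key file is unreadable), which this check does not cover.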

Brad


 On 12/12/2008 at 4:24 AM, in message
dcccdf790812120324j79fbd1e1y14ebf50f6313f...@mail.gmail.com, David Birdsong
david.birds...@gmail.com wrote:
 I'm having the exact same problem.  I can run the if __name__ ==
 '__main__' test and have the script define and execute the callbacks
 via the descriptor list, but when running from gmond, all that gets
 called is metric_init().
 
 I put an open in the init and set the open filename to global as a
 test because I thought maybe I couldn't rely on STDOUT from inside the
 handler.  Only metric_init writes to it, despite my putting a file
 write in the handler.
 
 Any clue on why gmond is initializing but not calling the handler
 in the module?
 
 Here's my pyconf:
 modules {
   module {
 name = dns
 language = python
   }
 }
 
 collection_group {
   collect_every = 20
   time_threshold = 45
   metric {
 name = zDNS_success
 title = Temperature
 value_threshold = 10
   }
 
   metric {
 name = zDNS_nxrrset
 title = Temperature
 value_threshold = 10
   }
 
   metric {
 name = zDNS_referral
 title = Temperature
 value_threshold = 10
   }
 
   metric {
 name = zDNS_nxdomain
 title = Temperature
 value_threshold = 10
   }
 
   metric {
 name = zDNS_recursion
 title = Temperature
 value_threshold = 10
   }
 
   metric {
 name = zDNS_failure
 title = Temperature
 value_threshold = 10
   }
 
   metric {
 name = zDNS_duplicate
 title = Temperature
 value_threshold = 10
   }
 
   metric {
 name = zDNS_dropped
 title = Temperature
 value_threshold = 10
   }
 
 
 }
 
 
 ## Here's my python module:
 #!/usr/bin/python
 
 
 import re
 import sys
 import time
 import subprocess
 
 NamePrefix = 'zDNS'
 LastRecord = {}
 MaxAge  = 30
 descriptors = []
 
 out_file = ''
 
 def return_collection_data(collection_string):
   global NamePrefix
   collection_date_re = re.compile(r'^.*\(([0-9]+)\)$')
   collection = filter(lambda x: x, collection_string.split('\n'))
   collection_date = collection_date_re.search(collection.pop()).group(1)
 
   collection_map = {}
   for record in collection:
     name, value = record.split()
     name = '%s_%s' % (NamePrefix, name)
     collection_map[name] = int(value)
 
   return (int(collection_date), collection_map)
 
 def return_latest_record():
 
   collections_splitter_re = re.compile(
     r'^\+\+\+\sStat.*Dump\s\+\+\+\s\([0-9]+\)$', re.MULTILINE)
   cmd = ['/usr/sbin/rndc', 'stats']
   x = subprocess.Popen(cmd)
   r_code = x.wait()
   if r_code != 0 :
     print >> sys.stderr, 'bad things happened calling %s ' % cmd
     print >> sys.stderr, 'need to throw and catch an exception'
 
   f = '/var/named/named.stats'
   collections = collections_splitter_re.split(open(f).read())
   collections = map(lambda x: x.strip(), collections)
   foo = {}
   for collection in collections:
     collection = collection.strip()
     if not collection: continue
     collection_date, collection_data = return_collection_data(collection)
     foo[collection_date] = collection_data
   k = foo.keys()
   k.sort()
   return (time.time(), foo[k.pop()])
 
 def named_stats_handler(value):
   global LastRecord
   global MaxAge
   global out_file
 
   out_file.write('in handler name %s  \n' % value)
   timestamp, data = LastRecord
   if time.time() - timestamp > MaxAge:
     timestamp, data = return_latest_record()
   LastRecord = (timestamp, data)
   out_file.write('Current Record %s\n' % LastRecord.__str__())
   out_file.flush()
   return data[value]
 
 def metric_init(params):
   global descriptors
   global NamePrefix
   global LastRecord
   global out_file
 
   out_file = open('/tmp/ganglia_david', 'w')
   LastRecord = return_latest_record()
   out_file.write('LastRecord %s\n' % LastRecord.__str__())
   out_file.flush()
   descriptors = [
 
   { 'name' : '%s_success' % NamePrefix,
  'call_back': named_stats_handler,
  'time_max': 60,
  'value_type': 'uint32',
  'units': 'ticks',
  'slope': 'positive',
  'format': '%u',
  'description': 'Successful Queries',
  'groups': NamePrefix
 },
 { 'name' : '%s_nxrrset' % NamePrefix,
  'call_back': named_stats_handler,
  'time_max': 60,
  'value_type': 'uint32',
  'units': 'ticks',
  'slope': 'positive',
  'format': '%u',
  'description': 'Dunno',
  'groups': NamePrefix
 },
 { 'name' : '%s_referral' % NamePrefix,
  'call_back': named_stats_handler,

Re: [Ganglia-general] Monitor Apache

2008-12-11 Thread Brad Nicholes
 On 12/11/2008 at 11:33 AM, in message 49415cf1.1010...@greenberg.org, Ed
Greenberg e...@greenberg.org wrote:
 Michael Henderson wrote:
 Hello all,

 Is there a way to monitor apache through ganglia?

 Thanks,

 ~Mike


 
 I'm interested in seeing what others say but... 
 
 I rolled my own as follows:
 
 1. I turned on server-status and ExtendedStatus
 
 2. I wrote a script to retrieve http://localhost/server-status?auto   
 which returns this:
 Total Accesses: 123048
 Total kBytes: 1445888
 CPULoad: .270592
 Uptime: 20688
 ReqPerSec: 5.9478
 BytesPerSec: 71567.5
 BytesPerReq: 12032.6
 BusyWorkers: 4
 IdleWorkers: 71
 Scoreboard: 
 
 I wrote a script to use gmetric to pump the values above (those I want) 
 into ganglia.  (Available on request.)
 

We did the same thing, except that rather than use gmetric, we implemented it 
as a python module.  It was very simple to do: it is basically just a command 
line call to get the apache status output, followed by parsing the output into 
the various metrics.
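
A minimal sketch of that approach is below.  It is not the module we use: the function names, metric choices and group name are made up, the descriptor keys follow the custom module quoted elsewhere in this archive, and the status text is the sample from Ed's message instead of a live fetch so that the sketch runs offline.

```python
# Sample server-status?auto output from this thread.
SAMPLE_STATUS = """\
Total Accesses: 123048
Total kBytes: 1445888
CPULoad: .270592
Uptime: 20688
ReqPerSec: 5.9478
BytesPerSec: 71567.5
BytesPerReq: 12032.6
BusyWorkers: 4
IdleWorkers: 71
Scoreboard: 
"""


def parse_status(text):
    """Turn 'Key: value' lines into a dict of floats, skipping non-numeric ones."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        try:
            stats[key.strip()] = float(value)
        except ValueError:
            pass  # e.g. the Scoreboard line, which is not a number
    return stats


def status_handler(name):
    # A real module would fetch fresh status output here on each call.
    return parse_status(SAMPLE_STATUS)[name]


def metric_init(params):
    """Return one descriptor per metric, in the style gmond python modules use."""
    names = ["BusyWorkers", "IdleWorkers"]
    return [{"name": n,
             "call_back": status_handler,
             "time_max": 60,
             "value_type": "float",
             "units": "workers",
             "slope": "both",
             "format": "%f",
             "description": "Apache %s" % n,
             "groups": "apache"} for n in names]


if __name__ == "__main__":
    for d in metric_init({}):
        print(d["name"], d["call_back"](d["name"]))
```

In a deployed module the handler would retrieve http://localhost/server-status?auto (the URL from Ed's message) and could cache the result briefly so that several metrics share one request.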

Brad




[Ganglia-general] Spoofing functionality in 3.1.x branch...

2008-12-04 Thread Brad Nicholes

For those that are interested in the module based spoofing feature, all of the 
functionality should be complete and has been backported to the 3.1.x branch.  
I have also added some spoofing module examples to trunk that can be downloaded 
from monitor-core/gmond/python_modules/example/spfexample.py in the trunk 
repository.  There  is also a small .pyconf file in 
monitor-core/gmond/python_modules/conf.d/spfexample.pyconf.  This example 
module should give you enough guidance so that you can build your own spoofing 
module.  Please let me know if anything is missing.

Brad




[Ganglia-general] Testing BETA 3.1.x available...

2008-12-04 Thread Brad Nicholes
   There is a new BETA tarball and RPMs on the Ganglia testing site 
(http://www.ganglia.info/testing/).  The following includes a list of 
enhancements and bug fixes that are currently available in this testing BETA 
release.  

  * gmond/gmetad: Sync-up the default values for the cluster section of 
 gmond with the default gmond.conf so that a cluster name will always 
 be present. The gmetad code can not handle a host with no associated 
 cluster, therefore the gmond code must always include a cluster XML tag.  
 Bug #200
  * gmond: Add an 'enabled' directive to the module section so that a 
 module can easily be enabled or disabled through the configuration 
 file  
  * gmond: reformat memory metrics to match pre 3.1 style 
  * gmond: -r support for transforming 2.5 configurations 
  * gmond: add boolean option to 'allow_extra_data' generation (BUG199)
  * gmond: include localhost in translated (-r) trusted_hosts from 2.5
  * gmetad: skip unresponsive sources (BUG92)
  * libganglia: mcast_if support in gmond (BUG140)
  * web: add boolean option for using hostname without domainname for graphs
  * web: add host attributes into metric list (BUG30)
  * web: metric group enhancements for host view (BUG203)
  * web: add option for configurable number of columns in cluster view (BUG194)
  * web: make number of metric columns in host view configurable (BUG194)
  * Allow both a C and python module to create a metric that will spoof a
specific host.  This provides the same spoofing functionality as gmetric
but through a metric module.  It is done by adding SPOOF_HOST and
SPOOF_NAME as extra metadata to the metric description
  * gmond: mod_python support for versions older than 2.3 or newer than 2.4
  * mod_python: Change the way that the python module path is added to better
 support the Solaris platform.  It is also a cleaner way to add the
 python path programmatically rather than altering the PYTHONPATH
 environment variable.
  * gmetric: Support the short commandline parameter format when spoofing
 a heartbeat metric.
  * Bug fixes and Enhancements

This is just a testing BETA release that is not yet ready for production use.  
Please test this code and let us know if we have missed or broken anything.

Thanks,
Brad






Re: [Ganglia-general] nonresponsive gmond

2008-12-01 Thread Brad Nicholes
 On 11/29/2008 at 11:54 AM, in message [EMAIL PROTECTED],
Kostas Georgiou [EMAIL PROTECTED] wrote:
 On Tue, Nov 04, 2008 at 10:02:32AM -0700, Brad Nicholes wrote:
 
  On 11/3/2008 at  5:27 PM, in message [EMAIL PROTECTED],
 Kostas Georgiou [EMAIL PROTECTED] wrote: 
  On Mon, Nov 03, 2008 at 11:46:52PM +, Kostas Georgiou wrote:
  
  On Mon, Nov 03, 2008 at 01:55:22PM -0700, Brad Nicholes wrote:
   
   If a timeout is set, then is the resulting XML output still good or did
   we lose something because of the timeout?
  
  No, it seems to be working fine. I am testing with:
  
  Actually I was wrong there was enough data in the socket buffers to
  confuse me. The xml output is truncated in the slow reader :(
  
 
 Attached is a patch against trunk which implements a lingering close.
 I am not sure if this will solve the problem but Apache does a similar
 thing to make sure that both sides get a chance to complete the
 conversation before closing the socket.  Apply this patch, let it run
 for a while and let's see if this solves the problem.
 
 I just got a non responsive gmond and after looking at the network
 traces it seems that:
 
 gmond tries to write to what it thinks is a still alive connection
 so it is blocked there.
 On the gmetad side there is no such connection so the firewall replies
 with ICMP host foo unreachable - admin prohibited. Unfortunately this
 doesn't cause the connection to be dropped on the gmond side (will
 anything else than RST work at this point?) and gmond keeps trying..
 
 At this point it's too late to tell why the connection wasn't closed
 properly (was the FIN packet lost somehow?) but using a short keepalive
 setting in the gmond side can not hurt and will help in cases like this
 one.
 

This is the real question.  Who initiated the close and why?  This would be 
much easier to debug if we could somehow figure out how to reproduce the 
problem reliably.  Since I am not able to reproduce this problem, I am 
wondering if it might have something to do with the OS or version of the socket 
library the OS is using.  We have heard of this happening on CentOS and 
Solaris.  Is there anything in common about the socket libraries between these 
two OSs?  I guess the band aid would be to put in a timeout and abort on 
gmond's write function.
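
The short keepalive and the write timeout mentioned above could look roughly like this.  This is a minimal sketch in Python rather than gmond's C code; the helper names and the timing values are illustrative assumptions, and the TCP_KEEP* options are only applied where the platform exposes them.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=3):
    """Turn on TCP keepalive so a dead peer is eventually detected."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Fine-grained timers are platform specific; skip the ones that are missing.
    for opt_name, value in (("TCP_KEEPIDLE", idle),
                            ("TCP_KEEPINTVL", interval),
                            ("TCP_KEEPCNT", count)):
        opt = getattr(socket, opt_name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


def send_with_timeout(sock, payload, timeout):
    """Write payload, giving up instead of blocking forever.

    Returns True when everything was handed to the kernel, False when the
    write timed out or the connection failed.
    """
    sock.settimeout(timeout)
    try:
        sock.sendall(payload)
        return True
    except OSError:  # socket.timeout is a subclass of OSError
        return False


if __name__ == "__main__":
    a, b = socket.socketpair()
    print(send_with_timeout(a, b"<GANGLIA_XML/>", 1.0))
    a.close()
    b.close()
```

Either measure only limits how long a writer can stay stuck; neither one explains why the close was missed in the first place.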

Brad






Re: [Ganglia-general] gmetric fails when disk is unwriteable?

2008-11-26 Thread Brad Nicholes
 On 11/26/2008 at 3:45 AM, in message
[EMAIL PROTECTED], Martin Knoblauch
[EMAIL PROTECTED] wrote:
 - Original Message 
 
 From: Brad Nicholes [EMAIL PROTECTED]
 To: Ofer Inbar [EMAIL PROTECTED]
 Cc: ganglia-general@lists.sourceforge.net 
 Sent: Tuesday, November 25, 2008 8:43:08 PM
 Subject: Re: [Ganglia-general] gmetric fails when disk is unwriteable?
 
  On 11/25/2008 at 10:14 AM, in message 
 [EMAIL PROTECTED],
 Ofer Inbar wrote:
  Brad Nicholes wrote:
  It needs a temp directory to get around some issues with libconfuse.
  Libconfuse doesn't actually support wildcard paths or files.  A
  libconfuse include statement must have a full path to the file that
  it is going to include.  So gmond makes up for this problem by creating
  a temp file, resolving all of the file paths and names and then
  writing them as separate includes in a temp file.  Then it tells
  libconfuse to include the temp file directly.  Without the ability
  to resolve the wildcard paths and write them to a temp file, the
  wildcarding feature of gmond wouldn't work.  To solve the problem
  that you are describing, we would have to actually add wildcard
  capability to libconfuse.
  
  Might this be a cleaner workaround that would work for gmond as well?
  
   - override libconfuse's include function as you're already doing
   - resolve file paths and names as you're already doing
   - instead of writing that to a temp file and telling libconfuse to
 include that file, just tell libconfuse to include each individual
 file (the same filenames you're now writing to the temp file)
  
 
 No, libconfuse doesn't work that way.  The include handler can only
 manipulate the file path that it is handed.  So the result of the handler
 has to be a single absolute file path.  There isn't any way to take a
 single file path as input into the handler and return multiple file paths
 back to libconfuse.  The only way to do it was to write all of the
 individual file paths to a file and then hand libconfuse back a single
 file path to the new include file.
 
 
  the question is: can't the handler be rewritten to do the conversions in 
 memory, without needing to write a temp file? This would make the process 
 more robust. You never know when a disk is full, or goes RO.
 

No, I tried doing that already but was unsuccessful.  Libconfuse is limited in 
what you can do in this area.  The problem is that when libconfuse wants to 
read in the include file, it is in the middle of the lexer and needs to 
continue.  A handler can't just read the file and hand it back to libconfuse 
through some other cfg_* call.  This may be a design flaw in libconfuse but it 
is the way it works now and we have to live with it. 

Brad




Re: [Ganglia-general] gmetric fails when disk is unwriteable?

2008-11-26 Thread Brad Nicholes
 On 11/26/2008 at 1:17 AM, in message [EMAIL PROTECTED],
Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote:
 On Tue, Nov 25, 2008 at 04:33:05PM -0700, Brad Nicholes wrote:
 
 The result was that if the wildcard produced more than 10 included files
 (which it easily does even in our default configuration), libconfuse
 choked because it thought it had hit the maximum nesting level
 
 our RPMs for ganglia only install 3 files in /etc/ganglia/conf.d; gentoo
 has 2 and fedora 10 (just released) has 4.
 
 even if I agree that 10 is somehow low and you would expect that as more
 modules are deployed it will be soon problematic, it would seem that at
 least in this case, one problem was traded for another.
 

The fact is that 10 is low which is why I discovered that last year when I 
implemented the wildcard path support.  In our case we routinely run with 20+ 
modules and configure them using separately included .conf files so that each 
one can be easily turned on or off by simply renaming the included .conf file.  
This is a very valuable feature which isn't unique to ganglia.  Limiting this 
very useful feature now in gmond on the remote chance that a file system might 
go read only and cause an issue for gmetric, isn't a very good trade off.  It 
isn't that one problem was traded for another.  

At the time when I implemented the code to support wildcard paths, nobody knew 
anything about gmetric not being able to run in a read only file system.  There 
was no trade off being made.  The fact is that whether or not gmetric is able 
to run in a read only file system is a much smaller issue than allowing gmond 
or gmetric to run in an undetermined state because the code allows parts of the 
configuration to be ignored.  Introducing a patch that knowingly ignores parts 
of the configuration due to errors in the file system is unacceptable behavior. 
 The bug that this kind of patch introduces is much larger than an issue with 
gmetric not being able to run in a read only environment.  Gmond being able to 
resolve wildcard paths is a standard feature and behavior that is used every 
day, gmetric being able to run in a read only file system is not.  The real 
issue is why did the disk go read only.  There are plenty of gmond metrics that 
provide the administrator with warnings and a metric module that indicates when 
a file system has gone read only is extremely easy to write.   

A more acceptable solution to the gmetric problem is to provide gmetric with 
its own .conf file that contains just the socket and port information rather 
than pointing gmetric at gmond.conf.  In this case both gmond and gmetric will 
continue to run even in a read only file system.  This solution can be easily 
implemented today without any code changes and especially without a code patch 
that introduces a much more serious bug.  If we need to solve the gmetric being 
able to run in a read only file system, then we need to come up with a better 
patch.  Crippling gmond and gmetric with a patch that allows them to ignore a 
fatal error because parts of the configuration were skipped is not an 
acceptable patch.
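
As a sketch of the separate .conf approach: the file below contains only a send channel, and gmetric is pointed at it with its --conf option.  The path /etc/ganglia/gmetric.conf, the helper name, and the channel values (the stock multicast address and port) are assumptions; use whatever channel your gmond.conf actually defines, and create the file ahead of time while the disk is still writable.

```python
# Contents for a hypothetical /etc/ganglia/gmetric.conf: no include()
# statement, so no temp directory is needed to parse it.
MINIMAL_CONF = """\
udp_send_channel {
  mcast_join = 239.2.11.71
  port = 8649
}
"""


def gmetric_command(conf_path, name, value, value_type, units):
    """Build the gmetric argument list that reads conf_path, not gmond.conf."""
    return [
        "gmetric",
        "--conf", conf_path,
        "--name", name,
        "--value", str(value),
        "--type", value_type,
        "--units", units,
    ]


if __name__ == "__main__":
    print(MINIMAL_CONF)
    print(" ".join(gmetric_command("/etc/ganglia/gmetric.conf",
                                   "net_smtp_fin_wait2_out", 0,
                                   "uint8", "connections")))
```

The returned list can be passed to subprocess.run() by the program that feeds the custom metrics.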

Brad




Re: [Ganglia-general] gmetric fails when disk is unwriteable?

2008-11-25 Thread Brad Nicholes
 On 11/25/2008 at 1:08 AM, in message [EMAIL PROTECTED],
Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote:
 On Mon, Nov 24, 2008 at 04:55:42PM -0700, Brad Nicholes wrote:
  On 11/24/2008 at 3:47 PM, in message 
 [EMAIL PROTECTED],
 Ofer Inbar [EMAIL PROTECTED] wrote:
   I tried feeding one of my custom metrics by hand:
   [root ~]$ gmetric --name net_smtp_fin_wait2_out --value 0 --type uint8 --units 'connections'
   /etc/ganglia/gmond.conf:94: failed to determine the temp dir
   Parse error for '/etc/ganglia/gmond.conf'
 
 It needs a temp directory to get around some issues with libconfuse.
 
 gmond does; gmetric doesn't need anything more than to know which
 channel to use (hence nothing in the includes) and it is getting
 blocked by this restriction because of its use of libganglia to
 read gmond's configuration through libgmond.
 

Anything can be included from the main gmond.conf file.  There is nothing that 
says that a user can't put socket and channel information in a separate file 
and then include it from gmond.conf.  So making the assumption that gmetric 
doesn't need includes is false.  If this is a real problem for users, then 
gmetric should be using a different .conf file that only contains the socket 
information rather than using the same gmond.conf file that contains all of the 
metric information and includes.  Also, gmond and gmetric both use the 
same code path for resolving the configuration, so if the code is changed to 
ignore configuration failures for gmetric, it is also changed to ignore 
configuration failures for gmond.  This isn't a good thing.  This problem 
doesn't require a code change to be resolved.  Simple documentation for gmetric 
would solve the problem.

 To solve the problem that you are describing, we would have to actually
 add wildcard capability to libconfuse.
 
 libconfuse is instructed to use our implementation for includes and that
 uses a temporary file, so this is fixable in our code.
 
 a fix to the problem reported by Ofer only needs our handler modified
 so that failures to create temporary files to handle includes are not
 treated as fatal as Committed revision 1922
 

No, libconfuse doesn't work that way.  The include handler only allows gmond to 
manipulate the input into a form that libconfuse can handle.  In this case the 
input is a single wildcard file path that needs to be translated into a single 
absolute file path.  libconfuse cannot handle wildcard paths.  Also 
libconfuse only knows how to get its input from a file.  The gmond include 
handler is only manipulating the wildcard path into an absolute path to a file 
that contains all of the resolved paths.  At that point libconfuse is able to 
read and process all of the included files through absolute paths.  The include 
handler cannot simply translate a single wildcard path into multiple absolute 
paths and then hand them back to libconfuse in memory.  
These include paths have to be written to a file first and then libconfuse has 
to be told where the new file is.  This problem can't be fixed by just changing 
the include handler, otherwise I would have done it that way.

Revision 1922 currently breaks the configuration file handling and needs to be 
reverted.

Brad





Re: [Ganglia-general] gmetric fails when disk is unwriteable?

2008-11-25 Thread Brad Nicholes
 On 11/25/2008 at 10:14 AM, in message [EMAIL PROTECTED],
Ofer Inbar [EMAIL PROTECTED] wrote:
 Brad Nicholes [EMAIL PROTECTED] wrote:
 It needs a temp directory to get around some issues with libconfuse.
 Libconfuse doesn't actually support wildcard paths or files.  A
 libconfuse include statement must have a full path to the file that
 it is going to include.  So gmond makes up for this problem by creating
 a temp file, resolving all of the file paths and names and then
 writing them as separate includes in a temp file.  Then it tells
 libconfuse to include the temp file directly.  Without the ability
 to resolve the wildcard paths and write them to a temp file, the
 wildcarding feature of gmond wouldn't work.  To solve the problem
 that you are describing, we would have to actually add wildcard
 capability to libconfuse.
 
 Might this be a cleaner workaround that would work for gmond as well?
 
  - override libconfuse's include function as you're already doing
  - resolve file paths and names as you're already doing
  - instead of writing that to a temp file and telling libconfuse to
include that file, just tell libconfuse to include each individual
file (the same filenames you're now writing to the temp file)
 

No, libconfuse doesn't work that way.  The include handler can only manipulate 
the file path that it is handed.  So the result of the handler has to be a 
single absolute file path.  There isn't any way to take a single file path as 
input into the handler and return multiple file paths back to libconfuse.  The 
only way to do it was to write all of the individual file paths to a file and 
then hand libconfuse back a single file path to the new include file.

Brad 




Re: [Ganglia-general] gmetric fails when disk is unwriteable?

2008-11-25 Thread Brad Nicholes
 On 11/25/2008 at 10:14 AM, in message [EMAIL PROTECTED],
Ofer Inbar [EMAIL PROTECTED] wrote:
 Brad Nicholes [EMAIL PROTECTED] wrote:
 It needs a temp directory to get around some issues with libconfuse.
 Libconfuse doesn't actually support wildcard paths or files.  A
 libconfuse include statement must have a full path to the file that
 it is going to include.  So gmond makes up for this problem by creating
 a temp file, resolving all of the file paths and names and then
 writing them as separate includes in a temp file.  Then it tells
 libconfuse to include the temp file directly.  Without the ability
 to resolve the wildcard paths and write them to a temp file, the
 wildcarding feature of gmond wouldn't work.  To solve the problem
 that you are describing, we would have to actually add wildcard
 capability to libconfuse.
 
 Might this be a cleaner workaround that would work for gmond as well?
 
  - override libconfuse's include function as you're already doing
  - resolve file paths and names as you're already doing
  - instead of writing that to a temp file and telling libconfuse to
include that file, just tell libconfuse to include each individual
file (the same filenames you're now writing to the temp file)
 

At one point I had tried to do exactly what is being suggested here.  See 
revision

http://ganglia.svn.sourceforge.net/viewvc/ganglia?view=rev&revision=813

The problem that I ran into was that libconfuse thought that each call to 
cfg_include() meant that the include was nested deeper rather than at the same 
level.  The result was that if the wildcard produced more than 10 included 
files (which it easily does even in our default configuration), libconfuse 
choked because it thought it had hit the maximum nesting level even though we 
were still at a nesting level of one.
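
To make the failure mode concrete, here is a toy model of it.  This is not libconfuse code: the class and function names are invented, and the only facts taken from this thread are the limit of 10 and the observation that every include requested from inside the handler was counted as one level deeper.

```python
MAX_INCLUDE_DEPTH = 10  # the limit mentioned in this thread


class ToyLexer:
    """Stand-in for a lexer that tracks how deeply includes are nested."""

    def __init__(self):
        self.depth = 0

    def include(self, path):
        """Start reading path; counts as one more nesting level."""
        if self.depth >= MAX_INCLUDE_DEPTH:
            raise RuntimeError("includes nested too deeply: %s" % path)
        self.depth += 1

    def end_of_file(self):
        """Called when the lexer finishes an included file."""
        self.depth -= 1


def include_all_from_handler(lexer, paths):
    """Request every file at once, before any of them has been read.

    No end_of_file() happens in between, so siblings look like nesting.
    """
    for path in paths:
        lexer.include(path)


def include_one_at_a_time(lexer, paths):
    """What true same-level includes would look like."""
    for path in paths:
        lexer.include(path)
        lexer.end_of_file()


if __name__ == "__main__":
    files = ["conf.d/mod%02d.conf" % i for i in range(11)]
    include_one_at_a_time(ToyLexer(), files)      # fine
    try:
        include_all_from_handler(ToyLexer(), files)
    except RuntimeError as err:
        print("failed:", err)
```

In this model eleven sibling files are enough to trip the limit when they are all requested from within a single handler call.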

Brad




Re: [Ganglia-general] gmetric fails when disk is unwriteable?

2008-11-24 Thread Brad Nicholes
 On 11/21/2008 at 9:33 PM, in message [EMAIL PROTECTED],
Ofer Inbar [EMAIL PROTECTED] wrote:
 One of our servers encountered an I/O error that put its root
 filesystem into read only mode.  Both /var and /tmp are on that
 filesystem, so all logging stopped and most everything stopped.
 
 However, gmond kept on running, and reporting metrics.  Great!
 This is yet another way in which Ganglia wins over most other
 monitoring systems that involve scripts that write things to disk or
 otherwise depend on things (such as ssh logins) that need to write to
 disk.
 
 However, a program I have that feeds custom metrics to gmond via
 gmetric stopped working when the / filesystem went read-only.  I
 tried running it in debug mode, and got this error:
 
   /etc/ganglia/gmond.conf:94: failed to determine the temp dir
   Parse error for '/etc/ganglia/gmond.conf'
 
 Line 94 of gmond.conf is:
   include ('/etc/ganglia/conf.d/*.conf') 
 
 We've never had an /etc/ganglia/conf.d directory, it always ignores that.
 
 I tried feeding one of my custom metrics by hand:
 [root ~]$ gmetric --name net_smtp_fin_wait2_out --value 0 --type uint8 --units 'connections'
 /etc/ganglia/gmond.conf:94: failed to determine the temp dir
 Parse error for '/etc/ganglia/gmond.conf'
 
 Then, I cd'ed over to a filesystem that is still in read/write mode:
 [root /otherfilesys]$ gmetric --name net_smtp_fin_wait2_out --value 0 --type uint8 --units 'connections'
 
 No error, and it worked.
 
 What's the dependency that causes gmetric to require that the
 filesystem the CWD is on be writeable?  Does it really need that
 dependency?  It's great that Ganglia is so robust in the face of
 failures, but it'd be even better if gmetric were also as robust.
   -- Cos
 

Both gmetric and gmond read the same .conf file.  If the .conf file has an 
include() statement that specifies a wildcard file path, processing the 
wildcard path requires a temp directory.  If you aren't loading any files from 
the wildcard include path (i.e. /etc/ganglia/conf.d/*.conf) then just remove the 
include statement from the .conf file and everything should work fine in a 
readonly environment.  The reason why gmond kept running but you had problems 
with gmetric is because gmond had already processed the wildcard path before 
the filesystem switched to readonly.  Every time gmetric starts, it needs to 
re-read the .conf and process the wildcard path.  

Brad




Re: [Ganglia-general] gmetric fails when disk is unwriteable?

2008-11-24 Thread Brad Nicholes
 On 11/24/2008 at 3:47 PM, in message [EMAIL PROTECTED],
Ofer Inbar [EMAIL PROTECTED] wrote:
  I tried feeding one of my custom metrics by hand:
  [root ~]$ gmetric --name net_smtp_fin_wait2_out --value 0 --type uint8 --units 'connections'
  /etc/ganglia/gmond.conf:94: failed to determine the temp dir
  Parse error for '/etc/ganglia/gmond.conf'
  
  Then, I cd'ed over to a filesystem that is still in read/write mode:
  [root /otherfilesys]$ gmetric --name net_smtp_fin_wait2_out --value 0 --type uint8 --units 'connections'
  
  No error, and it worked.
  
  What's the dependency that causes gmetric to require that the
  filesystem the CWD is on be writeable?  Does it really need that
  dependency?  It's great that Ganglia is so robust in the face of
  failures, but it'd be even better if gmetric were also as robust.
 
 Someone wrote me to suggest running it with strace, which is an
 obvious thing to do but unfortunately I didn't think of it at the time
 of the failure (it was late at night).  However, Brad knows the answer:
 
 Brad Nicholes [EMAIL PROTECTED] wrote:
 Both gmetric and gmond read the same .conf file.  If the .conf file
 has an include() statement that specifies a wildcard file path,
 processing the wildcard path requires a temp directory.  If you
 
 Removing the wildcard doesn't seem ideal, since it's something one
 might want to use and it's part of the standard config, so removing
 it and then forgetting that seems like a likely cause of confusion.
 Also, most people would never think to investigate something that's
 in the supplied conf file and doesn't seem to cause harm.  If we want
 robustness in the face of failure, having gmetric and gmond able to
 run without having to write to disk sounds like a better goal.  Is
 it doable?
 
 Why does it need to write to a temp directory to process a wildcard?
 
 Are there any other parts of gmond or gmetric that depend on being
 able to write to disk?  It seems that both of these programs should be
 able to avoid writing to disk entirely (except for swap/paging space
 on a memory-starved host).
   -- Cos

It needs a temp directory to get around some issues with libconfuse.  
Libconfuse doesn't actually support wildcard paths or files.  A libconfuse 
include statement must have a full path to the file that it is going to include.  
So gmond makes up for this problem by creating a temp file, resolving all of 
the file paths and names and then writing them as separate includes in a temp 
file.  Then it tells libconfuse to include the temp file directly.  Without the 
ability to resolve the wildcard paths and write them to a temp file, the 
wildcarding feature of gmond wouldn't work.  To solve the problem that you are 
describing, we would have to actually add wildcard capability to libconfuse.
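
The workaround described above can be sketched as follows.  This is a Python illustration of the idea only, not the C code in gmond; the function name is made up, and the include ('...') line format is copied from the gmond.conf line quoted earlier in this thread.

```python
import glob
import os
import tempfile


def expand_wildcard_include(pattern, tmpdir=None):
    """Resolve a wildcard include into one temp file of plain includes.

    Returns the path of the temp file, which is the single absolute path
    that can be handed back to the config parser.  tempfile.mkstemp raises
    OSError when no writable temp directory exists, which corresponds to
    the failure seen on a read-only filesystem.
    """
    paths = sorted(glob.glob(pattern))
    fd, tmp_path = tempfile.mkstemp(suffix=".conf", dir=tmpdir)
    with os.fdopen(fd, "w") as tmp:
        for path in paths:
            # One explicit, absolute include per matched file.
            tmp.write("include ('%s')\n" % os.path.abspath(path))
    return tmp_path


if __name__ == "__main__":
    result = expand_wildcard_include("/etc/ganglia/conf.d/*.conf")
    with open(result) as handle:
        print(handle.read())
    os.remove(result)
```

A pattern that matches nothing still produces a temp file, just an empty one, so the temp directory is needed even when conf.d does not exist.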

Brad




Re: [Ganglia-general] High system load when gmond is running

2008-11-13 Thread Brad Nicholes
 On 11/13/2008 at 4:08 PM, in message [EMAIL PROTECTED],
[EMAIL PROTECTED] wrote:
 I looked into it further and it looks like my problem isn't gmond, it's  
 gmetad. If I just have gmond running without gmetad the system load is  
 normal but as soon as I start gmetad the load starts to go up. I ran  
 valgrind gmetad -d 1 and it looks like I have a memory leak in gmetad. Does  
 anyone know what I could try to fix the memory leak with gmetad?
 
 -Peter
 
 ==19924== Memcheck, a memory error detector.
 ==19924== Copyright (C) 2002-2006, and GNU GPL'd, by Julian Seward et al.
 ==19924== Using LibVEX rev 1658, a library for dynamic binary translation.
 ==19924== Copyright (C) 2004-2006, and GNU GPL'd, by OpenWorks LLP.
 ==19924== Using valgrind-3.2.1, a dynamic binary instrumentation framework.
 ==19924== Copyright (C) 2000-2006, and GNU GPL'd, by Julian Seward et al.
 ==19924== For more details, rerun with: -v
 ==19924==
 Sources are ...
 Source: [my cluster, step 15] has 1 sources
 192.168.1.254
 tcp_listen() on xml_port failed: Address already in use
 ==19924==
 ==19924== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 4 from 1)
 ==19924== malloc/free: in use at exit: 10,118 bytes in 101 blocks.
 ==19924== malloc/free: 170 allocs, 69 frees, 20,163 bytes allocated.
 ==19924== For counts of detected errors, rerun with: -v
 ==19924== searching for pointers to 101 not-freed blocks.
 ==19924== checked 208,416 bytes.
 ==19924==
 ==19924== LEAK SUMMARY:
 ==19924== definitely lost: 465 bytes in 4 blocks.
 ==19924== possibly lost: 0 bytes in 0 blocks.
 ==19924== still reachable: 9,653 bytes in 97 blocks.
 ==19924== suppressed: 0 bytes in 0 blocks.
 ==19924== Use --leak-check=full to see details of leaked memory.
 
 

I guess the more obvious question is, why are you getting:

 tcp_listen() on xml_port failed: Address already in use

What else is listening on the ports that gmetad is trying to open?  These ports 
by default are 8651 and 8652.  Do you have another gmetad running on the same 
box or is some other service configured to listen on these same ports?
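
One quick way to check is to try binding the ports yourself.  This is a small sketch; the function name is made up, and the two port numbers are simply the defaults named above.

```python
import socket

GMETAD_PORTS = (8651, 8652)  # default gmetad ports named above


def port_in_use(port, host="0.0.0.0"):
    """Return True if a TCP bind to (host, port) fails.

    A failed bind normally means something else already holds the port,
    which is the situation behind "Address already in use".
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return True
        return False


if __name__ == "__main__":
    for port in GMETAD_PORTS:
        state = "in use" if port_in_use(port) else "free"
        print("port %d: %s" % (port, state))
```

Run it while gmetad is stopped; a port that still reports "in use" is held by another process.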

Brad






Re: [Ganglia-general] cluster graphing stops entirely after mastergmond restart

2008-11-11 Thread Brad Nicholes
 On 11/10/2008 at 6:11 PM, in message [EMAIL PROTECTED],
Ofer Inbar [EMAIL PROTECTED] wrote:
 Brad Nicholes [EMAIL PROTECTED] wrote:
 The reason why is because with the introduction of the modular
 metric functionality, metric metadata is now passed between gmonds
 rather than it being hardcoded into every gmond.  In multicast mode,
 if you restart the master gmond, it has to request from and wait for
 each sub-gmond that is listening on the same multicast channel, to
 respond with its metadata for each metric it supports.  Depending on
 the reporting interval for a collection group, this could take
 anywhere from a few seconds to several minutes.  In unicast mode the
 
 However, as we discovered a couple of months ago, requesting new
 metadata doesn't always work as designed, and it can take hours
 to get everything it needs.  Restarting the *other* gmonds in the
 cluster is sort of a workaround, because each one will send its
 metadata when it is restarted.  This way, you can at least ensure that
 the least recently restarted gmond knows everything.
   -- Cos

That's interesting.  I would like to investigate this further since I haven't 
seen the same problem.  In my testing, granted I don't have very large clusters 
to test with, the complete metadata resync time has only taken as long as the 
longest collection_group interval (i.e. time_threshold value).  If you 
don't have any collection_groups that have a time_threshold on the order of 
hours, then there is something we need to investigate further.  It will just be 
a little more difficult because I can't duplicate the delay in my testing 
environment.

BTW, another workaround if you don't mind the additional UDP traffic, is to set 
the send_metadata_interval value anyway, even in multicast mode.  All it will 
do is ensure that the metadata is sent on an interval rather than just on 
request.  This might be a good idea if you are restarting gmonds very often.

Brad




Re: [Ganglia-general] cluster graphing stops entirely after mastergmond restart

2008-11-10 Thread Brad Nicholes
 On 11/10/2008 at 3:26 PM, in message [EMAIL PROTECTED],
Brad Fino [EMAIL PROTECTED] wrote:
 If I restart gmond on the master node that a cluster reports to, the entire
 cluster stops graphing entirely.  Some nodes in the cluster start graphing
 immediately after a node gmond restart, and some do not.  Some graph partial
 statistics.  It usually takes 2-3 restarts to get the entire cluster
 graphing properly again.  Even if I leave the nodes be for 5,10,30 minutes
 they don't start graphing again until a gmond restart on the node.  
 
  
 
 I don't remember this behavior in pre 3.1.0 gmond / gmetad.  In 3.0.6 if I
 restarted a master gmond then the cluster would just pick right up again;
 here it just flat stops graphing.  The nodes aren't reported as being
 offline.  The old stats and metrics are still all there.  It just stops
 graphing new data.
 

The reason is that, with the introduction of the modular metric 
functionality, metric metadata is now passed between gmonds rather than it 
being hardcoded into every gmond.  In multicast mode, if you restart the master 
gmond, it has to request from and wait for each sub-gmond that is listening on 
the same multicast channel, to respond with its metadata for each metric it 
supports.  Depending on the reporting interval for a collection group, this 
could take anywhere from a few seconds to several minutes.  In unicast mode the 
global directive send_metadata_interval must be set to something greater than 
0.  The value of this directive is the interval in seconds at which gmond will 
send its metric metadata to the master gmond.
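
The unicast rule above can be expressed as a small sanity check.  This is only a sketch: the function name and the "unicast"/"multicast" strings are hypothetical, and it models nothing beyond the one requirement stated in this message.

```python
def metadata_setting_problems(mode, send_metadata_interval):
    """Return a list of problems with the metadata settings (empty if none).

    Assumption: 'mode' is either "unicast" or "multicast"; these strings
    are illustrative and are not gmond configuration keywords.
    """
    problems = []
    if mode == "unicast" and send_metadata_interval <= 0:
        problems.append("unicast mode needs send_metadata_interval > 0")
    return problems


print(metadata_setting_problems("unicast", 0))
print(metadata_setting_problems("multicast", 0))
```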

Brad




Re: [Ganglia-general] nonresponsive gmond

2008-11-04 Thread Brad Nicholes
 On 11/3/2008 at  5:27 PM, in message [EMAIL PROTECTED],
Kostas Georgiou [EMAIL PROTECTED] wrote: 
 On Mon, Nov 03, 2008 at 11:46:52PM +, Kostas Georgiou wrote:
 
 On Mon, Nov 03, 2008 at 01:55:22PM -0700, Brad Nicholes wrote:
  
  If a timeout is set, then is the resulting XML output still good or did we 
 lose something because of the timeout?
 
 No, it seems to be working fine. I am testing with:
 
 Actually I was wrong there was enough data in the socket buffers to
 confuse me. The xml output is truncated in the slow reader :(
 

Attached is a patch against trunk which implements a lingering close.  I am not 
sure if this will solve the problem but Apache does a similar thing to make 
sure that both sides get a chance to complete the conversation before closing 
the socket.  Apply this patch, let it run for a while and let's see if this 
solves the problem.
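
For readers who want the idea without the APR details, a lingering close can be sketched in Python with the standard socket module.  This is an illustration of the technique, not a translation of the patch: the function signature is made up, and the overall deadline is computed up front rather than lazily after the first read.

```python
import socket
import time


def lingering_close(sock, idle_timeout=2.0, max_linger=30.0):
    """Half-close the socket, drain what the peer still sends, then close.

    Shutting down only the write side sends a FIN while leaving the read
    side open, so the peer can finish reading our response before the
    socket is torn down.
    """
    try:
        sock.shutdown(socket.SHUT_WR)
    except OSError:
        sock.close()
        return
    sock.settimeout(idle_timeout)
    deadline = time.monotonic() + max_linger
    try:
        while time.monotonic() < deadline:
            if not sock.recv(512):  # empty read means the peer closed
                break
    except OSError:  # includes the timeout when the peer stays silent
        pass
    finally:
        sock.close()
```

The two timeouts play the same roles as SECONDS_TO_LINGER and MAX_SECS_TO_LINGER in the patch.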

Brad

Index: gmond.c
===
--- gmond.c (revision 1883)
+++ gmond.c (working copy)
@@ -1498,6 +1498,76 @@
   return apr_socket_send(client, "</HOST>\n", len); 
 }
 
+/* we now proceed to read from the client until we get EOF, or until
+ * MAX_SECS_TO_LINGER has passed.  the reasons for doing this are
+ * documented in a draft:
+ *
+ * http://www.ics.uci.edu/pub/ietf/http/draft-ietf-http-connection-00.txt
+ *
+ * in a nutshell -- if we don't make this effort we risk causing
+ * TCP RST packets to be sent which can tear down a connection before
+ * all the response data has been sent to the client.
+ */
+#define MAX_SECS_TO_LINGER 30
+#define SECONDS_TO_LINGER  2
+void lingering_close(apr_socket_t *csd)
+{
+    char dummybuf[512];
+    apr_size_t nbytes;
+    apr_time_t timeup = 0;
+
+    if (!csd) {
+        return;
+    }
+
+#ifdef NO_LINGCLOSE
+    ap_flush_conn(c); /* just close it */
+    apr_socket_close(csd);
+    return;
+#endif
+
+    /* Close the connection, being careful to send out whatever is still
+     * in our buffers.  If possible, try to avoid a hard close until the
+     * client has ACKed our FIN and/or has stopped sending us data.
+     */
+
+    /* Shut down the socket for write, which will send a FIN
+     * to the peer.
+     */
+    if (apr_socket_shutdown(csd, APR_SHUTDOWN_WRITE) != APR_SUCCESS)
+    {
+        apr_socket_close(csd);
+        return;
+    }
+
+    /* Read available data from the client whilst it continues sending
+     * it, for a maximum time of MAX_SECS_TO_LINGER.  If the client
+     * does not send any data within 2 seconds (a value pulled from
+     * Apache 1.3 which seems to work well), give up.
+     */
+    apr_socket_timeout_set(csd, apr_time_from_sec(SECONDS_TO_LINGER));
+    apr_socket_opt_set(csd, APR_INCOMPLETE_READ, 1);
+
+    /* The common path here is that the initial apr_socket_recv() call
+     * will return 0 bytes read; so that case must avoid the expensive
+     * apr_time_now() call and time arithmetic. */
+
+    do {
+        nbytes = sizeof(dummybuf);
+        if (apr_socket_recv(csd, dummybuf, &nbytes) || nbytes == 0)
+            break;
+
+        if (timeup == 0) {
+            /* First time through; calculate now + 30 seconds. */
+            timeup = apr_time_now() + apr_time_from_sec(MAX_SECS_TO_LINGER);
+            continue;
+        }
+    } while (apr_time_now() < timeup);
+
+    apr_socket_close(csd);
+    return;
+}
+
+
 static void
 process_tcp_accept_channel(const apr_pollfd_t *desc, apr_time_t now)
 {
@@ -1584,8 +1654,9 @@
 
   /* Close down the accepted socket */
 close_accept_socket:
-  apr_socket_shutdown(client, APR_SHUTDOWN_READ);
-  apr_socket_close(client);
+  lingering_close(client);
+  //apr_socket_shutdown(client, APR_SHUTDOWN_READ);
+  //apr_socket_close(client);
   apr_pool_destroy(client_context);
 }
 


Re: [Ganglia-general] ganglia vs. top: running processes

2008-10-28 Thread Brad Nicholes
 On 10/24/2008 at 7:57 AM, in message
[EMAIL PROTECTED], Ofer Inbar [EMAIL PROTECTED] wrote:
 Recently we noticed something we don't know the explanation for, on a
 CentOS4 for system running gmond 3.1.0: The Ganglia graph shows a line
 for running processes that sometimes spikes to 10, 20, or higher,
 and often stays at that high level for several samples in a row; but
 top reports that processes running is 1-4 and occasionally 5 or 6,
 but never shows a number as high as 10 even if we watch it for a while,
 while the graph updates and clearly shows spikes.
 
 We found one cause of those spikes and fixed it, reducing their size
 and number, so we don't think Ganglia is fabricating it.  It's showing
 something that's really happening.  But why the disagreement with top?


One reason for this is probably that Ganglia specifically looks for spikes 
and then reports them.  This is what the value_threshold directive in the .conf 
file is for.  If the delta between the previous value and the current value 
ever spikes beyond a specified percentage, then gmond will immediately send the 
values for the collection_group that the metric belongs to.  This ensures that 
unexpected spikes are not missed.  I'm not sure what top is doing in this same 
case.  It may be ignoring spikes and simply displaying an average.
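
The check described above can be sketched as follows.  It follows the wording of this message and treats the threshold as a percentage change from the previous value; that interpretation, the function name, and the handling of a zero previous value are assumptions, and gmond's actual comparison may be implemented differently.

```python
def exceeds_value_threshold(previous, current, threshold_percent):
    """Return True when the change from previous to current is a 'spike'.

    Assumption: the threshold is a percentage of the previous value, as
    described in this message; this is not gmond's real code.
    """
    if previous == 0:
        # Any move away from zero is treated as a spike in this sketch.
        return current != 0
    change = abs(current - previous) / float(abs(previous)) * 100.0
    return change > threshold_percent


# Running processes jumping from 4 to 20 is far beyond a 50% threshold.
print(exceeds_value_threshold(4, 20, 50))
```

When the check fires, the whole collection_group is sent immediately, which is why short spikes show up in the graph.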

Brad




Re: [Ganglia-general] About python module for gmond (multiple metrics per one single call back? dynamic pyconf?)

2008-10-22 Thread Brad Nicholes
 On 10/20/2008 at 11:19 PM, in message
[EMAIL PROTECTED], utopia zh
[EMAIL PROTECTED] wrote:
 Hi,
 
 I'm recently working on the gmond python mode. I found that for some
 metrics, it will be beneficial if we can return multiple metric values in
 the single callback.
 
 For example, if we want to get usage information about disks (total, used,
 free),  we can get these values via a single statvfs call, but in order to
 send these metrics out using python module, we need to call statvfs for 3
 times.
 
 There maybe some more time-consuming examples. For example, we may need to
 parse a xml to get some values, it would save a lot of efforts if we can
 fill the value of multiple metric at the same time.
 
 I thought that maybe OK for python module if we just return
 Value1:Values2:Value3, but in that case, we will not be able to get pictures
 from ganglia/rrdtools.
 

The way to do this is to spawn a thread in your python module that gathers all 
of the metrics at once and then just caches them.  Then when each of your 
metric handlers is called, it simply returns the cached value.  Take a 
look at the tcpconn.py module.  This is the way that it works.  It calls 
netstat once to gather all of the various values.  Then the different metric 
handlers just read the current values from the cache.
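
A minimal sketch of that pattern, applied to the statvfs example from the question, might look like the following.  It is not taken from tcpconn.py: the metric names, the refresh_seconds parameter and the "/" mount point are made up for illustration, and only the metric_init/metric_cleanup/call_back structure follows the gmond Python module interface.

```python
import os
import threading

_cache = {}
_lock = threading.Lock()
_stop = threading.Event()
_worker = None


def _gather(path="/"):
    """Collect all three disk values with a single statvfs call."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bfree * st.f_frsize
    return {"disk_total_bytes": float(total),
            "disk_free_bytes": float(free),
            "disk_used_bytes": float(total - free)}


def _refresh_loop(interval):
    """Background thread: refresh the cache until asked to stop."""
    while not _stop.is_set():
        values = _gather()
        with _lock:
            _cache.update(values)
        _stop.wait(interval)


def get_value(name):
    """Metric handler: return the cached value, never touch the disk."""
    with _lock:
        return _cache.get(name, 0.0)


def metric_init(params):
    global _worker
    # 'refresh_seconds' is a hypothetical module parameter.
    interval = float(params.get("refresh_seconds", 10))
    with _lock:
        _cache.update(_gather())  # prime the cache before the first call
    _stop.clear()
    _worker = threading.Thread(target=_refresh_loop, args=(interval,))
    _worker.daemon = True
    _worker.start()
    return [{"name": name, "call_back": get_value, "time_max": 90,
             "value_type": "double", "units": "bytes", "slope": "both",
             "format": "%.0f", "description": name, "groups": "disk"}
            for name in sorted(_cache)]


def metric_cleanup():
    _stop.set()
    if _worker is not None:
        _worker.join()
```

Each handler call is then just a dictionary lookup, so the cost of gathering is paid once per refresh regardless of how many metrics share the data.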

 Another question about python module is that: for some dynamically changed
 metrics (e.g. we may need to handle adding/removing storage devices), could
 we add/remove metric entry in pyconf on the fly? I noticed that using gmond
 -m will be able to get all entries of metrics according to metric_init,  is
 there any way to let gmond collect metrics according to metric entries given
 by metric_init of python script (instead of from conf.d/*.pyconf)?  Or we
 can go still further to let gmond call metric_init every period of time?
 
 Any comments? Thanks.
 

This is an issue that I have looked into a few times but haven't been able to 
come up with a solution without re-architecting the way that gmond creates its 
internal list of metrics.  This would be a nice feature to have.  The spoofing 
feature for module metrics has a dynamic element to it so there may be 
something that we can do to extend that functionality for general metrics.  But 
we need to get the spoofing feature backported to 3.1.1 first which is still 
waiting for review and approval.

Brad

 
 
 Regards,
 
 Hang






Re: [Ganglia-general] metric name and spoofed metrics

2008-10-06 Thread Brad Nicholes
 On 10/3/2008 at 12:23 PM, in message
[EMAIL PROTECTED], Martin Hicks
[EMAIL PROTECTED] wrote:

 (sorry if this is a duplicate.  I sent it yesterday but I haven't seen
 it come back yet, nor has it shown up in the mailing list archives on
 sourceforge)
 
 Hi,
 
 I backported the spoofing patches to my 3.1.1 build (I also check this
 against trunk to make sure I hadn't missed something) in order to play
 with python DSO and spoofing.
 
 I ran into a question.  If you have the same metric for a bunch of
 different hosts, just with different SPOOF_HOST, then how do you tell
 the difference in the call_back?  You just get the 'name', which appears
 to always be the same.
 
[SNIP]
 
 Am I expected to deal with each SPOOF_HOST when a call_back occurs for a
 particular metric name?
 
 I kind of expected SPOOF_NAME to be in the 'name' argument of the
 call_back.
 

For the spoof modules that I wrote, I appended the host name to the name of the 
metric and used it as a unique metric identifier:

metric_name = name + ':' + host_name

Then in the handler I just call 

metric_keys = name.split(':')

and I end up with an array that uniquely identifies which metric for which 
host.  Finally, to make sure that the concatenation of the 
metric_name:host_name does not show up in the front end as the title of the 
graph, I make sure to assign a title to the metric in the gmond.conf file.

This is just one way to uniquely identify any spoofed metric.  Since gmond has 
no idea what a metric module is doing or what naming convention is being used, 
it simply leaves all of that up to the module itself rather than trying to 
enforce some policy.
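
Put together, a module using this naming convention could be sketched as below.  The host list, the "temp" metric and the sample readings are made up, and the spoof-specific descriptor fields are left out because they are not spelled out in this thread; only the name-plus-colon-plus-host convention comes from the message above.

```python
# Made-up sample readings keyed by (metric, host); a real module would
# collect these from the spoofed hosts itself.
READINGS = {("temp", "node1"): 41, ("temp", "node2"): 47}
SPOOFED_HOSTS = ["node1", "node2"]  # hypothetical host list


def temp_handler(name):
    """gmond passes the full metric name; split it to find the host."""
    metric_keys = name.split(":")
    metric, host_name = metric_keys[0], metric_keys[1]
    return READINGS[(metric, host_name)]


def metric_init(params):
    descriptors = []
    for host_name in SPOOFED_HOSTS:
        descriptors.append({
            "name": "temp" + ":" + host_name,  # unique per spoofed host
            "call_back": temp_handler,
            "time_max": 90,
            "value_type": "uint",
            "units": "C",
            "slope": "both",
            "format": "%u",
            "description": "Temperature",
        })
    return descriptors


def metric_cleanup():
    pass
```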

Brad




Re: [Ganglia-general] metric name and spoofed metrics

2008-10-06 Thread Brad Nicholes
 On 10/6/2008 at 2:23 PM, in message
[EMAIL PROTECTED], Martin Hicks [EMAIL PROTECTED]
wrote:

 On Mon, Oct 06, 2008 at 11:11:51AM -0600, Brad Nicholes wrote:
  On 10/3/2008 at 12:23 PM, in message
  
  Am I expected to deal with each SPOOF_HOST when a call_back occurs
  for a particular metric name?
  
  I kind of expected SPOOF_NAME to be in the 'name' argument of the
  call_back.
  
 
 For the spoof modules that I wrote, I appended the host name to the
 name of the metric and used it as a unique metric identifier
 
 metric_name = name + ':' + host_name
 
 Then in the handler I just call 
 
 metric_keys = name.split(':')
 
 and I end up with an array that uniquely identifies which metric for
 which host.  Finally, to make sure that the concatenation of the
 metric_name:host_name does not show up in the front end as the title
 of the graph, I make sure to assign a title to the metric in the
 gmond.conf file.
 
 This is just one way to uniquely identify any spoofed metric.  Since
 gmond has no idea what a metric module is doing or what naming
 convention is being used, it simply leaves all of that up to the
 module itself rather than trying to enforce some policy.
 
 Does this not also mean I need to know at gmond start-up time which
 hosts the module will be spoofing for and rewrite
 /etc/ganglia/conf.d/module.pyconf to reflect the list of nodes that
 will need to be spoofed?
 

No.  Sorry, the colon-separated name/id actually has more significance than I 
stated previously.  The metric_name:host_name is actually a derivative of the 
same style used by gmetric in the --spoof=STRING parameter.  Part of the patch 
to gmond.c was to add a function called get_metric_names() which, for a spoofed 
metric, specifically looks for a colon-separated metric name, pulls the first 
name from the string and uses that to match against the name that is referenced 
in the gmond.conf file.  So for example if your metric name is my_foo_metric 
then the name that is referenced in gmond.conf should also be my_foo_metric.  
But the name that your module actually assigns as the metric name in the metric 
definition structure should be my_foo_metric:some_host_or_other_id.  This 
needs to be documented in the README.in file that explains how to write a 
python metric module.  I also plan on adding this same documentation to the 
wiki once the patch has been backported.
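
The matching behaviour described above can be modelled in a few lines.  This is a Python model of what the message says get_metric_names() does, not the C code from the patch, and the helper names are hypothetical.

```python
def base_metric_name(metric_name):
    """Return the part of a metric name before the first colon."""
    return metric_name.split(":", 1)[0]


def matches_config(module_metric_name, configured_name):
    """True when a module's metric name matches the name in gmond.conf."""
    return base_metric_name(module_metric_name) == configured_name


# The module reports the long name; gmond.conf only lists the short one.
print(matches_config("my_foo_metric:some_host_or_other_id", "my_foo_metric"))
```

This is why the configuration does not need to list every spoofed host: all of the per-host names collapse to the same configured metric name.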

Brad




Re: [Ganglia-general] mod_python on Solaris not scanning directory

2008-09-25 Thread Brad Nicholes
 On 9/25/2008 at 6:08 AM, in message
[EMAIL PROTECTED], Gilad Raphaelli
[EMAIL PROTECTED] wrote:

 
 
 
 
 - Original Message 
 From: Brad Nicholes [EMAIL PROTECTED]
 To: Lieting Yu [EMAIL PROTECTED]; ganglia-general@lists.sourceforge.net; 
 Gilad Raphaelli [EMAIL PROTECTED]
 Sent: Thursday, September 25, 2008 1:05:11 AM
 Subject: Re: [Ganglia-general] mod_python on Solaris not scanning  directory
 
  On 9/23/2008 at  7:03 PM, in message
 [EMAIL PROTECTED], Gilad Raphaelli
 wrote: 
  Lieting,
  
  I believe I ran into the same issue and cleared it up with this patch to 
  mod_python.c:
  
  --- mod_python.c.orig   2008-09-24 10:52:17.0 +1000
  +++ mod_python.c2008-09-24 10:59:06.0 +1000
  @@ -510,6 +510,11 @@
   /* Set up the python path to be able to load module from our module 
  path */
   set_python_path(path);
   Py_Initialize();
  +
  +PyObject *sys_path = PySys_GetObject("path");
  +PyObject *addpath = PyString_FromString(path);
  +PyList_Append(sys_path, addpath);
  +
   PyEval_InitThreads();
   gtstate = PyEval_SaveThread();
  
  This does the equivalent of a 'sys.path.append(path)' without any error 
  handling.  I haven't heard back from the developer list whether this is 
  the right approach but perhaps it will get you going for now.
  
 
 The call to the function set_python_path(path) should have done the same 
 thing by setting up the PYTHONPATH environment variable.  Are you suggesting 
 that we add the above code as well or that the above code should replace the 
 code that already exists in set_python_path()?
 
 I'm not exactly an authority on embedding the python interpreter in C but I 
 believe I ran into the same issue that Lieting is reporting, on a rhel4 box, 
 and solved it with the code above.  Coincidentally, the question of the 
 'correct' way to set the module import path came up just a few days before I 
 had this problem - 
 http://mail.python.org/pipermail/python-list/2008-September/506206.html.  I 
 think that just doing the above without setting the PYTHONPATH environment 
 variable should work regardless of the platform.
 

OK. Do you know which version of Python introduced the PySys_GetObject() 
function?  It seems to be an undocumented function until python 2.6.  It might 
be good to know what the issue is with this function and why it wasn't 
documented before we start using it.
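
For reference, the Python-level operation that the quoted C patch mirrors is shown below, with the duplicate check that the patch leaves out.  The function name and the example directory are hypothetical.

```python
import sys


def add_module_path(path):
    """Append path to sys.path unless it is already there.

    Returns True when the path was added, False when it was present.
    """
    if path in sys.path:
        return False
    sys.path.append(path)
    return True


# Hypothetical module directory, used only for illustration.
add_module_path("/usr/lib/ganglia/python_modules")
```

Unlike PYTHONPATH, which is read once when the interpreter starts, this changes the import path of an interpreter that is already running.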

Brad




Re: [Ganglia-general] mod_python on Solaris not scanning directory

2008-09-24 Thread Brad Nicholes
 On 9/23/2008 at  7:03 PM, in message
[EMAIL PROTECTED], Gilad Raphaelli
[EMAIL PROTECTED] wrote: 
 Lieting,
 
 I believe I ran into the same issue and cleared it up with this patch to 
 mod_python.c:
 
 --- mod_python.c.orig   2008-09-24 10:52:17.0 +1000
 +++ mod_python.c2008-09-24 10:59:06.0 +1000
 @@ -510,6 +510,11 @@
  /* Set up the python path to be able to load module from our module 
 path */
  set_python_path(path);
  Py_Initialize();
 +
 +PyObject *sys_path = PySys_GetObject("path");
 +PyObject *addpath = PyString_FromString(path);
 +PyList_Append(sys_path, addpath);
 +
  PyEval_InitThreads();
  gtstate = PyEval_SaveThread();
 
 This does the equivalent of a 'sys.path.append(path)' without any error 
 handling.  I haven't heard back from the developer list whether this is the 
 right approach but perhaps it will get you going for now.
 

The call to the function set_python_path(path) should have done the same thing 
by setting up the PYTHONPATH environment variable.  Are you suggesting that we 
add the above code as well or that the above code should replace the code that 
already exists in set_python_path()?

Brad




Re: [Ganglia-general] [Ganglia-developers] can't get cpu_num to show for whole cluster

2008-09-15 Thread Brad Nicholes
 On 9/12/2008 at 11:48 AM, in message
[EMAIL PROTECTED], Bernard Li
[EMAIL PROTECTED] wrote:
 Hi all:
 
 On Fri, Sep 12, 2008 at 10:00 AM, Ofer Inbar [EMAIL PROTECTED] wrote:
 
 I added a host to an existing cluster, and noticed the total number of
 CPU cores for the cluster fluctuate, so I tried restarting all the
 gmond's in the cluster... but that just made most of the CPU's appear
 to disappear from Ganglia metrics altogether.

 I narrowed it down to this:
  each gmond only reports cpu_num for nodes that restarted after it.

 If I restart gmond on node1, it reports cpu_num for itself only, even
 though other gmonds in the cluster are reporting cpu_num for other
 nodes.  If I restart gmond on node2, node1 will now report cpu_num for
 itself and for node2 ... but node2 has now forgotten cpu_num for all
 other nodes except itself.  And so on.

 It's a catch-22.  I can't make them all see every node's metrics.

 Ganglia 3.1.0 on CentOS, using multicast only.
 
 Since I have a 3.0.7 installation handy, Cos suggested I do an experiment:
 
 1) With Ganglia 3.1.1, I restart a gmond that is not listed in the
 data_source and I nc the host checking for the number of cpu_num lines
 in the XML stream.  This number stays 1 until quite a while (maybe 10
 mins).
 2) With Ganglia 3.0.7, I did the same test as above, and immediately
 after restarting gmond the number of cpu_num lines was already at 3,
 and quite quickly grows in a matter of minutes
 
 When I first tried 3.1.x, I always thought it odd that when I restart
 a gmond, I had to restart *all* the rest of the gmonds to get the
 right number of total cpus, I guess this confirms my suspicion.
 
 If this is indeed a bug/unwanted new behaviour, please discuss this in
 ganglia-developers.
 

I am wondering if this might be an issue with the way that the metadata
for a metric is being sent.  The unique attribute about this is that
cpu_num is a collect_once metric.  This means that if the data value is
sent but one of the gmond's in the cluster has not received the metadata
yet, the value may get ignored when the XML is written.  One interesting
test to try to validate this theory would be to set the
send_metadata_interval to something greater than zero even in a
multicast environment.  Then run your same test and see if the same
problem shows up or goes away.  If the problem goes away, then we might
have to rework how the metadata is being requested and sent in a
multicast environment.
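
The theory can be illustrated with a toy model of a receiving gmond.  The class and its method names are hypothetical and do not reflect gmond's actual data structures; the only behaviour modelled is that a value without metadata is left out of the output.

```python
class ToyReceiver(object):
    """Toy model: a metric is reported only if its metadata is known."""

    def __init__(self):
        self.metadata = {}  # (host, metric) -> metadata dict
        self.values = {}    # (host, metric) -> last value received

    def on_metadata(self, host, metric, meta):
        self.metadata[(host, metric)] = meta

    def on_value(self, host, metric, value):
        self.values[(host, metric)] = value

    def reportable(self):
        """Return the (host, metric) pairs that would appear in the XML."""
        return sorted(key for key in self.values if key in self.metadata)


# node1 was just restarted: it knows its own metadata but not node2's.
node1 = ToyReceiver()
node1.on_metadata("node1", "cpu_num", {"units": "CPUs"})
node1.on_value("node1", "cpu_num", 8)
node1.on_value("node2", "cpu_num", 4)  # collect_once value, sent one time
print(node1.reportable())
```

In this model node2's cpu_num only becomes visible once its metadata arrives, which is what a non-zero send_metadata_interval would force to happen periodically.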

Brad



[Ganglia-general] [ANNOUNCEMENT] Official release of Ganglia 3.1.1

2008-09-09 Thread Brad Nicholes
The Ganglia Project (http://ganglia.info) is pleased to announce the  
official release of Ganglia 3.1.1.  The official tarball is available for 
immediate download at:

http://sourceforge.net/project/showfiles.php?group_id=43021&package_id=35280&release_id=625044
 

For a full description of the bug fixes and enhancements that are included
in the 3.1.1 release as well as upgrade information, please see the current 
release notes at:

http://ganglia.wiki.sourceforge.net/ganglia_release_notes 


Supported platforms:

  * Linux (Fedora/RedHat/CentOS, Debian, Gentoo, SuSE/OpenSuSE)
  * [Open]Solaris
  * FreeBSD
  * NetBSD
  * OpenBSD
  * DragonflyBSD
  * Cygwin (no support for DSO yet)
  * AIX (no support for DSO yet)

Please read all the README, INSTALL and other available documentation 
(http://ganglia.wiki.sourceforge.net) as a lot of things have changed since 
version 3.0. Use good deployment practices when upgrading from 3.0 to make sure 
that you do not mix gmond 3.0 and 3.1 nodes in the same cluster (as defined by 
a multicast address or unicast collector node).  The protocol that allows gmond 
nodes to communicate within the same cluster has changed.  However, the XML 
packets that are passed between gmond and gmetad have remained compatible from 
3.0.x to 3.1.x, allowing a 3.1.x gmetad to continue to pull data from an older 
3.0.x gmond cluster.
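
Before upgrading, the "do not mix 3.0 and 3.1 gmond nodes in one cluster" rule can be checked with a few lines of Python.  This sketch assumes you already have a list of gmond version strings for one cluster; how that list is collected, and the function names, are left hypothetical.

```python
def protocol_family(version):
    """Reduce a version such as '3.1.1' to its major.minor family."""
    major, minor = version.split(".")[:2]
    return major + "." + minor


def gmond_families(gmond_versions):
    """Return the sorted set of major.minor families seen in one cluster."""
    return sorted(set(protocol_family(v) for v in gmond_versions))


# More than one family in the same cluster means mixed gmond protocols.
print(gmond_families(["3.0.7", "3.1.1", "3.1.0"]))
```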

Ganglia Development Team





Re: [Ganglia-general] gmetad bug when gmond host hangs

2008-09-02 Thread Brad Nicholes
 On 9/1/2008 at  3:35 PM, in message [EMAIL PROTECTED], Carlo
Marcelo Arenas Belon [EMAIL PROTECTED] wrote:
 On Sat, Aug 30, 2008 at 10:09:02AM -0600, Brad Nicholes wrote:
  On 8/30/2008 at 12:25 AM, in message [EMAIL PROTECTED], Carlo
 Marcelo Arenas Belon [EMAIL PROTECTED] wrote:
  On Fri, Aug 29, 2008 at 02:40:00PM -0400, Ofer Inbar wrote:
  Should this have made it into 3.1, or 3.1.1?  It
  doesn't look like it.
  
  There is a fix in trunk now with r1738 and unless something goes wrong with
  it, will be most likely released with 3.1.2 and 3.0.8.
 
 The proposed backport patch for 3.1.x has been updated in the BUG and
 officially requested for inclusion in 3.1 (beware it includes 1 extra
 unrelated change that is needed to prevent future conflicts for backporting
 but that is otherwise mostly irrelevant, specially if making your own 
 package)
 but also additional changes that simplify the logic and avoid a possible
 failure of logic which could result in gmetad crashing, so using this newer
 version is encouraged :
 
   
 http://bugzilla.ganglia.info/cgi-bin/bugzilla/attachment.cgi?id=165&action=view
 
  3.1.1 is already in testing and since this bug is not a showstopper for 
 that
  specific release, I'd be surprised if the release manager decides it should
  be backported to it, but that shouldn't prevent you patching your own 
  package with the proposed patch if you don't want to wait.
 
 If we are confident enough that the patch for this bug solves the problem
 
 I am fairly confident that the patch resolves the problem reported in BUG92
 and that matches the description of the problem from Cos, and that is easy
 to replicate by just sending a SIGSTOP to gmond; otherwise I wouldn't have
 proposed it here, in bugzilla and the STATUS file.
 
 then I have no problem scrapping 3.1.1, tagging and rolling 3.1.2 and
 restarting the testing period.  The whole point of the testing period is
 to flush out problems like this and then determine if the fix is important
 enough to retag and retest.
 
 agree, but doing so will delay releasing the next version of 3.1 and also
 indirectly (as I won't start that until 3.1 is out to avoid confusion and
 overstressing our limited testing resources) the next release for 3.0.
 
 3.1.1 includes a fix for a showstopper bug (gmetad crash in hierarchical
 configuration) and a fix for a high bug (instability with tcpconn.py) which
 had been already backported in fedora and gentoo distribution packages, but 
 not
 debian, our own packages or anyone using 3.1 from sources (all other
 architectures where there are yet no packages available) and that are
 therefore waiting for a 3.1.1 release.
 
 We need some feedback on how serious this problem is
 

Thanks Carlo, this is some good feedback.  I know that both Bernard and Cos 
have reported having issues with this bug.  Could either (or both) of you 
independently confirm that this patch fixes the problem?  Also, is this issue 
serious enough to stop 3.1.1, add the patch to a 3.1.2 tarball and restart the 
testing period?  Since we are only a week into the current testing cycle, this 
would delay the release of 3.1.2 by a week (considering a two week testing 
cycle for 3.1.2).  Given the bug fixes that were mentioned by Carlo that have 
already been included in the 3.1.1 tarball, is a week's delay for 3.1.2 which 
would include this gmetad patch, worth it?  The alternative would be to 
continue with 3.1.1 as scheduled and have this patch wait for the next release 
cycle which will probably be at least a couple of months out.  I need to know 
quickly so that we can get moving on a 3.1.2 tarball if that is the consensus.  
If I don't hear any feedback on this issue by Thurs (9/4), I will assume that 
we are good with 3.1.1 and allow the testing cycle to continue as scheduled.

Opinions?  Feedback? Comments?

Brad




Re: [Ganglia-general] gmetad bug when gmond host hangs

2008-08-30 Thread Brad Nicholes
 On 8/30/2008 at 12:25 AM, in message [EMAIL PROTECTED], Carlo
Marcelo Arenas Belon [EMAIL PROTECTED] wrote:
 On Fri, Aug 29, 2008 at 02:40:00PM -0400, Ofer Inbar wrote:
 Should this have made it into 3.1, or 3.1.1?  It
 doesn't look like it.
 
 There is a fix in trunk now with r1738 and unless something goes wrong with
 it, will be most likely released with 3.1.2 and 3.0.8.
 
 3.1.1 is already in testing and since this bug is not a showstopper for that
 specific release, I'd be surprised if the release manager decides it should
 be backported to it, but that shouldn't prevent you patching your own 
 package
 with the proposed patch if you don't want to wait.
 

If we are confident enough that the patch for this bug solves the problem, then 
I have no problem scrapping 3.1.1, tagging and rolling 3.1.2 and restarting the 
testing period.  The whole point of the testing period is to flush out problems 
like this and then determine if the fix is important enough to retag and 
retest.  We need some feedback on how serious this problem is and how confident 
we are in the fix.  I would rather throw away 3.1.1 now in favor of a fixed 
3.1.2 than have to do this all over again next month.

Brad




[Ganglia-general] [ANNOUNCEMENT] Ganglia 3.1.1 tarball ready for testing...

2008-08-25 Thread Brad Nicholes
In an effort to continue improving the Ganglia software, the Ganglia Project  
has released an official testing release of Ganglia 3.1.1.  The testing tarball 
is available for immediate download at:

http://www.ganglia.info/testing/ 

The intent of this testing release of Ganglia 3.1.1 is to validate that 
the source code is stable and that the bug fixes and enhancements that have
been added since the previous release of the software, are ready for general 
release.  The release procedure from this point has been documented on the 
Ganglia wiki site at http://ganglia.wiki.sourceforge.net/ganglia_works under 
the heading Generating a Release Candidate and GA Release.  

Basically the Ganglia 3.1.1 testing tarball has been rolled and made available 
for testing by the Ganglia community.  All bugs found in this testing release 
should be immediately reported through bugzilla (http://bugzilla.ganglia.info) 
and can be posted to the [EMAIL PROTECTED] mailing list 
as well.  If the bug report is also accompanied by a bug fix patch, this will 
help avoid delays in producing new testing tarballs and ultimately an official 
general release of the software.  If any critical level bugs are discovered, 
the current testing release tarball will be thrown away and a new tarball will 
be rolled and made available for further testing.  Once a testing release 
tarball has been validated by the Ganglia community to be stable and ready for 
general availability, that tarball will become the official Ganglia 3.1.x 
release.  So basically the sooner we are able to test and validate the Ganglia 
3.1 source code, the sooner the project will be able to create an official 
release.  But we need your help to get this done.  Any and all testing and 
feedback, positive or negative, will be greatly appreciated.

There will be a two week testing period for this 3.1.1 tarball which begins from
the date of this announcement.  So please help us to make sure that the tarball
is valid and stable by building and installing it on any size of testing 
environment.

Known issues with this testing release will be addressed on the Ganglia wiki
site at:

http://ganglia.wiki.sourceforge.net/Testing_Release_Notes 


For those who are interested in upgrading from a current 3.0.x installation, 
please see the current release notes at:

http://ganglia.wiki.sourceforge.net/ganglia_release_notes 

Supported platforms (additional testing requested):

  * Linux (Fedora/RedHat/CentOS, Debian, Gentoo, SuSE/OpenSuSE)
  * [Open]Solaris
  * FreeBSD
  * NetBSD
  * OpenBSD
  * DragonflyBSD
  * Cygwin (no support for DSO yet)
  * AIX (no support for DSO yet)

Please read all the README, INSTALL and other available documentation 
(http://ganglia.wiki.sourceforge.net) as a lot of things have changed since 
3.0.7. Use good deployment practices when upgrading from 3.0.x to make sure 
that you do not mix gmond 3.0 and 3.1 nodes in the same cluster (as defined by 
a multicast address or unicast collector node).  The protocol that allows gmond 
nodes to communicate within the same cluster, has changed.  However the XML 
packets that are passed between gmond and gmetad have remained compatible from 
3.0.x to 3.1.x, allowing a 3.0.x gmetad to continue to pull data from a newer 
3.1.x gmond cluster.
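
As a rough illustration of the "do not mix 3.0 and 3.1 gmond nodes in one cluster" rule above, a pre-upgrade check could look something like the sketch below.  The inventory dictionary, host names and function name are hypothetical; Ganglia itself does not ship such a helper, and you would have to fill the inventory from your own deployment records.

```python
from collections import defaultdict


def mixed_version_clusters(inventory):
    """Return {cluster: [series, ...]} for clusters that mix gmond series.

    inventory maps host name -> (cluster name, gmond version string).
    Only the major.minor part ("3.0", "3.1") is compared, because the
    wire protocol changed between those two series.
    """
    series_by_cluster = defaultdict(set)
    for host, (cluster, version) in inventory.items():
        major_minor = ".".join(version.split(".")[:2])
        series_by_cluster[cluster].add(major_minor)
    return {c: sorted(s) for c, s in series_by_cluster.items() if len(s) > 1}


# Hypothetical inventory, for illustration only.
inventory = {
    "node01": ("compute", "3.0.7"),
    "node02": ("compute", "3.1.1"),
    "node03": ("storage", "3.1.0"),
    "node04": ("storage", "3.1.1"),
}

print(mixed_version_clusters(inventory))
```

With this made-up inventory only the "compute" cluster is flagged, since "storage" runs 3.1.x everywhere.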

happy testing





Re: [Ganglia-general] Interrupted system call error shown by gmond -d 9

2008-08-15 Thread Brad Nicholes
 On 8/15/2008 at 4:11 PM, in message
[EMAIL PROTECTED], Sid Stuart
[EMAIL PROTECTED] wrote:
 Hi,
 
 Has anyone else seen this error when running gmond in debug mode (gmond -d
 9)?
 
 loaded module: python_module
 udp_recv_channel mcast_join=239.2.11.82 mcast_if=NULL port=8649 bind=239.2.11.82
 Exception in thread Thread-1:
 Traceback (most recent call last):
   File "/usr/lib64/python2.4/threading.py", line 442, in __bootstrap
     self.run()
   File "/usr/lib64/ganglia/python_modules/tcpconn.py", line 252, in run
     poll_events = fd_poll.poll()
 error: (4, 'Interrupted system call')
 
 tcp_accept_channel bind=NULL port=8649
 udp_send_channel mcast_join=239.2.11.82 mcast_if=NULL host=NULL port=8649
 
 I know tcpconn.py has an issue with gmond -m, but this is with gmond in
 standard mode. I am looking at it because all the tcpconn graphs on my
 system display lines that are zeroed (no values).
 
 Note I am running the Fedora Ganglia RPM's and they have been buggy in the
 past. :)
 Sid

Yes, this has been fixed in trunk and the 3.1.x branch.  If you want to try it, 
just grab the latest version of tcpconn.py from SVN and copy it to the python 
modules directory.  Then restart gmond and see if that solves the problem.
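
For anyone curious about the general shape of this kind of problem: error 4 is EINTR, which a blocking call such as poll() can return when a signal arrives.  A common way to cope is to simply retry the call.  The sketch below shows that idea in current Python 3 syntax; it is not the actual change made to tcpconn.py in SVN, and the helper name is made up.  (Python 3.5 and later already retry most interrupted system calls on their own, so this mostly matters for older interpreters.)

```python
import errno


def retry_on_eintr(func, *args, **kwargs):
    """Call func and retry for as long as it fails with EINTR.

    Any other OSError is re-raised unchanged.
    """
    while True:
        try:
            return func(*args, **kwargs)
        except OSError as exc:
            if exc.errno != errno.EINTR:
                raise
            # Interrupted by a signal: loop around and call func again.


# Usage in the spirit of the traceback above (fd_poll is the poll object
# named there; this line is illustrative and not taken from tcpconn.py):
#   poll_events = retry_on_eintr(fd_poll.poll)
```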

Brad




Re: [Ganglia-general] Patch - no_extra_data

2008-08-14 Thread Brad Nicholes
 On Thu, Aug 14, 2008 at  3:33 PM, in message
[EMAIL PROTECTED], Doug Nordwall [EMAIL PROTECTED]
wrote: 
 Here's a patch for ganglia. It allows the no_extra_data option to be
 added to the config file. When this is set to yes, it will not send
 any EXTRA_DATA or EXTRA_ELEMENTS in the xml.  When set to no (the
 default), the normal ganglia output is kept.
 
 It is against 3.1.0. Documentation has been patched as well.
 
 


I updated the patch so that it works against trunk.  I also changed the 
directive to be positive rather than negative (allow_extra_data vs 
no_extra_data).  Check out the new patch in bugzilla.
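
To give a feel for what turning the extra data off does to the XML, here is a small sketch that strips EXTRA_DATA blocks from a hand-written sample.  It only mimics the effect on the output; it is not the patch itself, and the sample XML (element layout, attribute names, values) is an abbreviated assumption rather than real gmond output.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated sample; real gmond output carries more attributes.
SAMPLE = """<HOST NAME="node01">
  <METRIC NAME="cpu_user" VAL="1.5" TYPE="float" UNITS="%">
    <EXTRA_DATA>
      <EXTRA_ELEMENT NAME="GROUP" VAL="cpu"/>
    </EXTRA_DATA>
  </METRIC>
</HOST>"""


def strip_extra_data(xml_text):
    """Return the XML with every EXTRA_DATA child of a METRIC removed."""
    root = ET.fromstring(xml_text)
    # Materialize the list first so the tree is not modified while iterating.
    for metric in list(root.iter("METRIC")):
        for extra in metric.findall("EXTRA_DATA"):
            metric.remove(extra)
    return ET.tostring(root, encoding="unicode")


stripped = strip_extra_data(SAMPLE)
print(len(SAMPLE), "->", len(stripped), "characters")
```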

Also the documentation portion of the patch couldn't be applied.  The .pod 
files need to be updated.

Brad




Re: [Ganglia-general] Debugging Gmond Python Metric Module

2008-08-13 Thread Brad Nicholes
 On 8/13/2008 at 10:42 AM, in message
[EMAIL PROTECTED], Sid Stuart
[EMAIL PROTECTED] wrote:
 


 After fixing a tabbing bug in your cacheHits() function, everything loaded
 fine and the callback function was called as it should be.  The callback
 didn't actually work on my system, but that is a different problem.  Make
 sure that you don't have any python syntax errors in your module.  Otherwise
 the module won't load.

 Brad

 Hi Brad,
 
 Thanks for going to all the effort. I think the tabbing bug was inserted by
 cut and paste as the cacheHits() function works on my system.
 
 I am beginning to believe the problem is with running any Python Metric
 Module. I have commented my module out of the system and inserted the
 tcpconn.pyconf into the configuration. When I run gmond -d 9 in that
 configuration, I get a similar error message,
 
 gmond -d 9
 loaded module: core_metrics
 loaded module: cpu_module
 loaded module: disk_module
 loaded module: load_module
 loaded module: mem_module
 loaded module: net_module
 loaded module: proc_module
 loaded module: sys_module
 udp_recv_channel mcast_join=239.2.11.82 mcast_if=NULL port=8649 bind=239.2.11.82
 tcp_accept_channel bind=NULL port=8649
 udp_send_channel mcast_join=239.2.11.82 mcast_if=NULL host=NULL port=8649
 
 Unable to collect metric 'tcp_established' on this platform. Exiting.
 
 The tcp_established metric is the first metric in the tcpconn collection
 group. Are Python Metric Modules supposed to work with gmond version
 3.1.0.1399?
 
 Sid

Yes, python metric module support was one of the major features of Ganglia 3.1. 
 Which version of python are you using?  There are some issues with 
versions older than 2.4.  Did you try the -m parameter on gmond to see if your 
metric is showing up in the list?  In the debug listing that you provided, I 
don't see mod_python.so being loaded.  Are you sure that mod_python is 
configured and loaded?  You should have a modpython.conf file along with the 
rest of your .conf files in /etc/ganglia/conf.d.  If mod_python isn't even 
being loaded, then there isn't any python module support.
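
The two checks mentioned above (Python version and the presence of modpython.conf) can be scripted roughly as follows.  This is only a sketch: the function name is made up, and it merely looks for the file in the conf.d directory; it cannot tell whether gmond actually loaded mod_python, so gmond -d 9 and gmond -m remain the real test.

```python
import os
import sys


def python_module_problems(conf_dir, version_info=sys.version_info):
    """Return a list of human-readable problems; an empty list means none found.

    conf_dir is the directory holding the .conf files, for example
    /etc/ganglia/conf.d as mentioned in this thread.
    """
    problems = []
    if tuple(version_info[:2]) < (2, 4):
        problems.append("Python older than 2.4")
    if not os.path.isfile(os.path.join(conf_dir, "modpython.conf")):
        problems.append("modpython.conf not found in " + conf_dir)
    return problems


if __name__ == "__main__":
    for problem in python_module_problems("/etc/ganglia/conf.d"):
        print(problem)
```

Note that the version check looks at the interpreter running the script, which may not be the one gmond was built against.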

Brad





Re: [Ganglia-general] Debugging Gmond Python Metric Module

2008-08-12 Thread Brad Nicholes
 On 8/12/2008 at 3:03 PM, in message
[EMAIL PROTECTED], Sid Stuart
[EMAIL PROTECTED] wrote:
 Hi,
 
 I have written a small Python metric module that contains one metric,
 CacheHits. When the module is included in the configuration, gmond spits out
 the following error message,
 
 Unable to collect metric 'CacheHits' on this platform. Exiting.
 
 As far as I can tell, gmond is parsing the configuration file and accepting
 the data in it, but cannot seem to find/run the handler for the metric. I
 have compiled and run the Python module and it works standalone. Is there
 any way to get gmond to provide more detail on what is failing?
 
 The configuration file and the python module are included below,
 

This message usually means that something was wrong with the metric definition 
and gmond doesn't recognize the metric name.  If you invoke gmond with a -m 
parameter to list out all of the metrics, does your CacheHits metric show up in 
the list?  If not then your module either isn't being loaded or the metric 
definition that is returned by your module has a problem.
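
For reference, a minimal module with a metric definition looks roughly like the sketch below.  It follows the descriptor layout used by the example Python modules that ship with Ganglia 3.1 as I understand it; the values (units, group, the placeholder 42) are assumptions for illustration, and this is not the module that was attached to the original question.

```python
def cache_hits(name):
    """Callback: gmond passes the metric name and expects the current value."""
    return 42  # placeholder; a real module would read an actual counter here


def metric_init(params):
    """Called once by gmond at load time; returns the metric descriptors."""
    descriptor = {
        "name": "CacheHits",        # must match the name used in the .pyconf
        "call_back": cache_hits,
        "time_max": 90,
        "value_type": "uint",
        "units": "hits",
        "slope": "both",
        "format": "%u",
        "description": "Number of cache hits",
        "groups": "cache",
    }
    return [descriptor]


def metric_cleanup():
    """Called by gmond on shutdown; nothing to release in this sketch."""
    pass


if __name__ == "__main__":
    # Standalone smoke test, independent of gmond.
    for d in metric_init({}):
        print(d["name"], d["call_back"](d["name"]))
```

If the name in the descriptor and the name in the collection group differ, even only in case, gmond has no metric to match.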

Brad




Re: [Ganglia-general] Debugging Gmond Python Metric Module

2008-08-12 Thread Brad Nicholes
 On 8/12/2008 at 3:03 PM, in message
[EMAIL PROTECTED], Sid Stuart
[EMAIL PROTECTED] wrote:
 Hi,
 
 I have written a small Python metric module that contains one metric,
 CacheHits. When the module is included in the configuration, gmond spits out
 the following error message,
 
 Unable to collect metric 'CacheHits' on this platform. Exiting.
 
 As far as I can tell, gmond is parsing the configuration file and accepting
 the data in it, but cannot seem to find/run the handler for the metric. I
 have compiled and run the Python module and it works standalone. Is there
 any way to get gmond to provide more detail on what is failing?
 

After fixing a tabbing bug in your cacheHits() function, everything loaded fine 
and the callback function was called as it should be.  The callback didn't 
actually work on my system, but that is a different problem.  Make sure that 
you don't have any python syntax errors in your module.  Otherwise the module 
won't load.
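
A quick way to catch this kind of thing before gmond ever sees the module is to compile the source yourself.  The helper below is a sketch (the name is made up) using Python's built-in compile(); it reports syntax, indentation and tab errors but obviously cannot tell whether the callback returns sensible values.

```python
def syntax_problem(source, filename="<module>"):
    """Return a short description of the first syntax error, or None if clean."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as exc:  # also covers IndentationError and TabError
        return "%s: line %s: %s" % (type(exc).__name__, exc.lineno, exc.msg)
    return None


good = "def cacheHits(name):\n    return 1\n"
bad = "def cacheHits(name):\nreturn 1\n"  # body lost its indentation

print(syntax_problem(good))
print(syntax_problem(bad))
```

For a file on disk, read it and pass the text in, or run python -m py_compile on it.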

Brad




Re: [Ganglia-general] [ANNOUNCEMENT] Official release of Ganglia 3.1.0

2008-07-31 Thread Brad Nicholes
  For those that haven't seen it yet, check out the press release that 
Groundwork Open Source did for the Ganglia 3.1 release.

http://www.marketwatch.com/news/story/ganglia-31-provides-easy-customize/story.aspx?guid=%7B39B62CFF-06F1-40DC-A00F-3F4A5B8FFEB1%7Ddist=hppr

Thanks to everybody that helped get Ganglia 3.1.0 out the door.

Brad

 On 7/30/2008 at 2:42 PM, in message [EMAIL PROTECTED], Brad
Nicholes [EMAIL PROTECTED] wrote:
 The Ganglia Project (http://ganglia.info) is pleased to announce the first 
 official release of Ganglia 3.1.0.  The official tarball is available for 
 immediate download at:
 
 http://sourceforge.net/project/showfiles.php?group_id=43021package_id=35280release_id=616721
 
 Please refer to http://ganglia.wiki.sourceforge.net/ganglia_release_notes 
 for more information.
 
 The main features of this release are:
 
   * Introduction of a modular metric interface for C and Python (DSO support)
   * Scriptable metric module support with Python
   * All pre-existing metrics (CPU, network, disk, memory, etc.) converted
     to metric modules
   * Introduction of new metric modules multicpu, multidisk and tcp_conn status
   * Modular frontend graph support
   * Metric groups which can be viewed or hidden as desired   
   * Additional scaling capacity for systems with memory greater than 4TB
   * Platform support for DragonFlyBSD
   * Improved native metric support for Windows (Built with CygWin)
   * Bug fixes and Enhancements
 
 Supported platforms:
 
   * Linux (Fedora/RedHat/CentOS, Debian, Gentoo, SuSE/OpenSuSE)
   * [Open]Solaris
   * FreeBSD
   * NetBSD
   * OpenBSD
   * DragonflyBSD
   * Cygwin (no support for DSO yet)
   * AIX (no support for DSO yet)
 
 Please read all the README, INSTALL and other available documentation 
 (http://ganglia.wiki.sourceforge.net) as a lot of things have changed since 
 3.0.7. Use good deployment practices when upgrading from 3.0.x to make sure 
 that you do not mix gmond 3.0 and 3.1 nodes in the same cluster (as defined 
 by a multicast address or unicast collector node).  The protocol that 
 allows gmond nodes to communicate within the same cluster, has changed.  
 However the XML packets that are passed between gmond and gmetad have 
 remained compatible from 3.0.x to 3.1.x, allowing a 3.0.x gmetad to continue 
 to pull data from a newer 3.1.x gmond cluster.
 
 For those who are interested in upgrading from a 3.0.x installation, your 
 current gmond and gmetad configuration files will need to be moved from their
 current location to /etc/ganglia.  If you are attempting the upgrade via an 
 RPM, the RPM will automatically move your current configuration file to the 
 new location. However, for gmond, the 3.0.x conf file will not work. Please 
 use the patch file gmond-3.1.patch available at 
 http://www.ganglia.info/releases/ to patch your gmond.conf prior to 
 starting, otherwise gmond will fail to startup.
 
 There are several known issues with the current release which include the 
 following:
 
   * no support for C++ to create DSO modules
   * no spoofing from modular metrics (use gmetric if spoofing is needed)
   * race condition for tcpconn python metric module (affects gmond -m)
   * libdir issues related to building for 64bit platforms
   * known build issues for platforms:
    - Darwin (AKA MacOS/X)
    - HPUX
    - Tru64 (AKA OSF/1)
    - Irix
 
 Many of the above issues are being addressed and patches will be applied for 
 the next minor release of Ganglia 3.1.x.  In addition more information about 
 the current official release, can be found on the Ganglia wiki at 
 http://ganglia.wiki.sourceforge.net/ganglia_release_notes.
 
 
 Ganglia Development Team
 
 
 
 





[Ganglia-general] [ANNOUNCEMENT] Official release of Ganglia 3.1.0

2008-07-30 Thread Brad Nicholes
The Ganglia Project (http://ganglia.info) is pleased to announce the first 
official release of Ganglia 3.1.0.  The official tarball is available for 
immediate download at:

http://sourceforge.net/project/showfiles.php?group_id=43021package_id=35280release_id=616721
 

Please refer to http://ganglia.wiki.sourceforge.net/ganglia_release_notes 
for more information.

The main features of this release are:

  * Introduction of a modular metric interface for C and Python (DSO support)
  * Scriptable metric module support with Python
  * All pre-existing metrics (CPU, network, disk, memory, etc.) converted 
 to metric modules
  * Introduction of new metric modules multicpu, multidisk and tcp_conn status
  * Modular frontend graph support
  * Metric groups which can be viewed or hidden as desired   
  * Additional scaling capacity for systems with memory greater than 4TB
  * Platform support for DragonFlyBSD
  * Improved native metric support for Windows (Built with CygWin)
  * Bug fixes and Enhancements

Supported platforms:

  * Linux (Fedora/RedHat/CentOS, Debian, Gentoo, SuSE/OpenSuSE)
  * [Open]Solaris
  * FreeBSD
  * NetBSD
  * OpenBSD
  * DragonflyBSD
  * Cygwin (no support for DSO yet)
  * AIX (no support for DSO yet)

Please read all the README, INSTALL and other available documentation 
(http://ganglia.wiki.sourceforge.net) as a lot of things have changed since 
3.0.7. Use good deployment practices when upgrading from 3.0.x to make sure 
that you do not mix gmond 3.0 and 3.1 nodes in the same cluster (as defined 
by a multicast address or unicast collector node).  The protocol that 
allows gmond nodes to communicate within the same cluster, has changed.  
However the XML packets that are passed between gmond and gmetad have 
remained compatible from 3.0.x to 3.1.x, allowing a 3.0.x gmetad to continue 
to pull data from a newer 3.1.x gmond cluster.

For those who are interested in upgrading from a 3.0.x installation, your 
current gmond and gmetad configuration files will need to be moved from their 
current location to /etc/ganglia.  If you are attempting the upgrade via an 
RPM, the RPM will automatically move your current configuration file to the 
new location. However, for gmond, the 3.0.x conf file will not work. Please 
use the patch file gmond-3.1.patch available at 
http://www.ganglia.info/releases/ to patch your gmond.conf prior to 
starting, otherwise gmond will fail to startup.
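
For a non-RPM upgrade, the file move described above can be planned with a small dry run like the sketch below.  The old locations passed in are hypothetical (use wherever your 3.0.x files actually live), nothing is moved by this code, and the gmond-3.1.patch step still has to be applied to gmond.conf separately before gmond is started.

```python
import posixpath


def plan_config_moves(old_paths, new_dir="/etc/ganglia"):
    """Return (source, destination) pairs for moving config files to new_dir."""
    return [(old, posixpath.join(new_dir, posixpath.basename(old)))
            for old in old_paths]


# Hypothetical 3.0.x locations; adjust to your installation.
for src, dst in plan_config_moves(["/etc/gmond.conf", "/etc/gmetad.conf"]):
    print("mv %s %s" % (src, dst))
```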

There are several known issues with the current release which include the 
following:

  * no support for C++ to create DSO modules
  * no spoofing from modular metrics (use gmetric if spoofing is needed)
  * race condition for tcpconn python metric module (affects gmond -m)
  * libdir issues related to building for 64bit platforms
  * known build issues for platforms:
   - Darwin (AKA MacOS/X)
   - HPUX
   - Tru64 (AKA OSF/1)
   - Irix

Many of the above issues are being addressed and patches will be applied for 
the next minor release of Ganglia 3.1.x.  In addition more information about 
the current official release, can be found on the Ganglia wiki at 
http://ganglia.wiki.sourceforge.net/ganglia_release_notes.


Ganglia Development Team






Re: [Ganglia-general] [Ganglia-developers] [ANNOUNCEMENT] Ganglia 3.1.0 tarball ready for testing...

2008-07-29 Thread Brad Nicholes
  It has been two weeks since the announcement of the availability of the 
Ganglia 3.1.0 testing tarball.  Since that time, I haven't seen any reports of 
showstopper issues.  Unless there are any objections or critical bug reports 
that have not yet been reported, I propose that we release the 3.1.0 tarball as 
the first official release of the Ganglia 3.1.x series.

Comments? Votes?


Brad

 On 7/15/2008 at 1:57 PM, in message [EMAIL PROTECTED], Brad
Nicholes [EMAIL PROTECTED] wrote:
 The Ganglia Project is pleased to announce the first official testing release 
 of Ganglia 3.1.x.  The testing tarball is available for immediate download 
 at:
 
 http://www.ganglia.info/testing/ 
 
 The intent of this first testing release of Ganglia 3.1.x is to validate 
 that the source code is stable and that the new feature set that is included 
 in the 3.1 version of the software, is ready for general release.  The 
 release procedure from this point has been documented on the Ganglia wiki 
 site at http://ganglia.wiki.sourceforge.net/ganglia_works under the heading 
 Generating a Release Candidate and GA Release.  
 
 Basically the Ganglia 3.1.0 testing tarball has been rolled and made 
 available for testing by the Ganglia community.  All bugs found in this 
 testing release should be immediately reported through bugzilla 
 (http://bugzilla.ganglia.info) and can be posted to the 
 [EMAIL PROTECTED] mailing list as well.  If the bug 
 report is also accompanied by a bug fix patch, this will help avoid delays in 
 producing new testing tarballs and ultimately an official general release of 
 the software.  If any critical level bugs are discovered, the current testing 
 release tarball will be thrown away and a new tarball will be rolled and made 
 available for further testing.  Once a testing release tarball has been 
 validated by the Ganglia community to be stable and ready for general 
 availability, that tarball will become the official Ganglia 3.1.x release.  
 So basically the sooner we are able to test and validate the Ganglia 3.1 
 source code, the sooner the project will be able to create an official
 release.  But we need your help to get this done.  Any 
 and all testing and feedback, positive or negative, will be greatly 
 appreciated.
 
 There are several known issues with the current release which include the 
 following:
 
 * no support for C++ to create DSO modules
 * no spoofing from modular metrics (use gmetric if spoofing is needed)
 * race condition for tcpconn python metric module (affects gmond -m)
 * libdir issues related to building for 64bit platforms
 * known build issues for platforms:
    - Darwin (AKA MacOS/X)
    - HPUX
    - Tru64 (AKA OSF/1)
    - Irix
 
 Many of the above issues are being addressed and patches will be applied for 
 the next minor release of Ganglia 3.1.x.  In addition more information about 
 the current testing release or official release, can be found on the Ganglia 
 wiki at http://ganglia.wiki.sourceforge.net/ganglia_release_notes.
 
 For those who are interested in upgrading from a current 3.0.x installation, 
 your current gmond and gmetad configuration files will need to be moved from 
 their current location to /etc/ganglia.  If you are attempting the upgrade 
 via an RPM, the RPM will automatically move your current configuration file 
 to the new location. However, for gmond, the 3.0.x conf file will not work. 
 Please use the patch file gmond-3.1.patch (available from the testing URL 
 above) to patch your gmond.conf prior to starting, otherwise gmond will fail 
 to startup.
 
 The main features of this release are :
 
   * Dynamically loaded metric module support (DSO)
   * Scriptable metric module support with Python
   * Modular frontend graph support
   * Platform support for DragonFlyBSD
   * Improved native metric support for Windows (Built with CygWin)
   * Bug fixes and Enhancements
 
 Supported platforms (additional testing requested):
 
   * Linux (Fedora/RedHat/CentOS, Debian, Gentoo, SuSE/OpenSuSE)
   * [Open]Solaris
   * FreeBSD
   * NetBSD
   * OpenBSD
   * DragonflyBSD
   * Cygwin (no support for DSO yet)
   * AIX (no support for DSO yet)
 
 Please read all the README, INSTALL and other available documentation 
 (http://ganglia.wiki.sourceforge.net) as a lot of things have changed since 
 3.0.7. Use good deployment practices when upgrading from 3.0.x to make sure 
 that you do not mix gmond 3.0 and 3.1 nodes in the same cluster (as defined 
 by a multicast address or unicast collector node).  The protocol that allows 
 gmond nodes to communicate within the same cluster, has changed.  However the 
 XML packets that are passed between gmond and gmetad have remained compatible 
 from 3.0.x to 3.1.x, allowing a 3.0.x gmetad to continue to pull data from a 
 newer 3.1.x gmond cluster.
 
 happy testing
 
 

Re: [Ganglia-general] [Ganglia-developers] [ANNOUNCEMENT] Ganglia 3.1.0 tarball ready for testing...

2008-07-29 Thread Brad Nicholes
 On 7/29/2008 at 10:06 AM, in message
[EMAIL PROTECTED], Marc
Van Kerkhoven1 [EMAIL PROTECTED] wrote:
 Hi Brad,
 One minor bug would be that the gmetrics link is no longer visible in the 
 host view.   Not sure if this is because I have done anything wrong, but 
 it's pretty much a vanilla install.
 Not sure if this is intentional with the introduction of the python 
 modules.
 
 kind regards,
 Marc van Kerkhoven

Thanks for installing and testing the 3.1 testing tarball.  Removing the 
gmetric link from the host view was intentional.  In Ganglia 3.1, everything is 
reported as a metric rather than differentiating between a base metric vs a 
module metric vs a gmetric.  There was some discussion on the mailing list 
about this at http://www.mail-archive.com/[EMAIL PROTECTED]/msg04135.html .  So 
if gmetric is used to feed gmond with extra metric data, the gmetric will just 
show up alongside the rest of the standard metrics.
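
As an example of feeding gmond an extra metric this way, the sketch below builds a gmetric command line from Python.  The metric name, value and units are made up, and the helper itself is hypothetical; only the gmetric options (--name, --value, --type, --units) come from the tool.  The command is built but not executed here.

```python
def build_gmetric_command(name, value, value_type, units=""):
    """Return the argument list for a gmetric invocation."""
    cmd = ["gmetric", "--name", name, "--value", str(value), "--type", value_type]
    if units:
        cmd += ["--units", units]
    return cmd


cmd = build_gmetric_command("cache_hits", 42, "uint32", "hits")
print(" ".join(cmd))

# To actually send the metric (requires gmetric and a working gmond.conf):
#   import subprocess
#   subprocess.check_call(cmd)
```

Once sent, the value appears in the host view next to the built-in metrics rather than under a separate gmetric link.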

Brad




Re: [Ganglia-general] [Ganglia-developers] [ANNOUNCEMENT] Ganglia 3.1.0 tarball ready for testing...

2008-07-29 Thread Brad Nicholes
 On 7/29/2008 at 11:18 AM, in message
[EMAIL PROTECTED], Bernard Li
[EMAIL PROTECTED] wrote:
 Hi Brad:
 
 On Tue, Jul 29, 2008 at 9:27 AM, Brad Nicholes [EMAIL PROTECTED] wrote:
 
 Thanks for installing and testing the 3.1 testing tarball.  Removing the 
 gmetric link from the host view was intentional.  In Ganglia 3.1, everything 
 is reported as a metric rather than differentiating between a base metric vs 
 a module metric vs a gmetric.  There was some discussion on the mailing list 
 about this at 
 http://www.mail-archive.com/[EMAIL PROTECTED]/msg04135.html .  So if gmetric 
 is used to feed gmond with extra metric data, the gmetric will just show up 
 alongside the rest of the standard metrics.
 
 Maybe we should document this as part of the changes for 3.1.0 in the Wiki?
 

Sounds like a good idea.

Brad




[Ganglia-general] Current testing cycle for Ganglia 3.1 release...

2008-07-21 Thread Brad Nicholes
   Just as a reminder, there is currently a testing release of Ganglia 3.1 
available for immediate testing and feedback.  This testing release is 
available at:

http://www.ganglia.info/testing/ 

Please see the previous announcement for more information.  
http://www.mail-archive.com/[EMAIL PROTECTED]/msg04510.html 

The testing release has been available for a week and our current goal is to 
finish up testing and validation over the next week.  If any major bugs are 
discovered within this time period, we would like to address them and determine 
if a follow up testing release and testing period is necessary.  But we really 
need the community's help.  Please download the testing release, install it on 
a staging machine and give it a try.  Let us know how it goes and especially if 
you run into any major issues with installation or execution.  Any and all 
feedback, positive or negative, will be welcome.

thanks,
Brad




Re: [Ganglia-general] gmetad giving high TN values

2008-06-25 Thread Brad Nicholes
 On 6/25/2008 at 12:13 PM, in message
[EMAIL PROTECTED], Kirk McDonald
[EMAIL PROTECTED] wrote:
 I have a gmetad which probes a number of gmonds, and each gmond has a
 number of hosts associated with it. When I scrape the XML from each of
 the gmonds probed by gmetad myself, the TN value for each host looks
 good (they average well under 10 seconds). However, when I scrape the
 XML from gmetad, the TN values for each host are much higher, enough
 so that it begins marking many of the hosts as down. I was wondering
 what could cause this to happen.
 

One possibility is the time difference between the gmond nodes and the gmetad 
host.  Gmetad will try to normalize all of the timestamps based on its own 
timestamp.  If there is a big time difference between a gmond node and the 
gmetad host, the calculation will be skewed. 
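
To narrow this down, it can help to compare the TN that gmond reports for each host with the TN that gmetad reports for the same host.  The sketch below does that for two XML dumps.  The sample documents are heavily abbreviated and their values are invented for illustration; only the HOST element with NAME and TN attributes is relied on, and the function names are made up.

```python
import xml.etree.ElementTree as ET


def host_tn(xml_text):
    """Map host name -> TN (seconds since last report) from a Ganglia XML dump."""
    root = ET.fromstring(xml_text)
    return {h.get("NAME"): int(h.get("TN")) for h in root.iter("HOST")}


def tn_inflation(gmond_xml, gmetad_xml):
    """Per host, how much larger TN is when read from gmetad than from gmond."""
    direct = host_tn(gmond_xml)
    via_gmetad = host_tn(gmetad_xml)
    return {name: via_gmetad[name] - tn
            for name, tn in direct.items() if name in via_gmetad}


# Invented, abbreviated samples (real dumps carry many more attributes).
GMOND = ('<CLUSTER NAME="c1">'
         '<HOST NAME="node01" TN="8" TMAX="20"/>'
         '<HOST NAME="node02" TN="5" TMAX="20"/>'
         '</CLUSTER>')
GMETAD = ('<GRID NAME="g"><CLUSTER NAME="c1">'
          '<HOST NAME="node01" TN="95" TMAX="20"/>'
          '<HOST NAME="node02" TN="6" TMAX="20"/>'
          '</CLUSTER></GRID>')

print(tn_inflation(GMOND, GMETAD))
```

A roughly constant inflation across all hosts of one gmond would point toward a clock offset; inflation that grows with the number of hosts points elsewhere.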

Brad


-
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php
___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general


Re: [Ganglia-general] gmetad giving high TN values

2008-06-25 Thread Brad Nicholes
 On 6/25/2008 at 1:18 PM, in message
[EMAIL PROTECTED], Kirk McDonald
[EMAIL PROTECTED] wrote:
 On Wed, Jun 25, 2008 at 11:48 AM, Brad Nicholes [EMAIL PROTECTED] wrote:
 On 6/25/2008 at 12:13 PM, in message
 [EMAIL PROTECTED], Kirk McDonald
 [EMAIL PROTECTED] wrote:
 I have a gmetad which probes a number of gmonds, and each gmond has a
 number of hosts associated with it. When I scrape the XML from each of
 the gmonds probed by gmetad myself, the TN value for each host looks
 good (they average well under 10 seconds). However, when I scrape the
 XML from gmetad, the TN values for each host are much higher, enough
 so that it begins marking many of the hosts as down. I was wondering
 what could cause this to happen.


 One possibility is the time difference between the gmond nodes and the 
 gmetad host.  Gmetad will try to normalize all of the timestamps based on its 
 own timestamp.  If there is a big time difference between a gmond node and 
 the gmetad host, the calculation will be skewed.

 Brad

 
 I do not think this is the problem. The problem becomes more and less
 apparent if I bring up or shut down gmond on portions of my hosts. If
 all of the gmonds on all of the hosts are up and running, the average
 TN creeps up and large portions of the grid are marked down. (The
 portions marked down appear to be more or less random.) If I shut down
 a significant portion of the gmonds, the average TN on gmetad drops,
 and the hosts are marked up. (That is, ignoring the TN on the machines
 I actually took down, which naturally rises continually.) I would
 expect a calculation error like that to be independent of the number
 of hosts being monitored. There is a definite correlation between
 average TN (as reported by gmetad) and the number of hosts being
 monitored.
 
 -Kirk

Are all of your nodes in a single cluster?  There may be some latency issues 
with the size of the XML that gmetad has to parse.  If you create multiple 
clusters and reference them through different data sources, gmetad will create 
multiple threads which only have to parse a portion of the whole.  If gmetad is 
on a multiproc box, the multiple threads can take better advantage of the cpus 
rather than parsing everything sequentially.

Brad
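
One way to compare what gmond and gmetad report is to parse both XML dumps and list the hosts whose TN has grown too large. The following is a minimal sketch, not part of Ganglia itself. The function name, the sample XML, and the "TN greater than 4 x TMAX" rule of thumb are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def stale_hosts(xml_text, factor=4):
    """Return (name, tn, tmax) for every HOST whose TN exceeds factor * TMAX.

    The factor of 4 is an assumed rule of thumb, not a value taken from
    this thread.
    """
    root = ET.fromstring(xml_text)
    stale = []
    for host in root.iter("HOST"):
        tn = int(host.get("TN", "0"))
        tmax = int(host.get("TMAX", "0"))
        if tmax and tn > factor * tmax:
            stale.append((host.get("NAME"), tn, tmax))
    return stale


# Hypothetical two-host dump, shaped like the XML that gmond/gmetad emit.
SAMPLE = """<GANGLIA_XML VERSION="3.0.7" SOURCE="gmond">
<CLUSTER NAME="demo" LOCALTIME="1214421600">
<HOST NAME="node1" IP="10.0.0.1" REPORTED="1214421595" TN="5" TMAX="20" DMAX="0"/>
<HOST NAME="node2" IP="10.0.0.2" REPORTED="1214421480" TN="120" TMAX="20" DMAX="0"/>
</CLUSTER>
</GANGLIA_XML>"""

if __name__ == "__main__":
    for name, tn, tmax in stale_hosts(SAMPLE):
        print(f"{name}: TN={tn}, TMAX={tmax}")
```

Running the same function against the gmond dump and the gmetad dump for one cluster shows whether the inflated TN values appear only on the gmetad side.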




Re: [Ganglia-general] Compiling gmond c module

2008-06-24 Thread Brad Nicholes
You will need to figure out where the u_short conflict is coming from. 
My first guess would be to use gcc rather than g++.

Brad

 On 6/7/2008 at 8:35 PM, in message
[EMAIL PROTECTED], Fábio
Firmo
[EMAIL PROTECTED] wrote:
 Hi everyone,
 
 I'm about to introduce Ganglia in a project to take care of the
 monitoring of clusters. We need to know the cpu and memory usage for
 some specific processes, so I'm trying to write a C module for gmond
 to do this.
 I've installed ganglia 3.1.0.1361 snapshot (devel, libganglia and
 gmond), but the mod_example doesn't compile. Here is the error:
 
 $ g++ mod_example.c -I/usr/include/apr-1/  #there's a problem finding apr.h, so the need for -I
 In file included from /usr/include/gm_metric.h:7,
  from mod_example.c:33:
 /usr/include/gm_protocol.h:73: error: declaration of 'u_short Ganglia_gmetric_ushort::u_short'
 /usr/include/sys/types.h:36: error: changes meaning of 'u_short' from 'typedef __u_short u_short'
 /usr/include/gm_protocol.h:94: error: declaration of 'u_int Ganglia_gmetric_uint::u_int'
 /usr/include/sys/types.h:37: error: changes meaning of 'u_int' from 'typedef __u_int u_int'
 mod_example.c:117: warning: deprecated conversion from string constant to 'char*'
 mod_example.c:117: warning: deprecated conversion from string constant to 'char*'
 mod_example.c:117: warning: deprecated conversion from string constant to 'char*'
 mod_example.c:117: warning: deprecated conversion from string constant to 'char*'
 mod_example.c:117: warning: deprecated conversion from string constant to 'char*'
 mod_example.c:117: warning: deprecated conversion from string constant to 'char*'
 mod_example.c:117: warning: deprecated conversion from string constant to 'char*'
 mod_example.c:117: warning: deprecated conversion from string constant to 'char*'
 mod_example.c:117: warning: deprecated conversion from string constant to 'char*'
 mod_example.c:117: warning: deprecated conversion from string constant to 'char*'
 mod_example.c:119: error: redefinition of 'mmodule example_module'
 mod_example.c:42: error: 'mmodule example_module' previously declared here
 
 Can someone help me?
 
 Thanks in advance,
 Fábio
 



Re: [Ganglia-general] SPOOF option

2008-06-18 Thread Brad Nicholes
 On 6/18/2008 at 7:39 AM, in message [EMAIL PROTECTED],
LINDA DOBAI [EMAIL PROTECTED] wrote:
 Hi,
 
 These days I was testing the 3.1.0.1399 release of Ganglia , most of all 
 the Python modules feature.
 
 I managed to plug several metrics into Ganglia using the new feature and 
 it works very well.
 I saw that SPOOF_HOST and SPOOF_NAME aren't attributes of the metric 
 descriptors , but could be attached to the metric as extradata . How 
 could I attach this option to the metric? Is it possible in the 
 3.1.0.1399 release of Ganglia? Could I have more information about this?
 

It isn't possible to create a spoofing metric module in the 3.1 code yet.  This 
is functionality that I just recently added to the trunk source tree but since 
it touched so many files, I didn't want to take a chance of destabilizing the 
3.1 branch.  If you pull and build the trunk source tree, you should be able to 
build a spoofing metric module by simply adding the SPOOF_HOST and SPOOF_NAME 
to the metric description for a python module, or adding the same data as extra 
data for a C module.

Python example:

d1 = {'name': 'my_spoof_metric',
    'call_back': my_spoof_handler,
    'time_max': 90,
    'value_type': 'uint',
    'units': '%',
    'slope': 'both',
    'format': '%u',
    'description': 'Some spoofed metric',
    'groups': 'spoof',
    'spoof_host': 'spoof_ip_address:spoof_host_name',  # Same format as the gmetric -S option
    'spoof_name': 'alternate_metric_name'}  # Optionally specify an alternate metric name

Hope this helps,
Brad
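
For context, here is a minimal sketch of how such a descriptor might sit inside a complete Python metric module, using the usual metric_init / callback / metric_cleanup layout. The handler body, the 'spoof_host' module parameter, and the default address are assumptions made for illustration. As noted above, the spoof keys are only honored by the trunk source tree.

```python
import random


def my_spoof_handler(name):
    """Callback invoked with the metric name; returns the current value.

    A random percentage stands in for a real measurement here.
    """
    return random.randint(0, 100)


def metric_init(params):
    """Return the list of metric descriptors provided by this module."""
    # 'spoof_host' as a module parameter is hypothetical; the value uses the
    # same "ip:hostname" format as the gmetric -S option.
    spoof_host = params.get('spoof_host', '10.0.0.99:spoofed-node')
    d1 = {'name': 'my_spoof_metric',
          'call_back': my_spoof_handler,
          'time_max': 90,
          'value_type': 'uint',
          'units': '%',
          'slope': 'both',
          'format': '%u',
          'description': 'Some spoofed metric',
          'groups': 'spoof',
          'spoof_host': spoof_host,
          'spoof_name': 'alternate_metric_name'}
    return [d1]


def metric_cleanup():
    """Called when the module is unloaded; nothing to release in this sketch."""
    pass


if __name__ == '__main__':
    # Stand-alone smoke test, so the module can be exercised without gmond.
    for d in metric_init({}):
        print(d['name'], d['call_back'](d['name']))
```

Running the file directly is a quick way to catch syntax errors before dropping it into the Python module directory.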





Re: [Ganglia-general] UUID status?

2008-06-13 Thread Brad Nicholes
 On 6/13/2008 at 1:08 AM, in message [EMAIL PROTECTED], Carlo
Marcelo Arenas Belon [EMAIL PROTECTED] wrote:
 On Tue, Jun 10, 2008 at 04:59:57PM -0600, Brad Nicholes wrote:
 
 it can be solved using Ganglia 3.1 and the new gmetad-python rewrite.
 
 does it mean that you are planning on adding a backport of gmetad-python
 to the 3.1 release?
 

No, not unless there is a real demand for it.  The gmetad-python version is 
intended for a 3.2 or later release.  It could also be released in the meantime 
as its own subproject or snapshot if there is a demand for that.

 Carlo
 
 PS. the UUID feature will also need changes in the frontend that don't exist
 yet.






Re: [Ganglia-general] UUID status?

2008-06-10 Thread Brad Nicholes
 On 6/10/2008 at 4:45 PM, in message
[EMAIL PROTECTED], Michael Place
[EMAIL PROTECTED] wrote:
 Hi all,
 
 The ganglia wish list at
 http://ganglia.wiki.sourceforge.net/ganglia_wish-list lists the
 following gmetad todo:
 
 * Name RRD directories based on UUID generated by client gmond
 
 Can anybody report on the status of this feature? It would be extremely
 useful in our implementation.
 

Nothing has been done for this feature in the current implementation of gmetad. 
 However, the rewrite of gmetad in python will be able to accommodate this type 
of feature.  What I mean by accommodate is that the gmetad-python version has 
implemented a pluggable interface and uses an RRD plugin to write metric data 
to RRD files.  This means that rather than having the RRD functionality built 
into gmetad itself, the RRD functionality can be plugged and replaced.  That 
would mean that for most people, they would probably want to use the standard 
RRD plugin.  For you or anybody else that wanted it, you could replace the 
standard RRD plugin with one that has been modified to create directories using 
the UUID rather than host names.  As far as gathering a UUID from a host, this 
can be done already by simply implementing a UUID metric module.  Then when the 
gmetad-python RRD storage module sees the UUID metric for a host, that is what 
it would use for the directory name.  

So the short answer is that no direct work has been done for this wish list 
item.  However, it can be solved using Ganglia 3.1 and the new gmetad-python 
rewrite.

Brad
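
The directory-naming idea described above could be sketched as follows. This is not the gmetad-python plugin API; the function name, the 'uuid' metric name, and the paths are all hypothetical, chosen only to show the "use the UUID metric if the host reported one, otherwise fall back to the host name" decision.

```python
import posixpath


def rrd_directory(rrd_root, cluster, hostname, metrics):
    """Pick the RRD directory for a host.

    `metrics` is a dict of the latest metric values for the host. If it
    contains a non-empty 'uuid' entry (a hypothetical metric published by a
    UUID metric module), that value names the directory. Otherwise the
    host name is used, as in the standard layout.
    """
    leaf = metrics.get("uuid") or hostname
    return posixpath.join(rrd_root, cluster, leaf)


if __name__ == "__main__":
    root = "/var/lib/ganglia/rrds"
    print(rrd_directory(root, "cloud", "node1", {"uuid": "abc-123"}))
    print(rrd_directory(root, "cloud", "node2", {}))
```

A modified storage plugin would make this choice once per host and then write every metric file for that host under the chosen directory.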




Re: [Ganglia-general] Ganglia 3.1.0 package

2008-06-10 Thread Brad Nicholes
 On 6/10/2008 at 11:17 AM, in message
[EMAIL PROTECTED], Bernard Li
[EMAIL PROTECTED] wrote:
 Hi Stephen:
 
 On Tue, Jun 10, 2008 at 10:07 AM, Big Woobie [EMAIL PROTECTED] wrote:
 
 I'm running Redhat Linux and IBM's AIX.
 
 AIX I can't help you (I think Ulf is trying to get that working).
 
 It is possible to install on RHEL 4.x and 5.x (and clones).  If you
 let me know which version you are on, I should be able to provide you
 with the RPM dependencies, provided that your arch is x86.
 

This kind of information would also be good on our wiki FAQ page.  We can build 
up links to libraries as needed.  Many of the linux OS distros already include 
most if not all of the components.  

Brad




Re: [Ganglia-general] Plugging Metrics in 3.1.0 release

2008-06-06 Thread Brad Nicholes
 On 6/6/2008 at 2:28 AM, in message [EMAIL PROTECTED], LINDA
DOBAI [EMAIL PROTECTED] wrote:
 Thank you very much for your responses.
 
 As OS I am using Linux RedHat 5 32bits.
 
 As Ganglia version, I installed the last version that I found at the 
 following URL:
 
 http://www.ganglia.info/snapshots/3.1.x/ 
 
 which means : Ganglia 3.1.0.1361.
 
 How did I install it?
 
 I downloaded Ganglia 3.1.0.1361
 I saw that Ganglia need some other libraries so I used Yum tool to install 
 expat and apr libraries. I needed to download and install also
 confuse library and RRDtool-1.2.27.
 
 After the installation, all that worked in Ganglia 3.0.7 worked correctly in 
 this new version also. I don't know why modpython.so wasn't generated.
 
 Configure options that I used: 
 
 ./configure --enable-static-build --enable-python --with-gmetad
 

Is there a reason why you need to include --enable-static-build?  If not then 
don't use this option and let Ganglia build and link everything dynamically.  
Although the metric modules such as mod_python will work statically linked, 
they are intended to be dynamically linked for flexibility.

Brad




Re: [Ganglia-general] Plugging Metrics in 3.1.0 release

2008-06-05 Thread Brad Nicholes
 On 6/5/2008 at 9:20 AM, in message [EMAIL PROTECTED], LINDA
DOBAI [EMAIL PROTECTED] wrote:
 Dear Ganglia community:
 
 I am a beginner in Ganglia.
 I have just  started  an internship of four months and my subject is 
 related to Ganglia.
 I have to test the new highlights of release 3.1.0.
 
 I had just installed the Ganglia 3.1.0 and I managed to compile it in 
 our environment. I am now trying to analyze the module support for 
 dynamically plugging metrics into gmond and I am having some problems.
 I don't manage to generate the modpython.so in our environment. I  found 
 the sources in gmond/modules/python but I don't see how could I generate 
 the modpython.so, the only lib generated by the Ganglia tool chain are : 
 mod_python.o , mod_python.lo et libmodpython.la.
 
 I found a version of modpython.so at the official website of Ganglia, 
 but it doesn't fit to our environment.
 Is it possible to generate it myself, in my environement? If yes, how 
 could I do this?
 
 And, can I have some other details about this new feature? I don't 
 understand very well how this feature will function.
 

Which OS are you building for and what were the ./configure options that you 
used?  This will help to determine why mod_python might not be building.  Also 
there is additional information about metric module development in the README 
file here:
http://ganglia.svn.sourceforge.net/viewvc/ganglia/trunk/monitor-core/gmond/modules/python/README?revision=1103&view=markup
as well as examples of both C and Python modules here:
http://ganglia.svn.sourceforge.net/viewvc/ganglia/trunk/monitor-core/gmond/modules/example/
and
http://ganglia.svn.sourceforge.net/viewvc/ganglia/trunk/monitor-core/gmond/python_modules/example/
There is also a Ganglia presentation that describes the 3.1 version and how 
to write and install metric modules here:
http://ganglia.wiki.sourceforge.net/space/showimage/ApacheconUS2007_ganglia_monitoring.ppt

Let us know how your testing goes.  We would like to make an official release 
of 3.1 soon but that depends on community members like yourself, testing the 
code.

thanks,
Brad




Re: [Ganglia-general] Plugging Metrics in 3.1.0 release

2008-06-05 Thread Brad Nicholes
 On 6/5/2008 at 2:15 PM, in message
[EMAIL PROTECTED], Bernard Li
[EMAIL PROTECTED] wrote:
 On Thu, Jun 5, 2008 at 12:32 PM, Brad Nicholes [EMAIL PROTECTED] wrote:
 
 Which OS are you building for and what were the ./configure options that you 
 used?  This will help to determine why mod_python might not be building.  
 Also there is additional information about metric module development in the 
 README file here:
 http://ganglia.svn.sourceforge.net/viewvc/ganglia/trunk/monitor-core/gmond/modules/python/README?revision=1103&view=markup
 as well as examples of both C and Python modules here:
 http://ganglia.svn.sourceforge.net/viewvc/ganglia/trunk/monitor-core/gmond/modules/example/
 and
 http://ganglia.svn.sourceforge.net/viewvc/ganglia/trunk/monitor-core/gmond/python_modules/example/
 There is also a Ganglia presentation that describes the 3.1 version and how 
 to write and install metric modules here:
 http://ganglia.wiki.sourceforge.net/space/showimage/ApacheconUS2007_ganglia_monitoring.ppt
 
 I also started writing an entry in the wiki about the Python modules.
 Brad if you could go over it and correct any errors, that'd be great:
 
 http://ganglia.wiki.sourceforge.net/ganglia_gmond_python_modules 
 
 Cheers,
 
 Bernard

Done.  Hopefully this will be a good guide for somebody that is getting started 
with Python modules.  We still need to write an equivalent C module doc.

Brad




Re: [Ganglia-general] Plugging Metrics in 3.1.0 release

2008-06-05 Thread Brad Nicholes
 On 6/5/2008 at 4:48 PM, in message
[EMAIL PROTECTED], Bernard Li
[EMAIL PROTECTED] wrote:
 Hi Brad:
 
 On Thu, Jun 5, 2008 at 3:23 PM, Brad Nicholes [EMAIL PROTECTED] wrote:
 
 Done.  Hopefully this will be a good guide for somebody that is getting 
 started with Python modules.  We still need to write an equivalent C module 
 doc.
 
 Thanks for the updates.
 
 BTW I am curious about the 'modules' section in the .pyconf -- is it
 really necessary?  My Python module has been running fine without that
 section (it only has the 'collection_group section').
 
 I also have a question regarding the metric_init parameters.  Is
 'format' really necessary since we could probably determine it from
 the 'value_type' -- i.e. value_type = uint, format = %u; value_type =
 string, format = %s, etc.
 
 Cheers,
 
 Bernard

The modules section is one of those things that is becoming more important as 
functionality grows.  Initially for a python metric module the modules section 
was really unnecessary.  The name directive is really not used if there are no 
module parameters, however there is some validation taking place against the 
language directive.  Also if your module requires any kind of configuration 
parameters, the only way to specify the parameters is through a modules 
section.  So even though your python module might run fine without a modules 
section today, it is probably a good idea to get used to adding it to the 
pyconf file just to make sure that you don't run into problems in the future.

Brad
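
As an illustration of the advice above, a .pyconf file with both a modules section and a collection_group might look like the fragment below. This is a sketch only. The module name "example" (assumed to correspond to a file example.py in the Python module directory), the parameter "SampleParam", and the metric name "my_metric" are hypothetical placeholders.

```
modules {
  module {
    name = "example"        # hypothetical; assumed to match example.py
    language = "python"     # the directive that is validated
    param SampleParam {     # hypothetical module parameter
      value = 600
    }
  }
}

collection_group {
  collect_every = 10
  time_threshold = 50
  metric {
    name = "my_metric"      # hypothetical; must be a name returned by metric_init
    title = "My Metric"
    value_threshold = 70
  }
}
```

Parameters declared this way are what the module receives in the params argument of metric_init, which is why a modules section becomes mandatory as soon as a module needs configuration.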




Re: [Ganglia-general] Plugging Metrics in 3.1.0 release

2008-06-05 Thread Brad Nicholes
 On 6/5/2008 at 5:30 PM, in message
[EMAIL PROTECTED], Bernard Li
[EMAIL PROTECTED] wrote:
 Hi Brad:
 
 On Thu, Jun 5, 2008 at 4:25 PM, Brad Nicholes [EMAIL PROTECTED] wrote:
 
 The modules section is one of those things that is becoming more important 
 as functionality grows.  Initially for a python metric module the modules 
 section was really unnecessary.  The name directive is really not used if 
 there are no module parameters, however there is some validation taking place 
 against the language directive.  Also if your module requires any kind of 
 configuration parameters, the only way to specify the parameters is through a 
 modules section.  So even though your python module might run fine without a 
 modules section today, it is probably a good idea to get used to adding it to 
 the pyconf file just to make sure that you don't run into problems in the 
 future.
 
 My suggestion would be to add code to make sure that the .pyconf file
 has the correct sections and parameters.  You can't always expect the
 users to do the right thing :-)
 

There really isn't any way to do that kind of validation.  Libconfuse will 
validate that a given directive is syntactically correct, but there isn't a way 
to validate that a modules section exists for each module, especially a python 
module.  If a module section does not exist for a C module, the C module will 
never load and gmond will fail if a collection group contains a metric that is 
not supported by any loaded module.  However a python module is different 
because it is run through mod_python.  Basically gmond has no idea that a 
python module even exists.  It thinks that all of the python metrics belong to 
mod_python which is a C module.  Mod_python loads any .py file that it finds in 
the python module directory and then queries each one for metric definitions 
which is a completely different code path than for a C module.

Brad
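
The loading behaviour described above can be approximated in plain Python. The real mod_python is a C module, so this is only a conceptual sketch under that caveat. discover_metrics and demo are hypothetical names, and the temporary module written by demo exists purely to make the example self-contained.

```python
import glob
import importlib.util
import os
import tempfile


def discover_metrics(module_dir, params=None):
    """Load every .py file in module_dir and collect its metric descriptors.

    Returns a dict mapping metric name -> descriptor. Files that do not
    define metric_init are skipped.
    """
    found = {}
    for path in sorted(glob.glob(os.path.join(module_dir, "*.py"))):
        name = os.path.splitext(os.path.basename(path))[0]
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        init = getattr(module, "metric_init", None)
        if init is None:
            continue  # not a metric module
        for desc in init(params or {}):
            found[desc["name"]] = desc
    return found


def demo():
    """Write a tiny module to a temp directory, discover it, call its metric."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "example.py"), "w") as fh:
            fh.write(
                "def metric_init(params):\n"
                "    return [{'name': 'demo_metric', 'call_back': lambda n: 42}]\n"
            )
        metrics = discover_metrics(tmp)
        return {n: d["call_back"](n) for n, d in metrics.items()}


if __name__ == "__main__":
    print(demo())
```

Because discovery is driven by the files on disk rather than by the configuration, nothing in this path can notice that a modules section is missing, which matches the limitation described in the reply.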




Re: [Ganglia-general] gmetad.conf for large number of nodes

2008-05-28 Thread Brad Nicholes
 On 5/27/2008 at 8:47 PM, in message [EMAIL PROTECTED], randy
[EMAIL PROTECTED] wrote:
 Brad Nicholes wrote:
 
 Is there a reason why you would want to list all 120 nodes in the
 data_source directive of gmetad?  When you list multiple nodes in a
 data_source directive, it does not mean that gmetad is pinging all of
 them for data.  It simply means that if gmetad can not get data from
 the first one in the list, it tries the next one.  
 
 Only because I thought I needed to. I'm doing the multicast, but I just 
 thought that all the nodes had to be listed.
 
 architecture, data is pushed from a monitored node to either a
 multicast channel or to a single gmond master node.  A gmetad
 data_source simply references one node that is listening on a
 multicast channel or the single gmond master node in the case of
 unicast.  There should never be a need to have more than just a few
 nodes listed in a data_source.
 
 So if all 120 nodes are talking on the multicast address, and gmond is 
 running on a different node on the same net, I can get by with just 
 giving localhosts as the data_source? Or any one (or more) of the data 
 nodes? I think that's how I left it today, and I was seeing reports from 
 30 of the nodes and the other 90 were listed as down. The gmond machine 
 is also serving the webpages, and all machines can see each other on the 
 network.
 
 Appreciate the help, thanks Brad!
 
 randy

I would suggest that you break up your 120 nodes into separate clusters on 
different multicast channels.  Then you would have a different data source for 
each cluster.  By putting them all on the same channel, every gmond agent is 
required to store all of the metric data for all 120 nodes.  That is a huge 
waste of memory.  I would suggest breaking them up into smaller clusters.  If 
that doesn't work for you, then you might want to move to unicast rather than 
multicast.  In unicast mode all of your gmond agents talk to one or more gmond 
master nodes directly rather than on a channel.  Then each of the master gmond 
nodes becomes a separate data source.  Search back through the email list 
archive for similar questions.  Optimal Ganglia architectures have been 
discussed previously including multicast vs unicast.

Brad 
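
A sketch of what the two layouts could look like in configuration follows. All cluster names, host names, and the number of clusters are hypothetical; only the directive names come from the standard gmetad.conf and gmond.conf.

```
# gmetad.conf: one data_source per cluster, each listing a couple of
# nodes to fall back on (not every node in the cluster).
data_source "rack1" rack1node1:8649 rack1node2:8649
data_source "rack2" rack2node1:8649 rack2node2:8649
data_source "rack3" rack3node1:8649 rack3node2:8649

# gmond.conf on an ordinary node, unicast mode: send to the master only.
udp_send_channel {
  host = rack1node1
  port = 8649
}

# gmond.conf on the master node: receive from the agents and answer gmetad.
udp_recv_channel {
  port = 8649
}
tcp_accept_channel {
  port = 8649
}
```

With this layout gmetad polls one master per cluster, and only the masters hold the metric state for their own cluster.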




Re: [Ganglia-general] gmetad.conf for large number of nodes

2008-05-27 Thread Brad Nicholes
 On 5/26/2008 at 7:18 AM, in message [EMAIL PROTECTED], randy
[EMAIL PROTECTED] wrote:
 I'm trying to configure ganglia (3.0.7) to monitor 120 nodes. It works 
 fine if I just enter a small number of nodes as data_source in the 
 gmetad.conf file, just like all the documentation shows. But if I try to 
 enter too many nodes, gmetad segfaults at startup.
 
 That is,
 
 data_source cloud rack1node1 rack1node2
 
 works just fine.
 
 But a data_source entry with all 120 nodes listed segfaults. I've also 
 tried using a backslash-newline to break the line apart with the same 
 result. My last attempt used multiple lines, all with the same cluster 
 name (cloud). I have about 10 lines like that, as many as it took to 
 list all the nodes. That doesn't segfault, but it only shows some of the 
 nodes (the last 30, which span multiple data_source entries).
 
 Admittedly it was Friday afternoon when I tried this, so I didn't spend 
 too much time debugging. But it's been bothering me all weekend. I 
 haven't seen any examples/docs with more than a couple nodes listed.
 

Is there a reason why you would want to list all 120 nodes in the data_source 
directive of gmetad?  When you list multiple nodes in a data_source directive, 
it does not mean that gmetad is pinging all of them for data.  It simply means 
that if gmetad can not get data from the first one in the list, it tries the 
next one.  In the Ganglia architecture, data is pushed from a monitored node to 
either a multicast channel or to a single gmond master node.  A gmetad 
data_source simply references one node that is listening on a multicast channel 
or the single gmond master node in the case of unicast.  There should never be 
a need to have more than just a few nodes listed in a data_source.  

Brad
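
The failover behaviour described above can be sketched in a few lines. This is an illustration of the idea, not gmetad's actual code; poll_data_source and the fake fetch function are hypothetical.

```python
def poll_data_source(hosts, fetch):
    """Return (host, xml) from the first host that answers.

    `hosts` is the ordered list from one data_source line and `fetch` is any
    callable that returns the XML for a host or raises OSError on failure.
    Later hosts are only tried when earlier ones fail.
    """
    for host in hosts:
        try:
            return host, fetch(host)
        except OSError:
            continue  # unreachable; fall through to the next listed host
    return None, None


def fake_fetch(host):
    """Stand-in for a network read: the first node is pretended to be down."""
    if host == "rack1node1":
        raise ConnectionRefusedError(host)
    return "<GANGLIA_XML/>"


if __name__ == "__main__":
    print(poll_data_source(["rack1node1", "rack1node2"], fake_fetch))
```

Since every node on the channel holds the same cluster state, any single answering host is enough, which is why two or three entries per data_source are normally sufficient.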




Re: [Ganglia-general] Migrating existing RRD's to a new server;

2008-05-21 Thread Brad Nicholes
This looks like a useful script.  Can we add it to the contrib area in the 
Ganglia repository?

Brad

 On 5/21/2008 at 9:51 AM, in message
[EMAIL PROTECTED], Jason A. Smith
[EMAIL PROTECTED] wrote:
 A few years ago I had put a script on ganglia's bugzilla that modifies
 the rrd files to do a few simple things, like change the heartbeat value
 and change the number of RRAs, see:
 
 http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=33 
 
 Since we are in the process of moving our gmetad server from an old
 32-bit server to a new 64-bit server, I also needed the ability to move
 the rrd files, which meant dumping and restoring them, so I added this
 feature to the old script that I had from before.  It is attached to
 this email in case anyone else might find it useful.
 
 ~Jason
 
 
 On Wed, 2008-05-21 at 08:10 -0700, Witham, Timothy D wrote:
 There may be a workaround thought: if you use 'rrdtool dump', you will
 get a (large) XML file with all of the data.  You should be able to
 then use 'rrdtool restore' to read this back into the new .rrd file.
 
 But since you probably have hundreds or thousands of rrd files, you need
 some automation.  So use a script like the below saved as dumprestore.
 Run dumprestore dump | /bin/sh from your rrd_rootdir on the old
 system.  This will write a .xml for each .rrd.  Then rsync all the .xml
 files to new machine and run dumprestore restore | /bin/sh.  After you
 confirm it is working you can find . -name '*.xml' -exec rm -f {} \; to
 clean up.
 
 -twitham
 
 #!/usr/bin/perl
 
 use warnings;
 
 sub oops {
     die "$0 dump|restore\n";
 }
 my $what = shift @ARGV or oops;
 my $in = 0;
 $in = 'rrd' if $what eq 'dump';
 $in = 'xml' if $what eq 'restore';
 oops unless $in;
 
 # Emit one rrdtool command (plus a touch to preserve the mtime) per file.
 open PIPE, "find . -name '*.$in' -print |" or die $!;
 while (<PIPE>) {
     chomp;
     my $in = $_;
     my $out = $in;
     $out =~ s/\.rrd$/\.xml/ or $out =~ s/\.xml$/\.rrd/;
     # dump writes XML to stdout, so redirect it; restore takes both names.
     print "rrdtool $what $in ", $out =~ /xml$/ ? '> ' : '', "$out\n";
     print "touch -r $in $out\n";
 }
 
 
 


Re: [Ganglia-general] strange problem - gmond on headnode reportsdifferent data than sources

2008-05-14 Thread Brad Nicholes
 On 5/13/2008 at 11:34 PM, in message
[EMAIL PROTECTED], Jeremy
LaTrasse [EMAIL PROTECTED] wrote:
 I changed our configs over to unicast, which has seemingly eliminated most of
 our problems, except one egregious one, and the log files are still being
 filled with
 
  illegal attempt to update using time 1210742641 when last update time is
 1210742641 (minimum one second step)
 
 The problem seems to be that the gmetad is not able to get new information
 from headnodes.
 
 In the case of 1210742641, the node had not reported to the headnode for 54
 sec, and therefore rrdtool on the gmetad could not be expected to update a
 file with the same information.
 
 Output from headnodes for that node confirms.
 
 <HOST NAME="HOSTNAME.twitter.com" IP="X.X.X.X" REPORTED="1210742641" TN="54"
 TMAX="20" DMAX="86400" LOCATION="1" GMOND_STARTED="1210742641">
 
 My question now is, why would gmond not be reporting for 54 sec?  The load
 on the machines that are taking longer than 20 sec to report consistently is
 lower than others in the cluster who report far more frequently.
 

Have you run gmond on that particular node in debug mode to verify that it is 
gathering and sending data correctly?  Have you hit that node directly on port 
8649 to make sure that it is generating correct XML output?  Have you run your 
head node gmond in debug mode to verify that it is only receiving packets from 
the problem node every 54 seconds?  Just trying to narrow down whether it is a 
problem with the node gmond, the head node gmond or gmetad or something 
in-between.
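
A small script can do the "hit the node directly on port 8649" check. The sketch below assumes the default tcp_accept_channel behaviour (gmond writes its XML dump on connect and then closes the connection); the function names are hypothetical.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Connect to a gmond TCP accept channel and read the XML dump until EOF."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def host_ages(xml_bytes):
    """Map host name -> (REPORTED, TN), both as integers."""
    root = ET.fromstring(xml_bytes)
    return {
        h.get("NAME"): (int(h.get("REPORTED")), int(h.get("TN")))
        for h in root.iter("HOST")
    }


# Example use against a live node (host name is a placeholder):
#   print(host_ages(fetch_gmond_xml("problem-node.example.com")))
```

Calling it a few times, a few seconds apart, against both the problem node and the head node shows which of the two stops advancing REPORTED.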

 Next, how can I change HOST TMAX if necessary? I've read the gmond.conf man
 page, the wiki, etc... seems like only location is configurable for host in
 the gmond.conf.
 

You can't.  For some reason TMAX is hardcoded to 20.  I'm not sure why.  The 
only way to change it would be to change the source and rebuild gmond.


Brad


 Again, system time is synchronized across all these machines to within .04
 seconds.
 
 
 Jeremy
 
 On Tue, May 13, 2008 at 3:50 PM, Bernard Li [EMAIL PROTECTED] wrote:
 
 Hi Jeremy:

 On Tue, May 13, 2008 at 1:49 PM, Jeremy LaTrasse [EMAIL PROTECTED]
 wrote:

  Where should I be going for comprehensive documentation that describes
 each
  of the stanzas in both gmond and gmetad config files, is there one
 standard
  document? I can't find one in the sourceforge wiki.

 For Ganglia 3.0.x, man gmond.conf is your best bet.  I checked and it
 talks about unicast.

 For gmetad -- the configuration options are pretty straightforward,
 and the comments in the standard gmetad.conf should be fairly
 self-explanatory.

 Cheers,

 Bernard







Re: [Ganglia-general] Ganglia reported wrong OS...

2008-05-12 Thread Brad Nicholes
 On 5/12/2008 at 9:09 AM, in message
[EMAIL PROTECTED], Tom Pierce
[EMAIL PROTECTED] wrote:
 Dear Ganglia Users,
 
 I upgraded some cluster nodes, from 32 bit OS RHEL4 to 64 bit RHEL5.1, but
 the ganglia node monitor still seems to remember that the old node (with the
 same name) was x86
 
From Ganglia: Software
 
 OS: Linux 2.6.18-53.el5 (x86)
 
 logging into the node, and using  uname -m
 
 x86_64- the correct installation..
 
 I deleted the rrd file for this node, but x86 was not stored there...
 
 So how can I fix this? Or is it a ganglia bug for going from 32 bits to 64
 bits on the same nodename?
 
 ---
 Thanks
 
 Tom

Did you restart Gmetad?  Since it really doesn't make any sense to store 
constant metrics in an RRD file, all of the constant metrics are stored and 
reported from memory by Gmetad.  I believe that constant metrics are refreshed 
occasionally and eventually the correct constant metric will show up.  However 
if you want to see it immediately, you will probably have to restart Gmetad.

Brad 




[Ganglia-general] Time to produce a 3.1 beta

2008-05-01 Thread Brad Nicholes
  The list has quieted down over the last week or so since we released the 3.1 
snapshot.  This either means that people are busy testing the 3.1 snapshot and 
haven't had time to respond yet or that things are good and there just isn't 
much to report.  The STATUS file contains one back port proposal that still 
needs another vote but other than that, I am thinking that it is time for us to 
produce Ganglia 3.1.1 RC1.  Keeping in mind the Release Early, Release Often 
motto, it's time to get a release out there in the next few weeks.

Brad




Re: [Ganglia-general] multiple gmetads polling single gmond

2008-04-25 Thread Brad Nicholes
   Gmond is single threaded.  However Gmetad is not when it produces the XML 
dump.  Would it be possible for you to use the Gmetad port rather than hitting 
Gmond directly?  If you hit the Gmetad interactive port you could request data 
for any of your individual clusters from your script.  Multi-threading Gmond 
output would be a nice thing to have.  I guess it hasn't really been an issue 
in the past because most people only hit Gmond from a single client.

Brad 
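
A minimal sketch of what "hitting the Gmetad interactive port" from a script
might look like.  The assumptions here are that gmetad listens on its usual
interactive port (8652 unless gmetad.conf says otherwise), that a request is a
single path line such as /ClusterName, and that the function names are purely
illustrative and not part of Ganglia:

```python
import socket


def build_query(cluster=None, host=None, summary=False):
    """Build one request line for gmetad's interactive port (assumed format)."""
    path = "/"
    if cluster:
        path += cluster
        if host:
            path += "/" + host
    if summary:
        path += "?filter=summary"
    return path + "\n"


def query_gmetad(address, query, port=8652, timeout=15.0):
    """Send one query and return the raw XML reply as bytes."""
    chunks = []
    with socket.create_connection((address, port), timeout=timeout) as sock:
        sock.sendall(query.encode("ascii"))
        while True:
            data = sock.recv(65536)
            if not data:          # gmetad closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks)
```

Asking for a single cluster this way keeps each reply small, which is the
point of going through gmetad rather than pulling the full dump from gmond.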

 Ben Hartshorne [EMAIL PROTECTED] 04/25/08 12:35 PM  
Hi,

I have a rather large set of machines I have ganglia watch (~6000), and
am trying to build out a resilient infrastructure.  I ran into an
interesting problem.

I am using gmond version 3.0.2.200511011714 (as reported by --version)

Basic layout - each location (~2000 machines) has a pair of hosts to
which they send their metrics (unicast).  There are a pair of machines
that connect to gmond on each of the edge collectors and centralize the
data (they connect via TCP to port 8649).  We also have another pair of
machines that connect to each edge gmond and grab the current XML dump
for integration with  Nagios (the script is called parse_ganglia for
future reference).

This worked nicely for quite a while, until one of our edge hosts got
too many reportees.  There was a connection timeout in parse_ganglia of
5 seconds, so that when one of the edge hosts was down it would move on
to the other edge hosts quickly rather than waiting 60s for the down
host.  When one of the hosts got too many reportees, it started to take
~6s to transfer all the data.  At this point, one or the other of the
pair of hosts running parse_ganglia started failing on the edge host
that had too many reportees.  

Using tcpdump, I found that though gmond was accepting the connection
from both of them, it would only send data to one at a time, and it would
complete sending data to the first before moving on to the second.  So:
* host a connects
* host a starts getting data
* host b connects (3-way handshake complete) but no data flows
* host a finishes getting data
* host b starts getting data
* host b finishes getting data

We solved the immediate problem by increasing the timeout from 5 to
15s., but I was a little surprised that gmond behaved in this
seemingly-single-threaded manner.
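
The timeout-and-failover behaviour described above could be sketched roughly
as follows.  This is not the actual parse_ganglia script; read_dump,
first_available and the 15 second default are illustrative assumptions, and
the deadline is applied to the whole transfer rather than to each read:

```python
import socket
import time


def read_dump(host, port=8649, deadline_s=15.0):
    """Read gmond's XML dump, giving up once deadline_s has elapsed in total."""
    end = time.monotonic() + deadline_s
    chunks = []
    with socket.create_connection((host, port), timeout=deadline_s) as sock:
        while True:
            remaining = end - time.monotonic()
            if remaining <= 0:
                raise socket.timeout("dump from %s took too long" % host)
            sock.settimeout(remaining)
            data = sock.recv(65536)
            if not data:          # gmond closes the socket after the dump
                return b"".join(chunks)
            chunks.append(data)


def first_available(hosts, fetch=read_dump):
    """Try each collector in order and return (host, xml) from the first success."""
    errors = {}
    for host in hosts:
        try:
            return host, fetch(host)
        except OSError as exc:    # covers timeouts and refused connections
            errors[host] = exc
    raise ConnectionError("no collector answered: %r" % errors)
```

Because a second client gets no data until the first one is finished, the
deadline has to cover the time spent waiting behind another connection as
well as the transfer itself.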

While it's easy for us to adjust the timeout in our python
parse_ganglia, it is not so easy to poke at gmetad, and I am worried
about what will happen when we have variations in network quality, more
hosts requesting metrics, etc.  

Is it true that gmond is single threaded in its network operations?  Or
maybe just the listener?  What other effects might this have?  

Would it make sense to change gmond so it passes off dumping the XML
feed to a child thread so that multiple simultaneous connections can be
handled?
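
To illustrate the idea only (gmond itself is written in C, and this is not its
code), here is a small Python sketch of a listener that hands each connection
to its own thread so that one slow client does not hold up the next.  The
payload, class names and the artificial delay are all made up for the example:

```python
import socket
import socketserver
import threading
import time

XML_DUMP = b"<GANGLIA_XML></GANGLIA_XML>\n"  # placeholder, not real gmond output


class DumpHandler(socketserver.BaseRequestHandler):
    def handle(self):
        time.sleep(0.05)          # stand-in for a slow transfer
        self.request.sendall(XML_DUMP)


class ThreadedDumpServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
    daemon_threads = True         # do not block interpreter exit
    allow_reuse_address = True


def start_server(host="127.0.0.1", port=0):
    """Start the listener in the background; port 0 picks a free port."""
    server = ThreadedDumpServer((host, port), DumpHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch(address, timeout=5.0):
    """Connect and read until the server closes the connection."""
    chunks = []
    with socket.create_connection(address, timeout=timeout) as sock:
        while True:
            data = sock.recv(4096)
            if not data:
                return b"".join(chunks)
            chunks.append(data)
```

With this layout the accept loop only dispatches, and each dump is written by
a separate thread, so hosts a and b in the trace above would receive data
concurrently.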

Thanks for your time,

-ben

-- 
Ben Hartshorne
email: [EMAIL PROTECTED]
http://ben.hartshorne.net






[Ganglia-general] Platform experts needed (was:Re: [Ganglia-developers] Ganglia 3.1.x stable branch has been created...)

2008-04-18 Thread Brad Nicholes
   So here is another request to all you platform experts out there.  The 
Ganglia project will be rolling alpha tarballs of the Ganglia 3.1 version.  If 
the tarball does not work on your platform, please fix it and submit a patch 
back to the project.  Ganglia 3.0.x already works on a variety of platforms and 
we would like to see 3.1.x work on an equal or greater number.  But we need 
platform experts to make this happen.  Here is your chance to jump in and help 
make Ganglia 3.1 the best release ever.

Brad

 On 4/18/2008 at 4:00 AM, in message [EMAIL PROTECTED], Carlo
Marcelo Arenas Belon [EMAIL PROTECTED] wrote:
 On Thu, Apr 17, 2008 at 03:10:37PM -0700, Bernard Li wrote:
 
 I haven't been following issues regarding building on non-Linux
 architectures like *BSD, Cygwin, Mac OSX, AIX,
 
 *BSD used to be able to build and most likely still does.  Cygwin builds
 only as a static build and runs.  Mac OSX (older than 10.5) and HPUX could
 be patched to build but won't work; the rest are unknown but most likely not
 able to build.
 
 Carlo






Re: [Ganglia-general] Fwd: [Beowulf] Performance metrics reporting

2008-04-11 Thread Brad Nicholes
 On 4/11/2008 at 1:53 PM, in message
[EMAIL PROTECTED],
Witham, Timothy D [EMAIL PROTECTED] wrote:
 So I'd like to ask the Ganglia community -- do you guys find Ganglia
to be a resource hog?
 
 No.  But once I had a couple hundred gmetad processes on a 2GB server.
 When the size of active processes and RRD files in tmpfs exceeds
 physical memory, the server begins swapping and can't keep up with the
 needed polling intervals.  Buy more memory or use more gmetad servers to
 solve this.  :-)
 
 And I don't really like that the XML is huge and mostly redundant and
 gets even larger in 3.1.  All gmetad needs is name and value to do
 correct metric rollups.  All units and other attributes appear to be
 ignored, except by the frontend.  It would be cool if, for example, we
 could define what a load_one is in one place, instead of thousands of
 machines reporting the same exact text every few seconds which seems
 like a waste of time and bandwidth.  I understand the benefits of XML,
 but perhaps the standard static attributes could be defined in gmetad
 instead of gmond.  This could reduce the XML size considerably and make
 it more efficient.  But this would require a big change; just an idea to
 think about...
 

I agree that the size of the XML could be reduced in most cases, however it 
would be impractical to define the metrics in gmetad.  The reason is the new 
pluggable metric modules in 3.1.  Since gmond can be extended by plugging in 
metric modules, there would be no way for gmetad to know about every metric 
definition that could possibly exist.  With the pluggable interface there is 
no longer just a fixed set of metrics.  Any gmond could be gathering metrics 
about anything.

However during the developers meeting in Feb. we talked about an idea where the 
XML would only contain deltas rather than always sending everything.  Somebody 
just needs to figure out how to make it work.

Brad



Re: [Ganglia-general] Fwd: [Beowulf] Performance metrics reporting

2008-04-11 Thread Brad Nicholes
 On 4/11/2008 at 4:09 PM, in message
[EMAIL PROTECTED], Bernard Li
[EMAIL PROTECTED] wrote:
 Hi Brad:
 
 On Fri, Apr 11, 2008 at 3:04 PM, Brad Nicholes [EMAIL PROTECTED] wrote:
 
  I agree that the size of the XML could be reduced in most cases, however it 
 would be impractical to define the metrics in gmetad.  The reason why is 
 because of the new metric pluggable modules in 3.1.  Since gmond can be 
 extended by plugging in metric modules, there would be no way for gmetad to 
 know about every metric definition that could possibly exist.  With the 
 pluggable interface there is no longer just a fixed set of metrics.  Any 
 gmond could be gathering metrics about anything.
 
 How about reducing the amount of XML being sent from a gmetad to an
 upstream gmetad like what I suggested in this mail?
 
 http://www.mail-archive.com/[EMAIL PROTECTED]/msg03941.html
 
 Cheers,
 
 Bernard

Yes, I think we need something like that for gmetad.  What I was thinking is to 
add another filter type, something like ?filter=delta, that would just project 
the delta since the last dump.  If both gmond and gmetad had a way to reduce the 
XML by just producing deltas, I think that would speed up the XML parsing a lot 
and also reduce the writes on the RRDs.

Brad
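
As a rough illustration of what "just producing deltas" could mean, here is a
small sketch that compares two snapshots of metric values and keeps only what
changed.  Nothing like this exists in Ganglia as far as this thread says; the
function names and the plain-dictionary snapshot format are assumptions made
for the example:

```python
_MISSING = object()  # sentinel so a metric whose value is None still compares


def metric_delta(previous, current):
    """Return (changed, removed) between two {metric_name: value} snapshots."""
    changed = {name: value for name, value in current.items()
               if previous.get(name, _MISSING) != value}
    removed = sorted(name for name in previous if name not in current)
    return changed, removed


def apply_delta(previous, changed, removed):
    """Rebuild the current snapshot on the receiving side."""
    merged = dict(previous)
    merged.update(changed)
    for name in removed:
        merged.pop(name, None)
    return merged
```

The receiver has to remember the last full snapshot for this to work, which
is probably the hard part of fitting it into gmond and gmetad.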




Re: [Ganglia-general] [Ganglia-developers] Time to create the 3.1.x stable branch...

2008-03-17 Thread Brad Nicholes
 On 3/13/2008 at 3:46 PM, in message [EMAIL PROTECTED], Brad
Nicholes [EMAIL PROTECTED] wrote:
 On 3/13/2008 at 2:16 PM, in message
 [EMAIL PROTECTED], Jesse Becker
 [EMAIL PROTECTED] wrote:
 On Thu, Mar 13, 2008 at 3:42 PM, Brad Nicholes [EMAIL PROTECTED] wrote:
I think that with the removal of the srclib directory from the SVN trunk 
 repository, we have completed everything that we thought needed to be done 
 before creating the 3.1.x stable branch.  The only other thing that I know of 
 is testing to make sure that an older 3.0.x gmetad can consume the XML data 
 from a newer 3.1.x gmond.  Has anybody had a chance to test this?
 
 Were we going to try and make gmond-3.1.x backwards compatible with 
 gmond-3.0.x?
 
 
 Only at the cluster level.  In other words, the XML that is produced by a 
 newer gmond should be consumable by an older gmetad and vice-versa.  This 
 would allow users to upgrade from 3.0.x to 3.1.x on a cluster by cluster 
 basis.  All nodes within a cluster would have to upgrade to gmond 3.1.x at 
 the same time.  But the whole grid would not have to upgrade.  Theoretically, 
 this should already work. We just need to test it to make sure.
 

So I just set up two clusters here.  One that is running pure 3.0.x and another 
that is running pure 3.1.x with their respective web frontends reporting the 
cluster data.  Then I crossed the two clusters by declaring new data_sources in 
the gmeta.conf file of each gmetad/web frontend servers.  I can hit the web 
frontend of either server and view the RRD graphs for both clusters.  In other 
words, the 3.1.x gmetad and frontend is able to acquire data from a 3.0.x 
cluster and a 3.0.x gmetad and frontend is able to acquire data from 3.1.x 
cluster.  So the cluster by cluster migration from 3.0.x to 3.1.x should work 
just fine.

Brad




Re: [Ganglia-general] [Ganglia-developers] Time to create the 3.1.x stable branch...

2008-03-15 Thread Brad Nicholes
 On 3/14/2008 at 1:35 AM, in message [EMAIL PROTECTED], Carlo
Marcelo Arenas Belon [EMAIL PROTECTED] wrote:
 On Thu, Mar 13, 2008 at 01:42:05PM -0600, Brad Nicholes wrote:
I think that with the removal of the srclib directory from the SVN trunk 
 repository, we have completed everything that we thought needed to be done 
 before creating the 3.1.x stable branch.
 
 Agree (sorta), as I was expecting also as part of this removal to have all
 #ifdefs for compatibility with apr 0.9.x removed and which will otherwise be
 problematic to support without intrusive changes going forward. 
 

Right.  I'll get the #ifdef's removed as well.

I am proposing that we create the 3.1.x branch on Monday (3/17).
 
 Considering how intrusive the changes are it will be better to delay that for
 a little longer, otherwise we will find ourselves destabilizing the branch
 with fixes required to get the snapshots in good shape to be used for testing
 (as ganglia now barely builds and has known problems running in some of the
 popular supported architectures).
 

OK, then let's get some snapshot tarballs going on TRUNK and get this stuff 
worked out.  If we delay a week, will that be long enough or is there more to 
it than I am seeing?


 Once we create the branch, I would suggest that we also take a snapshot 
 tarball and get moving on testing and stabilization.
 
 this I suggest we release as our first ever alpha for 3.1.x, but we rather
 be sure it builds and runs well enough for users to be able to use it and
 help stabilize the 3.1 release.
 

+1

Brad




[Ganglia-general] Time to create the 3.1.x stable branch...

2008-03-13 Thread Brad Nicholes
   I think that with the removal of the srclib directory from the SVN trunk 
repository, we have completed everything that we thought needed to be done 
before creating the 3.1.x stable branch.  The only other thing that I know of 
is testing to make sure that an older 3.0.x gmetad can consume the XML data 
from a newer 3.1.x gmond.  Has anybody had a chance to test this?  

   I am proposing that we create the 3.1.x branch on Monday (3/17).  This 
should give people the weekend to tidy up anything that is left before we 
create the branch and start working towards a stable 3.1.x release. This will 
include all of the web frontend changes that have recently been committed to 
TRUNK.  Are there SPEC file changes that need to go in to support the modular 
web frontend stuff or was that already done?   Once we create the branch, I 
would suggest that we also take a snapshot tarball and get moving on testing 
and stabilization.  We are going to need the help from everybody whether you 
are a Ganglia developer or just a Ganglia user to make sure that we have a 
stable 3.1.x release.  

Of course once we do create the branch, the branch will be under RTC rules 
http://ganglia.wiki.sourceforge.net/ganglia_works .

Any comments?

Brad




Re: [Ganglia-general] additional info about fsock open error

2008-02-15 Thread Brad Nicholes
 On 2/15/2008 at 9:34 AM, in message
[EMAIL PROTECTED], Mike Olson
[EMAIL PROTECTED] wrote:
 Just an FYI, I have the ports 8649 to 8652 forwarded on my router to my
 Apache web server.  I have looked at the file on line 283 and I don't know
 what part of that line is creating the error.  The line is below:
 
 $fp = fsockopen( $ip, $port, $errno, $errstr, $timeout);
 
 Thanks,
 Mike

Those ports should be forwarded to gmond (8649) and gmetad (8651,8652).  Take a 
look at your gmond.conf and gmetad.conf file where these ports are configured.  
If you have those ports forwarded to the Apache server, Apache won't know what 
to do with the requests.  In fact what is happening is that you have just 
looped back from the PHP code running under Apache to the Apache server itself. 
 

Brad




Re: [Ganglia-general] new property of a host

2008-01-31 Thread Brad Nicholes
 On 1/31/2008 at 4:20 PM, in message
[EMAIL PROTECTED], Doug Nordwall [EMAIL PROTECTED]
wrote:
 For reference, here is a current HOST line of XML
 
 <HOST NAME="mybox.local" IP="10.1.1.1" REPORTED="1201820930" TN="7"
 TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1200935314">
 
 We're looking at making some patches to ganglia to do one of the  
 following things:
 
 1) replace the _value_ with the value of the job_id, and modifying the  
 code so that you can update the location. This would also include the  
 option to keep the location as it is written currently, or spoof it  
 with a job_id (or arbitrary value)
 2) add a new field to the HOST entry called job_id and allow it to be  
 updated without restarting ganglia.
 
 Sadly, we can't use a job_id as a metric (well, we could but it's more of  
 a pain in other parts of our code). For the purposes of our machine, a  
 machine is dedicated to a single job. The ultimate idea is to be able  
 to drop whatever we like into that host field and have it change  
 without an entire restart of ganglia.
 
 Do any of the main ganglia dev folks have an opinion here? We'd be  
 looking to submit this back into the main stream once we are done.
 
 As always, I'm open to be told I'm wrong or it has been done :)
 

I'm not sure that I understand what you are trying to do.  From the 
description, I am assuming that you are trying to associate a job_id with a 
host.  The easiest way to do that would be to just create a metric module that 
returns a constant job_id string as a metric.  Yet, you mentioned that for some 
reason you can't do that.  Could you explain why?  Otherwise just adding a 
job_id attribute to the HOST doesn't sound very useful except in your case.  
Is there some general benefit to adding a job_id attribute?

Brad
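
For what it's worth, a metric module along those lines might look roughly like
the sketch below.  It follows the general shape of the 3.1 Python module
interface (metric_init returning a list of descriptors, plus metric_cleanup),
but the job_id_file parameter, its default path and the "jobs" group are
assumptions made up for this example.  Reading the file on every callback
means the value can change without restarting gmond:

```python
_params = {"job_id_file": "/var/run/job_id"}  # hypothetical location


def job_id_handler(name):
    """Return the current job id as a string, or "none" if it is not set."""
    try:
        with open(_params["job_id_file"]) as fh:
            return fh.read().strip() or "none"
    except (IOError, OSError):
        return "none"


def metric_init(params):
    """Called once by gmond; params come from the module's config section."""
    _params.update(params)
    descriptor = {
        "name": "job_id",
        "call_back": job_id_handler,
        "time_max": 90,
        "value_type": "string",
        "units": "",
        "slope": "zero",
        "format": "%s",
        "description": "Job currently assigned to this host",
        "groups": "jobs",
    }
    return [descriptor]


def metric_cleanup():
    """Nothing to release in this sketch."""
    pass
```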




Re: [Ganglia-general] scaling_max_freq error

2008-01-28 Thread Brad Nicholes
 On 1/25/2008 at 7:21 PM, in message
[EMAIL PROTECTED], Jesse Becker
[EMAIL PROTECTED] wrote:
 On Jan 25, 2008 9:06 PM, Bernard Li [EMAIL PROTECTED] wrote:
 Hi Jesse:

 On 1/25/08, Jesse Becker [EMAIL PROTECTED] wrote:

  Interesting.
 
  How about introducing a new metric:  cpu_speed_current.  For systems
  without cpuscaling, it is forced to the same value of cpu_speed, or
  not sent at all, since it is completely redundant.  Otherwise, it
  reports the current CPU speed, as determined by /proc or /sys.

 I believe the CPU speed is collected once at startup -- what if the
 CPU speed was clocked down (or up) after gmond started and this
 current speed was collected -- then it no longer is current.
 
 That's my point.  The current behavior is to collect the speed once, and
 if it is potentially variable, report the maximum speed.  That's fine
 behavior.
 
 The other option is to collect this metric periodically, but probably
 only do this if scaling is enabled.
 
 Yep, that works too.  It certainly doesn't need to be collected/sent
 frequently.  I'd think that once a minute is probably ample.
 
 But do people really need this feature? :)
 
 Actually, I can think of a good use for it:  proving reduced power
 consumption.  If you can point to a chart that shows you are actively
 throttling CPUs, that's a bonus for the people with clipboards and
 checkbooks. :-)
 
 This is all pretty minor stuff though.


Sounds like a fairly simple Python module to write.  Just check for the 
scaling_max_freq file existence and then read it and report it.  If it doesn't 
exist then either just return 0 or default to cpuinfo.

Any takers?

Brad
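
A rough sketch of such a module, under a few assumptions: the sysfs path below
is the usual location for cpu0 on Linux, the file reports kHz, and the metric
name cpu_speed_max and the fallback order are illustrative choices rather than
anything agreed on in this thread:

```python
SCALING_FILE = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq"
CPUINFO_FILE = "/proc/cpuinfo"


def read_scaling_mhz(path=SCALING_FILE):
    """Return the scaling_max_freq value in MHz, or None if unavailable."""
    try:
        with open(path) as fh:
            return int(fh.read().strip()) // 1000   # file is assumed to be in kHz
    except (IOError, OSError, ValueError):
        return None


def read_cpuinfo_mhz(path=CPUINFO_FILE):
    """Fall back to the first "cpu MHz" line in cpuinfo, or 0 if none is found."""
    try:
        with open(path) as fh:
            for line in fh:
                if line.lower().startswith("cpu mhz"):
                    return int(float(line.split(":", 1)[1]))
    except (IOError, OSError, ValueError, IndexError):
        pass
    return 0


def cpu_speed_handler(name):
    """Callback: prefer the scaling file, otherwise default to cpuinfo."""
    mhz = read_scaling_mhz()
    return mhz if mhz is not None else read_cpuinfo_mhz()


def metric_init(params):
    return [{
        "name": "cpu_speed_max",      # hypothetical metric name
        "call_back": cpu_speed_handler,
        "time_max": 90,
        "value_type": "uint",
        "units": "MHz",
        "slope": "both",
        "format": "%u",
        "description": "Maximum CPU speed allowed by frequency scaling",
        "groups": "cpu",
    }]


def metric_cleanup():
    pass
```

Collecting it about once a minute, as suggested earlier in the thread, would
be a matter of the collection interval set for the module in gmond.conf.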



