Re: [Ganglia-general] [Ganglia-developers] [ANNOUNCEMENT] Ganglia meetup Tue Oct 21 in San Francisco (Quantcast HQ)
Also forgot to mention we have a code for FREE parking thanks to our friends at zirx [1], so if you are driving to SF for this meeting (like I am planning to do) all you really need is to install their app, set the right address for Quantcast on the map and tap the price to enter your code: GANGLIA. someone will then be waiting for you at the door and will take your car to a safe place.

see you all on the other side, and let's have fun

Carlo

[1] http://zirx.com/

PS. you need an iPhone or Android phone to use their app though

___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] Having to restart gmond on sender nodes if a collector node restarts
if using unicast and ganglia 3.1.0 or higher, you MUST set this value to the minimum time (in seconds) you can tolerate for a restart to recover automatically, like 3.0 (and lower) used to be able to do.

this feature is documented at the bottom (the fifth bullet point under Important Notes) of the following (no longer maintained) page:

http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_release_notes

Carlo

On Fri, May 04, 2012 at 02:30:31PM -0700, Gilles Devaux wrote:
> Have you tried setting send_metadata_interval = Something in the gmond.conf (globals section)?
>
> On Fri, May 4, 2012 at 1:58 PM, Pronko, Eric pron...@upmc.edu wrote:
>> I'd be interested in learning anything you hear back if you don't mind. It's not the worst thing but if it goes unnoticed it could be problematic.
>>
>> Thank you
>>
>> From: Jochen Hein [joc...@jochen.org]
>> Sent: Friday, May 04, 2012 4:23 PM
>> To: Pronko, Eric
>> Cc: ganglia-general@lists.sourceforge.net
>> Subject: Re: [Ganglia-general] Having to restart gmond on sender nodes if a collector node restarts
>>
>> Pronko, Eric pron...@upmc.edu writes:
>>> I have run into a situation where I restart a GMOND collector node, and then all sending node data isn't received until I also restart GMOND on those nodes. Is this intentional - is there any way to work around this so that I can safely reboot a collector node without having to restart GMOND on all senders?
>>
>> I have that issue too. We are running gmond 3.1.7 on AIX (the packages of Michael Perzl) in unicast mode. I've already talked to Michael about it, but he has no idea. I'm going to try running gmond under strace/truss and maybe we can see something.
>>
>> Jochen
>> --
>> The only problem with troubleshooting is that the trouble shoots back.
Re: [Ganglia-general] Having to restart gmond on sender nodes if a collector node restarts
On Fri, May 04, 2012 at 04:47:22PM -0700, Steven A. DuChene wrote:
> OK, but what is a reasonable value to set this to?

there is none, which is why there is no default.

ideally you shouldn't need it, just like it wasn't needed before 3.1, but in order to support metadata information the protocol had to be changed, and there was no other solution than to add this as a workaround when the unicast configuration was reported broken by it.

once a better solution is found (which will likely break the gmond protocol again) it shouldn't be needed, but meanwhile telling gmond to periodically resend the metadata information for metrics is the only way to deal with a collector that went down and, once restarted, receives (and ignores) metric data without matching metadata information to use.

> 30 seconds? 2minutes? 10minutes?

it depends on how much bandwidth/contexts you are willing to sacrifice for getting the data you need in case of a collector failure, and therefore on which kind of setup you are running ganglia in.

in HPC (where ganglia started) this might be a showstopper and prevent some people from upgrading past 3.0, but in IT-like environments I've seen people use 60 and hope the extra work gets lost in the noise.

I'd recommend you do your own measurements and decide based on:

* number of nodes you have
* number of metrics per node
* how much metadata on each metric
* how much CPU/bandwidth you can spare on each node for gmond
* how many collectors you have and how many gmond they are all serving (peak)
* how much CPU/bandwidth each collector can use
* how sensitive you are to having data holes and restarting gmond instead

Carlo
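As a rough aid for the measurement suggested above, the factors in that list can be combined into a back-of-the-envelope estimate. This is only a sketch: the function name and the example numbers are made up for illustration, it ignores UDP/IP header overhead and CPU cost, and real metadata sizes have to be measured on your own cluster.

```python
def metadata_resend_rate(nodes, metrics_per_node, metadata_bytes_per_metric,
                         send_metadata_interval):
    """Approximate extra bytes per second arriving at ONE collector
    because every sender repeats its metric metadata periodically.

    All inputs are assumptions supplied by the caller; nothing here is
    read from gmond itself.
    """
    if send_metadata_interval <= 0:
        # 0 means no periodic resend, so no recurring overhead.
        return 0.0
    per_node = metrics_per_node * metadata_bytes_per_metric / send_metadata_interval
    return nodes * per_node


if __name__ == "__main__":
    # Hypothetical cluster: 100 nodes, 30 metrics each, ~200 bytes of
    # metadata per metric, compared across a few candidate intervals.
    for interval in (30, 60, 120, 600):
        rate = metadata_resend_rate(100, 30, 200, interval)
        print(f"send_metadata_interval={interval:>4}s -> ~{rate:.0f} bytes/s")
```

Comparing the result against the bandwidth each collector can spare gives a first idea of which interval is affordable before testing it for real.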
Re: [Ganglia-general] [Ganglia-developers] 3.3.5 released today
On Thu, Apr 05, 2012 at 12:52:21AM +0200, Daniel Pocock wrote:
> A number of bugs were found during the testing of 3.3.5 and discussed on the mailing lists.

could a list of these bugs be published somewhere with the release, so that everyone knows what to expect when upgrading (most people are probably still using a patched 3.1.7, as that is what is provided by most distributions)?

off the top of my head there are:

* 2 memory leaks (one probably only in deaf mode)
* gmetad hierarchical mode is broken

> In other words, anyone who is using 3.3.1 or 3.3.0 should not get any new bugs from upgrading to 3.3.5

considering that 3.3.5 doubled the in-memory size of each metric, it is likely to make the memory leaking problems worse though

Carlo
Re: [Ganglia-general] Ganglia for Windows
On Wed, Mar 28, 2012 at 12:49:00PM +0100, Burton, Steven wrote:
> Should I try those binaries or should I build a more recent version and if so, what version?

3.0 and 3.1 are not compatible, so you either:

1) build new binaries for 3.1.7 or newer and deploy them on windows
2) downgrade your server to 3.0.7 (notice you need a couple of patches on top of it for security)

Carlo
Re: [Ganglia-general] tcpconn.py and netstat
On Mon, Feb 27, 2012 at 11:12:39AM -0500, Chris Burroughs wrote:
> Currently tcpconn.py uses netstat to get its socket stats. This gives lots of detail but is far too slow for much production use (running netstat can take many minutes).

tcpconn.py was originally written as an advanced python module example, as it shows how to do multithreaded modules and how to poll metric information from an external source.

it was found useful enough that some people enabled it where it made sense for their environments (definitely not in HPC, or high traffic), but since it is a module it can be replaced with something that fits your environment better, like the proposal you had.

> /proc/net/sockstat gives less information but has no performance problems. There was a suggestion previously to use the ss command, but (1) it's less common (at least not part of the default on RHEL5) and (2) it also lacks the high fidelity details. Is there any other reason to prefer ss over cat? Should this replace tcpconn, or be a new module?

most likely (for the reasons explained above) it would be a new module and, if you are concerned about performance, most likely in C

Carlo
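To make the /proc/net/sockstat proposal concrete, here is a minimal sketch of what such a module could look like in Python. It is not the module that was proposed in the thread: the metric names and the `sockstat_` prefix are hypothetical, and the descriptor fields follow the shape used by the Python modules shipped with gmond as far as I remember it, so check them against an existing module before relying on them.

```python
SOCKSTAT_PATH = "/proc/net/sockstat"
PREFIX = "sockstat_"  # hypothetical metric name prefix


def parse_sockstat(text):
    """Turn lines such as 'TCP: inuse 10 orphan 0 tw 3' into
    {'tcp_inuse': 10, 'tcp_orphan': 0, 'tcp_tw': 3}."""
    stats = {}
    for line in text.splitlines():
        proto, sep, rest = line.partition(":")
        if not sep:
            continue  # not a 'NAME: key value ...' line
        fields = rest.split()
        # Fields alternate between a key and its integer value.
        for key, value in zip(fields[0::2], fields[1::2]):
            stats[f"{proto.strip().lower()}_{key}"] = int(value)
    return stats


def read_metric(name):
    """Callback handed to gmond: one cheap file read per call."""
    with open(SOCKSTAT_PATH) as fh:
        return parse_sockstat(fh.read()).get(name[len(PREFIX):], 0)


def metric_init(params):
    """Entry point gmond calls to learn which metrics the module offers."""
    wanted = ("tcp_inuse", "tcp_tw", "tcp_orphan", "udp_inuse")
    return [{
        "name": PREFIX + key,
        "call_back": read_metric,
        "time_max": 60,
        "value_type": "uint",
        "units": "sockets",
        "slope": "both",
        "format": "%u",
        "description": f"{key} from {SOCKSTAT_PATH}",
        "groups": "network",
    } for key in wanted]


def metric_cleanup():
    """Nothing to release: no threads or child processes are used."""
```

Unlike tcpconn.py there is no netstat child process and no background thread, which is where the performance difference would come from; the price is the coarser detail already mentioned in the thread.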
Re: [Ganglia-general] gmond too many open files
On Fri, Feb 24, 2012 at 04:30:09PM -0800, M. Leong Lists wrote:
> Is this a bug in the app not closing those files

most likely a module, but to pinpoint which one you would need (assuming you are running linux) the output of:

    $ sudo lsof -n -p `pidof gmond`

Carlo
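If lsof is not installed, roughly the same information can be gathered on Linux by reading the process's fd directory under /proc. The sketch below is an illustration only: the function name is made up, the `proc_root` parameter exists just to make it easy to try against a fake directory, and it needs enough privileges to read the fd directory of the gmond process.

```python
import os
from collections import Counter


def open_fd_targets(pid, proc_root="/proc"):
    """Count what the open file descriptors of `pid` point at.

    Returns a Counter mapping each target (a file path, 'socket:[inode]',
    'pipe:[inode]', ...) to the number of descriptors referring to it.
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    targets = Counter()
    for name in os.listdir(fd_dir):
        try:
            targets[os.readlink(os.path.join(fd_dir, name))] += 1
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets


if __name__ == "__main__":
    import sys
    # Usage: python fdcount.py <pid of gmond>
    for target, count in open_fd_targets(int(sys.argv[1])).most_common(10):
        print(f"{count:6d}  {target}")
```

A target that keeps growing between two runs (the same file opened hundreds of times, or an ever increasing number of sockets) usually points at the module that is leaking.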
Re: [Ganglia-general] new Web interface ETA?
On Tue, May 10, 2011 at 08:28:53AM -0500, Alex Dean wrote:
> Is anyone using lighttpd? I'm sure we can come up with some configuration instructions for it as well if desired.

yes, usually most users of lighttpd would use PHP through the FastCGI interface, but that shouldn't make much difference to the way the web application is configured.

having simple access to the tip of development in a packaged form (snapshots), together with instructions on how to obtain it, would most likely help increase testing and stabilization though.

along those lines, would this be a good candidate for a snapshot?

https://github.com/vvuksan/ganglia-misc/tarball/master

and where is the documentation for it being collected?

Carlo
Re: [Ganglia-general] gmetric and groups
On Thu, Apr 21, 2011 at 11:09:12AM -0400, Jesse Becker wrote:
> On Thu, Apr 21, 2011 at 10:11, Rushton Martin jmrush...@qinetiq.com wrote:
>> Is there any way to define which groups a gmetric collected statistic is put into? I cannot see any way defined on the man page, and the mail
>
> So far as I know, this isn't supported by gmetric. Groups were added after the core gmetric code was written, and that ability hasn't been added back to gmetric (yet).

gmetric in trunk has that functionality. it hasn't been backported yet, but it is available and could easily be added (indeed there is even a backport proposal for it that was voted to be merged but hasn't been), so getting a ~3.1.7 gmetric with this functionality is very straightforward.

Carlo
Re: [Ganglia-general] How to locallize the web-frontend?
On Tue, Mar 22, 2011 at 01:43:02PM +0800, wrote:
> Hey, I am Chinese and I like ganglia very much, so I want to translate the web-frontend from English to Chinese. How can I do that?

That would be awesome indeed, but I am afraid that the PHP code used in the frontend is not really structured for doing localization work in a non-programmatic way (like using GNU gettext and message catalogs).

If all you need is a Chinese version of the frontend and you have some basic PHP skills, then it should be straightforward to translate the text of the frontend by editing the template and PHP files, but that would then be very difficult to integrate back into ganglia for future changes.

if you have some more advanced PHP skills, then adapting the code to be L10N friendly would be a better approach, and might also require some deeper changes in gmond/gmetad, for which C/Python skills would be needed.

what is your profile, and what would you be more interested in doing?

Carlo
Re: [Ganglia-general] Master and Clients not visible at the same time
On Mon, Jul 05, 2010 at 10:49:22AM +0200, Wim De Geeter wrote:
> Any one an idea why ??

because the master and the nodes are in 2 different clusters, for any of the following reasons:

0) they are in different network segments and running by default in multicast, so packets between them are getting dropped at your switch
1) you are using different multicast addresses in their configuration
2) you have some other kind of firewall that is blocking packets either way.

if you really want them all in the same cluster, you probably want to change the configuration in the nodes/master to use unicast and point them to the master.

the manual page for gmond.conf, the README and some of the links on the wiki are good sources for how to do that, but it is usually as simple as changing udp_send_channel and udp_recv_channel as shown below and restarting your gmond.

    udp_send_channel {
      host = master
      port = 8649
    }

    udp_recv_channel {
      port = 8649
    }

Carlo
Re: [Ganglia-general] newbie install of 3.1.7
On Tue, Jun 22, 2010 at 11:16:54AM -0700, Deb Heller-Evans wrote:
> In our set up, I am configuring gmond for unicast communication, and have set up gmond.conf on the nodes to have the following:
>
>     52 udp_recv_channel {
>     53   host = 198.129.76.131
>     54   port = 8649
>     55 }
>
> BUT, when starting gmond on the node, gmond complains:
>
>     [108#] service gmond start
>     Starting GANGLIA gmond: /etc/ganglia/gmond.conf:53: no such option 'host'
>     Parse error for '/etc/ganglia/gmond.conf'
>     [FAILED]
>
> I'm a little puzzled by this. Could someone point me in the right direction?

man gmond.conf would show you there is no host option for udp_recv_channel, but the option you are looking for is probably bind, which tells ganglia to bind to a specific IP for the unicast listener.

Carlo
Re: [Ganglia-general] Ganglia + Windows - Compilation Problems?
On Wed, Jun 23, 2010 at 01:40:52PM -0500, Douglas Wagner wrote:
> So I build libconfuse on Cygwin on my local XP development box and it gets stuck into /usr/local/* (lib, include, etc.).

is it libconfuse 2.7 compiled as a static library and with no nls support, as suggested in README.WIN? is this using cygwin 1.5 on 32bit windows or are you using 1.7?

> Come back around (according to the README.WIN) and tell ganglia to compile --with-libconfuse=/usr/local and it blows up telling me it can't find libconfuse.

config.log would explain why, but I hope it is not that you are trying to build it for 64bit windows.

> Linking everything into /usr/lib doesn't help either. I've seen docs on this but assumed it was supposed to be fixed in 3.1.2.

not sure what you are referring to here, but are you trying to build 3.1.7? I noticed the README.WIN document does not mention the need to override sysconfdir (which is irrelevant for cygwin anyway) and was not completely updated when the libpcre dependency was added (which also changed name recently in cygwin) for that release, but it used to work at least with 3.1.4 from what I remember and therefore probably also for 3.1.2.

the following seemed to work for me on an updated windows vista laptop I had access to and with the latest cygwin (mostly using instructions from README.WIN and against the recommendation of sticking with 1.5, which therefore requires some patching):

    $ tar -xvzf confuse-2.7.tar.gz
    $ cd confuse-2.7
    $ ./configure --disable-nls
    $ make
    $ make install
    $ cd ..
    $ tar -xvzf ganglia-3.1.7.tar.gz
    $ cd ganglia-3.1.7
    $ find . -type f -name "*.h" -a ! -name "config.h" -exec fgrep -l "rpc/rpc.h" {} \; | xargs -n1 perl -pi -e 's;#include <rpc/rpc.h>;#include <cygwin/in.h>\n#include <rpc/rpc.h>;g'
    $ ./configure GANGLIA_ACK_SYSCONFDIR=1 --with-libconfuse=/usr/local --enable-static-build
    $ make
    $ cd ..
    $ mkdir dist
    $ cp -a ganglia-3.1.7/gmond/gmond.exe dist/
    $ cp -a ganglia-3.1.7/gmetric/gmetric.exe dist/
    $ cp -a ganglia-3.1.7/gstat/gstat.exe dist/
    $ cd confuse-2.7
    $ make uninstall
    $ cd ..
    $ rm -rf confuse* ganglia*

the binaries in dist will need to be installed on the other nodes, probably together with the corresponding cygwin DLLs they were built with if cygwin won't be installed independently (cygwin1.dll, cygapr-1-0.dll, cygexpat-1.dll, cygpcre-0.dll, and libpython2.6.dll).

the following dependencies were installed as prerequisites on the system that was used for building this package (listed with `cygcheck.exe -c -d`):

    diffutils        2.9-1
    expat            2.0.1-1
    libexpat1        2.0.1-1
    libexpat1-devel  2.0.1-1
    gcc              3.4.4-999
    gcc-core         3.4.4-999
    gcc-g++          3.4.4-999
    gcc-mingw-core   20050522-1
    gcc-mingw-g++    20050522-1
    libgcc1          4.3.4-3
    libapr1          1.4.2-1
    libapr1-devel    1.4.2-1
    make             3.81-2
    libpcre-devel    8.02-1
    libpcre0         8.02-1
    python           2.6.5-2
    sharutils        4.8-1
    sunrpc           4.0-3

Carlo
Re: [Ganglia-general] ganglia monitors does not work for some of the clusters
On Thu, Jun 24, 2010 at 07:37:25AM +0200, Raimund Eimann wrote:
> I have exactly the same issue with version 3.1.7. When I restart gmond on the affected nodes, their graphs work for some time (1-2 days typically). I use CentOS 5.{4,5} on my nodes. Usually the problem does not affect a cluster as a whole, but only a large number of nodes in the cluster (for instance, for 14 out of 17 nodes nothing gets displayed).

are you using multicast or unicast? does setting send_metadata_interval to 60 or some other non-zero value help?

Carlo
Re: [Ganglia-general] Gmond udp_send_channel using the wrong network (seems hostname related)
On Thu, Jun 24, 2010 at 10:21:53AM +, Ronny wrote:
> I am facing the problem that my gmond udp_send_channel sends via the wrong network interface on a multi-homed linux machine.

there is some information on multihomed setups in the README which could help.

> The machines have a front NIC and a backend NIC. Both IPs from the NICs get resolved by the name service, but the primary IP's dns name is the system's hostname (with an IP address out of 62.48.x.x)
>
> In my clients' gmond.conf I have set:
>
>     udp_send_channel {
>       bind_hostname = yes # Highly recommended, soon to be default.
>                           # This option tells gmond to use a source address
>                           # that resolves to the machine's hostname. Without
>                           # this, the metrics may appear to come from any
>                           # interface and the DNS names associated with
>                           # those IPs will be used to create the RRDs.
>       host = 10.0.11.16
>       port = 8649
>       ttl = 1
>     }
>
> whereby 10.0.11.16 is the backend network. But this gmond seems to ignore 10.0.11.16 and sends via the primary IP address 62.48.x.x to the udp_recv_channel located on another host. A firewall between the send channel and receive channel machines using 62.48.x.x is blocking that traffic. I can't currently open the firewall.

that is what bind_hostname is meant to do AFAIK: it makes gmond use a source address that resolves to the machine's hostname, which in your case is the 62.48.x.x one. maybe you would like to use bind = 10.0.11.16 instead (host should point to your collector if using unicast, so host and bind should most of the time be different IPs in 10.0.11.x, unlike in this example)

Carlo
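The difference between host (where packets go) and bind (which local address they come from) can be seen with plain UDP sockets. This is a sketch of the general socket behaviour only, not of gmond's own C code: the helper name is made up, and the loopback address is used so the example can run anywhere.

```python
import socket


def make_sender(bind_ip=None):
    """Create a UDP socket, optionally pinned to a local source address.

    Without bind_ip the kernel picks the source address from the routing
    table when the first packet is sent; with it, every packet leaves
    with bind_ip as its source.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    if bind_ip is not None:
        # Port 0 lets the kernel choose an ephemeral source port.
        sock.bind((bind_ip, 0))
    return sock


if __name__ == "__main__":
    # Stand-in for the collector (the 'host' of udp_send_channel).
    collector = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    collector.bind(("127.0.0.1", 0))
    collector.settimeout(2)

    # Stand-in for a sending gmond with 'bind' set to a chosen address.
    sender = make_sender("127.0.0.1")
    sender.sendto(b"metric", collector.getsockname())

    _, source = collector.recvfrom(64)
    print("packet arrived from", source[0])
    sender.close()
    collector.close()
```

On a multihomed machine, replacing the loopback address given to make_sender with the backend address (10.0.11.x in the thread) is the socket-level equivalent of what setting bind is described as doing above.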
Re: [Ganglia-general] Gmond udp_send_channel using the wrong network (seems hostname related)
On Sat, Jun 26, 2010 at 03:29:17PM -0400, Vladimir Vuksan wrote:
> More than 4 years ago I reported a bug regarding gmond not honoring the mcast_if setting
>
> http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=94

mcast_if should be working fine in 3.0 since 3.0.5, could you confirm that? you should now be able to force multicast traffic to go through a specific interface by adding mcast_if to the corresponding udp_send_channel setting.

it was broken again in 3.1 though, and while it was fixed again for 3.1.2 as shown by BUG140, you would need 3.1.7 for a full fix and for the set of directives that are meant to help control all parts of the functionality, including the IP that will be used as the source (which is what bind and bind_hostname are for), independently of the interface or IPv4 routing.

> We resolved it by adding a route. It would seem that in unicast mode this should require no changes. Can you send us what your routing table looks like ?

unicast could use a different IP as the source if instructed to do so by explicitly binding to it or to the resolvable hostname, as seemed to be the case in the originally reported configuration.

agree though that documentation is a little thin around all of it (there is also some complementary explanation in the README), especially with 3.1.7, which now has several overriding settings that affect this (routing, mcast_if, and bind/bind_hostname)

Carlo
Re: [Ganglia-general] mcast_if in ganglia 3.1.2 using the wrong source address
On Sat, Jan 09, 2010 at 01:50:42PM +0100, Stefan Ott wrote:
> I'm not sure whether this has been reported yet but since I couldn't find a report, I assume it hasn't.

it has been reported [1] and a fix is available and committed in trunk in r2121 [2], and should hopefully be part of 3.1.6 when released.

> I'm using ganglia 3.1.2 and I have the following issue: if I set mcast_if in my udp_send_channel section, the packets *do* use the right interface, but the wrong source address (they use the address from the other interface). If I manually add a route to 239.2.11.72 (my multicast group) it works.

this is also poorly documented (there is an obsolete note in the README under the "How should I configure multihomed machines" section), but:

* if you are using Linux, one way to work around the issue is to add a static route to the multicast IP or network through the interface you configured in mcast_if.
* if you are using Solaris [3], then you need to apply a patch to your ganglia, or wait for 3.1.6 to use either the bind or bind_hostname parameters for udp_send_channel, or rely on the fixed mcast_if parameter, which is still not backported.
* if you are using something else and still having the issue, then it would be a good idea to try either workaround/solution and report back, so we can modify the fix further to accommodate your case as well (more changes will still be needed for sure to document this issue correctly in any case).

3.0 is suspected not to be affected, but if anyone can confirm/deny that then we could also apply this fix for 3.0.8 and close BUG94 [4] if still open.

Carlo

[1] http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=140
[2] http://ganglia.svn.sourceforge.net/viewvc/ganglia?view=rev&revision=2121
[3] http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg05230.html
[4] http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=94
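The symptom described here (right interface, wrong source address) comes from two separate socket settings: one selects the outgoing interface for multicast, the other the source address. The sketch below shows both being set together with standard socket calls. It is an illustration only, not the r2121 patch or gmond's code; the helper name is made up, and loopback is used so it can be tried without a second NIC.

```python
import socket


def make_mcast_sender(if_ip, ttl=1):
    """UDP socket that sends multicast out of the interface owning
    `if_ip` AND uses `if_ip` as the source address of its packets."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Source address: without this the kernel chooses it from routing,
    # which is how packets can leave the right NIC with the wrong IP.
    sock.bind((if_ip, 0))
    # Outgoing interface for multicast, selected by one of its addresses.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                    socket.inet_aton(if_ip))
    # Keep packets on the local segment unless told otherwise.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock


if __name__ == "__main__":
    sender = make_mcast_sender("127.0.0.1")
    print("source address pinned to", sender.getsockname()[0])
    # 239.2.11.72 is the group quoted in the thread, 8649 the usual port.
    sender.sendto(b"hello", ("239.2.11.72", 8649))
    sender.close()
```

The static route workaround mentioned for Linux achieves a similar result from outside the process, by changing what the kernel picks when the socket is not bound.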
Re: [Ganglia-general] Extending the format of gmetad.conf
On Mon, Dec 28, 2009 at 10:16:28PM +, Daniel Pocock wrote:
> I'm looking at extending the gmetad.conf format, while still making sure that it can read the existing config files.

adding a new configuration option would be the easiest way to prevent any backward incompatible change, which would otherwise force this feature to be 3.2+ only.

> My goal is to allow different sets of RRAs for different sources, while making sure the existing file format remains valid.

why do you want to have this? what is the use case for having different metric storage frequencies per cluster, and why can't it be done by having independent gmetad instead?

if you are talking about different metric storage frequencies per metric, as seems to be implied later (and which is a feature long in the wishlist), then wouldn't it be safe to assume you want that storage for that metric regardless of source? if that is the case it will simplify the implementation and will only require something like RRAs_template as shown in d, and not need a, b, or c at all (or at least not as part of the first implementation).

currently in data_source the polling interval is optional, and so the same could be done with the template to apply in the long run, but at the cost of complicating the configuration parser for, IMHO, no really good reason.

using a script is definitely interesting because of the flexibility it allows for, but as mentioned before it is a problem because of the additional forking required, and also problematic because it will keep part of the logic outside gmetad.

Carlo
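To show why a per-metric template that ignores the data source keeps things simple, here is a small sketch of the lookup it implies. Nothing here is gmetad code or an existing gmetad.conf option: matching metric names with shell-style patterns is an assumption of mine, and the RRA strings are illustrative values in rrdtool's RRA syntax.

```python
import fnmatch

# Fallback used when no template matches (illustrative values only).
DEFAULT_RRAS = ("RRA:AVERAGE:0.5:1:244", "RRA:AVERAGE:0.5:24:244")


def rras_for_metric(metric, templates, default=DEFAULT_RRAS):
    """Return the RRA definitions to use when creating the RRD for
    `metric`. `templates` is an ordered list of (pattern, rras) pairs;
    the first matching pattern wins, whatever the data source is."""
    for pattern, rras in templates:
        if fnmatch.fnmatchcase(metric, pattern):
            return rras
    return default


if __name__ == "__main__":
    # Hypothetical templates: keep CPU metrics at finer resolution.
    templates = [
        ("cpu_*", ("RRA:AVERAGE:0.5:1:2880", "RRA:AVERAGE:0.5:24:244")),
    ]
    for metric in ("cpu_user", "load_one"):
        print(metric, "->", rras_for_metric(metric, templates))
```

Because the whole table could live in the configuration file, the storage applied to each metric can be read from gmetad.conf alone, and no external process has to be forked to find it.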
Re: [Ganglia-general] Extending the format of gmetad.conf
On Mon, Jan 04, 2010 at 06:55:36AM -0500, Jesse Becker wrote:
> On Mon, Jan 4, 2010 at 03:46, Carlo Marcelo Arenas Belon care...@sajinet.com.pe wrote:
>>> My goal is to allow different sets of RRAs for different sources, while making sure the existing file format remains valid.
>>
>> why do you want to have this? what is the use case for having different metric storage frequencies per cluster and why can't be done by having instead independent gmetad?
>
> I can think of reasons why you'd want different frequencies for the same metric, mostly having to do with required data retention policies and lack of resources (disk space). It could be done with different gmetad processes, but that gets complicated for a simple cluster (multiple gmetad polling the same gmond, the same data is displayed in two different locations).

Of course I can think of reasons why that might be something you would want to have, and that is why I said it might be needed in the long run if those reasons are genuine, but I would be surprised if there is a reason good enough to do that from the very beginning, when using multiple gmetad would solve it for now IMHO.

The point is that the syntactic sugar to make that work would be far more complicated and difficult to do in a 3.1 compatible way than just adding templates, and therefore I tend to believe that it would make more sense as a 3.2 feature. having different RRAs independently of the datasource, on the other hand, has been in the wishlist since even before 3.1.0 was released, and is something you would want backported ASAP instead (probably even to 3.0 if there is demand).

>> if you are talking about different metric storage frequencies per metric as it seems to be implied later (and which is a feature long in the wishlist) then wouldn't be safe to assume you want that storage for that metric regardless of source?, if that is the case it will simplify the implementation and will only require something like RRAs_template as shown in d and not need a, b, or c at all (or at least not as part of the first implementation).
>>
>> currently in data_source the polling interval is optional and so the same could be done with the template to apply in the long run, but complicating the configuration parser, for IMHO no really good reason.
>>
>> using a script is definitely interesting because of the flexibility it allows for, but as mentioned before a problem because of the additional forking required and also problematic because it will keep part of the logic outside gmetad.
>
> Perhaps I'm misunderstanding how using a separate script would work, but there would only be a fork storm during initial RRD creation, correct?

it depends on what the script does, but that is correct in the case where the script only returns the RRAs back to gmetad as you suggested.

still, the disadvantage (as mentioned above) of not being able to know which RRAs apply in each case from just reading gmetad.conf remains, and would probably imply that the best way to do this is to make gmetad modular (just like gmond) and then allow it to write its own configuration, or use a default one that could be used as a starting point, just like `gmond -t` allows for.

> I had assumed that the current behavior of keeping existing RRD files would remain. Thus, the only time we would really have to worry about forking off hundreds/thousands of processes would be when a new cluster is created, or when the RRD files are all removed for some reason. Under normal operating circumstances, the RRD files already exist, so there's no need to run the creation script.

or when gmetad is restarted and has to figure out again which RRAs apply in each case for the updates, unless gmetad.conf has all that information somehow in a static way (by using for example the modular solution instead of just a script).

Carlo
Re: [Ganglia-general] Debugging problems with Ganglia
On Thu, Dec 17, 2009 at 01:58:31PM -0500, Douglas Wade Needham wrote:
> Here are the details:
>
> Version: 3.1.2
> Host: VM running Debian Etch (1 VCPU, 512 MB) w/ 2.6.26-2-xen-amd64 kernel
> HW Host: Dual AMD Opteron 244's running at 1.8GHz

I assume "Host" here really means the virtual server allocated for this task, not to be confused with "HW Host" below, which is also running Debian as the virtual server host, and that all of them are stock Debian Etch, correct?

I have seen severe IO starvation issues (including panics) with the stock Debian guest kernel, which only went away after upgrading it to 2.6.32. Here "guest" refers to the kernel used in the virtual machine, not to the kernel in the server that hosts all the VMs, which will be more difficult to upgrade.

> When we have a program connect across our 1Gbps network connection to this gmetad, we end up with very gappy data, if the hosts don't just get marked as down and the RRDs stop updating. I have already started pressuring those who would approve moving our RRDs to a memory fs, but in the meantime... :(

IMHO this is your only solution, considering the IO issues that VMs have and the way gmetad scales and uses RRD (which is also very IO unfriendly), even though it has been mentioned that rrdcached could help somehow.

> I have been running straces which have indicated that we occasionally have threads which block on the futex() call for 10+ seconds, and occasionally for as long as 500+ seconds.
> To limit the impact of the strace (which itself can cause the same problem), I even had to do:
>
>   strace -f -tt -T -s 160 -e trace=process,futex,signal -o gmetad.strace.out2 -p 12618

Interesting, and most likely IO related, I suspect. You would probably get a better picture with a pthread-specific tracer like mutrace instead (warning: fairly new code, and only packaged for Fedora 12 AFAIK):

http://git.0pointer.de/?p=mutrace.git

> But in doing this, I have come up with the following questions:
>
> 1) Is there any difference between '-d 1' and '-d 10'? Or between 'debug 1' and 'debug 10' in the config file? In looking through the code, it does not seem to be the case. I would just like confirmation.

Not for gmetad AFAIK, but there are several arbitrary uses of debug_level, which usually means you want to use the highest level possible most of the time anyway.

> 2) Am I seeing correctly that we have the following pthread_mutex definitions?
>
> - server_socket_mutex
> - server_interactive_mutex
> - Allocated mutex for root summary.
> - Allocated mutex for each grid partial-summary (1 per data source)
> - Allocated mutex for each cluster partial summary (1+ per data source)

There is also an rrd_mutex for updating the RRDs, and I would recommend keeping away from the multiple summary mutexes if you want to keep your sanity.

> 3) Would there be any interest in patches against 3.1.2 to watch calls to pthread_mutex_lock() and pthread_mutex_unlock() to display when a call took more than a certain amount of time to return, or if a lock was held for longer than a certain time?

Definitely interesting, preferably enabled at compile time to avoid any added performance degradation and race conditions of its own, but probably OK too if only enabled at run time through debug mode. Beware, though, that trunk (where the patch would need to be applied first) and 3.1 (where 3.1.2 comes from) might not be in sync on this code, which has seen several changes lately.
It would also be interesting to see how a patched 3.0.7 (or the 3.0 branch HEAD) performs in this case as an alternative. There is also a Python version of gmetad in trunk which might help with what you are doing.

> This last one comes, as given my suspicions on thread starvation, I am going to have to instrument a gmetad a bit more to look at the mutexes and how long we are in critical sections.

Beware that the gmetad code is a little rusty, so report back if you see anything else that doesn't look quite right.

Carlo

PS. This thread might be a better fit for ganglia-developers.
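The instrumentation proposed in question 3 can be illustrated outside gmetad. What follows is only a minimal Python sketch of the idea, not gmetad code (gmetad is written in C and uses pthread mutexes); `TimedLock`, its two thresholds and the `events` list are hypothetical names made up for this example.

```python
import threading
import time


class TimedLock:
    """Wrap a lock and record slow acquisitions and long holds.

    Hypothetical helper for illustration only; thresholds are in seconds.
    """

    def __init__(self, name, wait_threshold=1.0, hold_threshold=1.0):
        self.name = name
        self.wait_threshold = wait_threshold
        self.hold_threshold = hold_threshold
        self.events = []          # (kind, lock name, seconds) tuples
        self._lock = threading.Lock()
        self._acquired_at = None

    def __enter__(self):
        start = time.monotonic()
        self._lock.acquire()
        now = time.monotonic()
        self._acquired_at = now
        waited = now - start
        # We hold the lock here, so appending to events is serialized.
        if waited > self.wait_threshold:
            self.events.append(("slow-acquire", self.name, waited))
        return self

    def __exit__(self, exc_type, exc, tb):
        held = time.monotonic() - self._acquired_at
        # Record before releasing so the append is still under the lock.
        if held > self.hold_threshold:
            self.events.append(("long-hold", self.name, held))
        self._lock.release()
        return False
```

Used as `with TimedLock("rrd_mutex", 10, 10): ...`, it would flag both kinds of delay mentioned in the question. As noted in the reply, the bookkeeping itself adds overhead and changes timing, which is the argument for making it a compile-time or debug-only option in the real code.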
Re: [Ganglia-general] Ganglia trouble (sorry)
On Wed, Dec 16, 2009 at 08:37:44AM +0100, Tommy Schneider wrote:
> Also, for RPMS... the only RPMs i found (for EL5 Linux) are for ganglia 3.0.7

Have you tried using the EPEL packages for 3.0.7 instead? Why would you need to move to 3.1.2 manually? 3.0 is still a supported branch/version anyway.

Carlo
Re: [Ganglia-general] Hadoop and ganglia
On Wed, Dec 09, 2009 at 11:33:59PM -0500, John Martyniak wrote:
> So on the overview page, none of the graphs have any data. The number of hosts is correct, but the number of CPUs is 0. If I pick cpu_report as the metric, it results in a broken image for each node detail graph, and the overview graphs show no data.

Do you have all gmonds reporting their metrics too (i.e. mute = no)?

> If I drill down to look at specific data about each node, the graphs show information but not all graphs, and some show broken images.

The web frontend for 3.0 usually assumes that the core metrics are being reported. If they are not, by configuration, then those reports will be broken (as expected, because you explicitly told gmond not to generate that data).

Carlo

--
Return on Information: Google Enterprise Search pays you back
Get the facts. http://p.sf.net/sfu/google-dev2dev
___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] Hadoop and ganglia
On Thu, Dec 10, 2009 at 08:52:56AM -0500, John Martyniak wrote:
> Yes I have mute=yes at this time, so as to filter out the localhost,

By localhost you mean the web/gmetad machine? Then you might as well just shut down that gmond and point gmetad at one of the gmonds in the cluster of Hadoop boxes as a data source instead, since it is just a monitoring machine and not really part of the cluster.

If you also want reports for the core metrics (like CPU utilization and count) on the machines that are part of the cluster, then all of them should have mute = no and reasonable thresholds and collection rates, such as the ones provided by the default configuration.

Carlo
Re: [Ganglia-general] Problems with Ganglia web interface
On Thu, Dec 10, 2009 at 04:17:18PM +0100, Samuel Gimeno wrote:
> The XML of all gmond and gmetad is well formed, all echos OK. Did you say something about apparmor problems? What can I do to fix it? I think that could be the problem; all the other things I tried are good...

No idea, as I don't use openSUSE, but Google suggested you try:

http://developer.novell.com/wiki/index.php/Apparmor_FAQ#How_do_I_enable.2Fdisable_AppArmor.3F

Carlo
Re: [Ganglia-general] gmond 3.1.2 becomes deaf in Solaris SPARC
On Thu, Dec 10, 2009 at 10:53:50AM -0500, Jorge Medina wrote:
> This time my gmond stayed awake much longer, but eventually went deaf (after 15 hours).

Then you have another problem (maybe in addition). Could you check whether, by chance, that is no longer the case with the following package:

http://sajino.sajinet.com.pe/ganglia/ganglia-3.1.5.2147.tar.gz

This one should also have fixed the issue you had previously while trying to build ganglia (BUG215) [1].

Also, unless you have a very good reason not to, you are most likely better off building ganglia as a 32-bit application (with 32-bit dependent libraries), a configuration that is known to be tested more. Either way, the behaviour you see is most likely a bug, and it would be a good idea to track it down, report it and get it fixed in the long run.

Carlo

[1] http://sajino.sajinet.com.pe/ganglia/ganglia-3.1.5.2147.tar.gz
Re: [Ganglia-general] Problems with Ganglia web interface
On Wed, Dec 09, 2009 at 05:47:39PM +0100, Samuel Gimeno wrote:
> I compiled from source on openSUSE 11.1. All went well, but I have some problems with the web interface; alternately it shows some error messages:
>
> Ganglia cannot find a data source. Is gmond running?
> Cannot find any metrics for selected cluster janus, exiting.
> Check ganglia XML tree (telnet 127.0.0.1 8651)
> There was an error collecting ganglia data (127.0.0.1:8651): XML error: Invalid document end at 53

Do you mean it sometimes doesn't show that message and just works? Or does it always show an error, regardless of how many times you reload or hit the "get fresh data" button?

If the former, you most likely have a problem with some hostname or metric that is generating invalid XML, or some other problem that makes your XML invalid sometimes (a very slow gmetad, very slow IO, or a bug in the version of ganglia you compiled, which you forgot to mention).

If the latter, you probably have a problem with selinux, apparmor, iptables or some other equivalent system (I don't have openSUSE, so I can't confirm) which is preventing the ganglia web application from connecting to the gmetad process on ports 8651 and 8652. You forgot to confirm it, but I would assume from your explanation that gmetad runs on the same host where the web application is installed, as recommended.

> And gmond is running, the ganglia XML tree shows metrics and the XML is correct. I tried telnet localhost 8649 and telnet localhost 8651 on the host running gmetad, and telnet localhost 8649 gives valid XML with metrics. Gmond and gmetad work well on host and clients.

If the problem is sporadic, you might be able to dump the contents of TCP:8651 in a loop and pass them to an XML validator (like xmllint) for hints.

> I'm going crazy trying to make it work. It's the second time that I install it, and before it worked well, with only some misconfiguration problems.

Sadly, openSUSE doesn't have official packages for ganglia 3 AFAIK, but there are some unofficial ones you might be able to use from:

http://software.opensuse.org/search?baseproject=openSUSE%3A11.2p=1q=ganglia

Carlo
Re: [Ganglia-general] multicast source address on Solaris
On Wed, Dec 09, 2009 at 06:02:18PM +, River Tarnell wrote:
> Carlo Marcelo Arenas Belon:
> > the attached draft patch should correct the problem
>
> thanks - after applying this patch, mcast_if works correctly. will this be included in the next release?

It is already committed to trunk (what will be 3.2 sometime) with r2121, but it hasn't yet been backported to 3.1 (which is now being prepared for 3.1.6).

Carlo
Re: [Ganglia-general] Hadoop and ganglia
On Wed, Dec 09, 2009 at 04:49:43PM -0500, John Martyniak wrote: /* Feel free to specify as many udp_send_channels as you like. Gmond used to only support having a single channel */ udp_send_channel { mcast_join = 239.2.11.71 port = 8649 ttl = 1 } /* You can specify as many udp_recv_channels as you like as well. */ udp_recv_channel { mcast_join = 239.2.11.71 port = 8649 bind = 239.2.11.71 } tcp_accept_channel { port = 8649 } This is the config from the gmetad.conf: data_source my cluster 10.1.1.25 cloud1 cloud1 doesn't make sense there, unless that is the hostname of one of the servers running gmond and that you want to have as a backup of 10.1.1.25 everything else is commented out. what else is commented out?, hope you don't mean everything but what you showed above for gmond.conf was commented out but only the information in gmetad.conf, right? Also this is version 3.0.7. BTW if you want to use the default configurations you can even run without a gmond.conf, but be sure to get the right matching configuration on your hadoop cluster and restart everything. some more information of interest, including a nice detailed description from : http://ganglia.info/?p=88 Carlo -- Return on Information: Google Enterprise Search pays you back Get the facts. http://p.sf.net/sfu/google-dev2dev ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] Problems with Ganglia web interface
On Wed, Dec 09, 2009 at 11:02:14PM +0100, Samuel Gimeno wrote:
> The version compiled was 3.1.2. Which xmllint command should I use to see that the XML is well formed?

Assuming you also have netcat (or nc) and are running this from the gmetad server:

  $ netcat 127.0.0.1 8651 | xmllint --noout - && echo OK || echo BROKEN
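For a sporadic failure, the one-shot check above can be repeated in a loop, as suggested earlier in this thread. The following is a minimal Python sketch of that loop, using the standard library parser instead of xmllint; the function names, the number of rounds and the delay are assumptions made for this example. Like `xmllint --noout`, it only checks that the document is well formed, not that it is valid against the DTD.

```python
import socket
import time
import xml.etree.ElementTree as ET


def fetch_dump(host="127.0.0.1", port=8651, timeout=10.0):
    """Read everything the XML port sends until the peer closes."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def check_xml(document):
    """Return (True, "") if well formed, else (False, parser message)."""
    try:
        ET.fromstring(document)
    except ET.ParseError as err:
        return False, str(err)
    return True, ""


def poll(rounds=10, delay=5.0, host="127.0.0.1", port=8651):
    """Fetch and check the dump several times, printing one line per try."""
    for _ in range(rounds):
        try:
            ok, message = check_xml(fetch_dump(host, port))
        except OSError as err:
            ok, message = False, "connection failed: %s" % err
        stamp = time.strftime("%H:%M:%S")
        print(stamp, "OK" if ok else "BROKEN " + message)
        time.sleep(delay)


if __name__ == "__main__":
    poll()
```

The parser message on a BROKEN line includes a line and column, which may help narrow down whether a particular host or metric name is producing the bad output.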
Re: [Ganglia-general] Gaps in graphs
On Sun, Dec 06, 2009 at 09:10:06AM +, Daniel Pocock wrote:
> The code from trunk does support the new rrdcached feature that comes with newer versions of RRDTool, but that code is currently still in development.

rrdtool 1.4 was released already, including rrdcached. What is still in development?

> Some very big sites use that in production already - it is backported on the 3.1.3/4/5 betas - it is highly recommended

There doesn't seem to be a reference to it in the release notes or documentation, and the code from trunk that references it doesn't seem to be committed in 3.1. Could you elaborate?

Carlo

--
Join us December 9, 2009 for the Red Hat Virtual Experience, a free event focused on virtualization and cloud computing. Attend in-depth sessions from your desk. Your couch. Anywhere. http://p.sf.net/sfu/redhat-sfdev2dev
___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] Ganglia 3.1.5 beta ready for final testing
On Mon, Dec 07, 2009 at 05:19:47PM +, Paul Sobey wrote:
> On Mon, 7 Dec 2009, Carlo Marcelo Arenas Belon wrote:
> > It should work fine but if you are feeling brave and want to try what would happen also if using C99 then try (beware the name of the package was changed but will still unpack over the same directory, so you need to cleanup after building one package if looking at the other):
> >
> > http://sajino.sajinet.com.pe/ganglia/el4-ganglia-3.1.5.2145.tar.gz (!C99)
> > http://sajino.sajinet.com.pe/ganglia/f12-ganglia-3.1.5.2145.tar.gz (C99)
> >
> > the package that was bootstrapped in fedora 12 (f12) would use C99 to build the ganglia sources while the other package (el4) should be otherwise equivalent to the one you tested before.
>
> I haven't bothered trying the el4 package since you stated it was (probably) equivalent. Confirm that the f12 package doesn't compile against a new python 2.6.2, giving the same error as mentioned earlier in the thread.

Oops, I somehow forgot to mention the workaround for that, which is to do the following before calling configure:

  $ ac_cv_prog_cc_c99=no
  $ export ac_cv_prog_cc_c99

Could you validate that the following package works in your setup:

http://sajino.sajinet.com.pe/ganglia/ganglia-3.1.5.2147.tar.gz (*)

It works on Solaris 10u8 x86 using both SUNWgcc and Sun Studio 12u1, built against SUNWPython at least, but it will most likely need a workaround to disable C99 on Solaris 8 and 9 (CC Daniel to check that).

Carlo

(*) not the official 3.1.5.2147
Re: [Ganglia-general] Ganglia 3.1.5 beta ready for final testing
On Sun, Dec 06, 2009 at 05:09:08PM +, Paul Sobey wrote: On Saturday 05 December 2009 18:01:00 Carlo Marcelo Arenas Belon wrote: (*) this is not the official 3.1.5.2139 package but one that was patched additionally with relevant backports from trunk and bootstrapped in Centos 4 for added injury. Compiles perfectly against a new python 2.6.2 - thankyou! I thought the following thread might be of use: http://bugs.python.org/issue1759169 but it would seem that I was either wrong or that you've worked around it. that it is still relevant for your python binaries as you would have otherwise problems if using C99. the ganglia package you used worked around it by not building with C99 (which is what bootstrapping with CentOS 4 does) but still getting a fix for BUG215 so it wouldn't need to tweak CC or CFLAGS : http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=215 I haven't tested it yet - will do so tomorrow. It should work fine but if you are feeling brave and want to try what would happen also if using C99 then try (beware the name of the package was changed but will still unpack over the same directory, so you need to cleanup after building one package if looking at the other): http://sajino.sajinet.com.pe/ganglia/el4-ganglia-3.1.5.2145.tar.gz (!C99) http://sajino.sajinet.com.pe/ganglia/f12-ganglia-3.1.5.2145.tar.gz (C99) the package that was bootstrapped in fedora 12 (f12) would use C99 to build the ganglia sources while the other package (el4) should be otherwise equivalent to the one you tested before. I was under the impression that the python module was needed to get per volume disk stats, is that not the case? there is a python module called multidisk.py that does that, but AFAIK it doesn't support ZFS and is linux (might work partially in others) only though. We have lots of zfs volumes and I'd like to be able to graph usage of each, and hopefully use rrd's trending to get some sort of prediction when I'll need a bigger fileserver. 
fixing multidisk.py to understand zfs (specially the difference between zpools and file systems) would be a solution for that and unless you decide to ignore the extra metrics which will be created otherwise, most likely (if it even works) you can also make a script which will collect the values you will be interested and use gmetric to generate the right metrics as an alternative (which doesn't require python support and would work even with ganglia 3.0) Also is it safe to use this version with a collector gmond @3.1.2 or should I upgrade all? That's a slightly more tiresome operation which take me a while a longer... all version of 3.1.x are compatible between them, so you don't really need to upgrade all the others, unless you want to. Carlo -- Join us December 9, 2009 for the Red Hat Virtual Experience, a free event focused on virtualization and cloud computing. Attend in-depth sessions from your desk. Your couch. Anywhere. http://p.sf.net/sfu/redhat-sfdev2dev ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
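The gmetric alternative described above could look roughly like the following. This is a minimal Python sketch under stated assumptions, not a tested ZFS collector: it assumes a `zfs` command that accepts `list -H -p -o name,used,available` (the `-p` option may not exist on older Solaris releases), and the metric naming scheme `zfs_<dataset>_<field>` and the function names are made up for this example. A plain shell script calling gmetric would avoid the Python dependency on the monitored host.

```python
import subprocess


def parse_zfs_list(text):
    """Parse tab-separated `name used available` lines into dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, used, avail = line.split("\t")
        rows.append({"name": name, "used": int(used), "avail": int(avail)})
    return rows


def gmetric_commands(rows, gmetric="gmetric"):
    """Build one gmetric command line per dataset and field."""
    commands = []
    for row in rows:
        safe = row["name"].replace("/", "_")   # hypothetical naming scheme
        for key in ("used", "avail"):
            commands.append([
                gmetric,
                "--name", "zfs_%s_%s" % (safe, key),
                "--value", str(row[key]),
                "--type", "double",            # byte counts can exceed uint32
                "--units", "bytes",
            ])
    return commands


def publish():
    """Collect the values once and hand them to gmetric."""
    out = subprocess.run(
        ["zfs", "list", "-H", "-p", "-o", "name,used,available"],
        check=True, capture_output=True, text=True,
    ).stdout
    for command in gmetric_commands(parse_zfs_list(out)):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    publish()
```

Run from cron at an interval shorter than the metric's lifetime, this would give one RRD per dataset and field, which is the per-filesystem usage graph asked for in the quoted message.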
Re: [Ganglia-general] Ganglia 3.1.5 beta ready for final testing
On Fri, Dec 04, 2009 at 04:12:30PM +, Paul Sobey wrote: On Wed, 25 Nov 2009, Carlo Marcelo Arenas Belon wrote: On Tue, Nov 24, 2009 at 09:05:12PM -0800, Bernard Li wrote: I should clarify, what I meant to say was: The resulting tarball should build fine on Solaris without Python support I don't believe this is a regression from previous release(s) not really; 3.1.4 builds fine with python support in Solaris 10 as shown in the thread you linked to : http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg05098.html the new bootstrapping might also affect Solaris 8 or older releases, which is why I don't think that expediting this release will make much sense, specially considering that there are almost no benefits IMHO. Apols but I still can't make 3.1.4 build with python support - and there are no csw dependencies anywhere in the chain for me. I welcome suggestions though - I've just realised I need the python module to watch the 200 zfs filesystems on one of my thumpers! you are a brave man, and eventhough you had done already your fair share of beta testing hope you don't mind trying a patched snapshot of what would hopefully become later 3.1.6 and which might solve your problem : http://sajino.sajinet.com.pe/ganglia/ganglia-3.1.5.2139.tar.gz (*) it of course builds fine for me using the setup described on the linked URL above. also curious on how having working python support would help with zfs? Carlo (*) this is not the official 3.1.5.2139 package but one that was patched additionally with relevant backports from trunk and bootstrapped in Centos 4 for added injury. -- Join us December 9, 2009 for the Red Hat Virtual Experience, a free event focused on virtualization and cloud computing. Attend in-depth sessions from your desk. Your couch. Anywhere. http://p.sf.net/sfu/redhat-sfdev2dev ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing
On Wed, Dec 02, 2009 at 07:41:39PM +, Daniel Pocock wrote: Therefore, the approach might need to be some combination of the solutions. E.g. a configure option that allows people to choose the new behaviour or the old behaviour. -1, this will double our supported paths for almost no gain and knowing that at least 50% are broken, and still underscores the nature of the problem. because changing the initialization would affect also (in a platform specific way) things like threaded gmond modules and the resources they rely on just as an example. As we know the new behaviour works on Solaris and Linux with the version of APR that was tested it with, which is also a moving target. then the package can be built the new way on those platforms by default. On BSD, users could choose what they want by setting a configure option. If a user had an updated apr (provided such update is feasible), they might compile with the new behaviour. again, this is not a BSD specific problem (indeed I suspect that solaris might be affected as well, specially in cases where APR was compiled to use port_getn), because then apr_poll_* has slightly different semantics than poll and therefore could result in platform specific failures that might not be as obvious as it was kqueue for the BSD. the problem that we were trying to solve was just to propagate correctly the status from the gmond daemon to the caller and for a proof of concept in that direction (as suggested before) refer to : http://www.mail-archive.com/ganglia-develop...@lists.sourceforge.net/msg05390.html Carlo -- Join us December 9, 2009 for the Red Hat Virtual Experience, a free event focused on virtualization and cloud computing. Attend in-depth sessions from your desk. Your couch. Anywhere. http://p.sf.net/sfu/redhat-sfdev2dev ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing
On Wed, Dec 02, 2009 at 01:57:44AM +, Carlo Marcelo Arenas Belon wrote:
> On Tue, Dec 01, 2009 at 10:20:32PM +, Daniel Pocock wrote:
> > - Can you easily re-compile APR with a different poll implementation? I think you can change it from configure.
>
> Which option?, --enable-other-child doesn't make a difference and considering how many different versions of APR are installed in all affected systems I would be surprised this to be an APR issue.

And surprised I am, as the problem goes away if APR is forced to use poll instead of kqueue. But that of course requires a patched version of apr (including bootstrapping) and is probably not an option, unless we go back to the dark ages of including all dependencies statically.

If anyone is interested, I am attaching a patch for apr-1.3.9 which could be used to fix this problem on {Free,Net,Open}BSD. It also requires that ganglia be linked with the patched library, by doing something like the following (using /opt/ganglia to avoid clashing with the system-provided packages, and ignoring the fact that you would need to be root with a Bourne shell to execute this incantation, which is very unlikely to be a good idea anyway):

  # mkdir -p /opt/ganglia
  # tar -xvzf apr-1.3.9.tar.gz
  # cd apr-1.3.9
  # patch -p1 < apr-1.3.9-configure-disablekqueue.patch
  # ./configure --prefix=/opt/ganglia
  # make
  # make install
  # cd ..
  # tar -xvzf ganglia-3.1.5.tar.gz
  # cd ganglia-3.1.5
  # ./configure --prefix=/opt/ganglia --with-libapr=/opt/ganglia/bin/apr-1-config
  # make
  # make install
  # LD_LIBRARY_PATH=/opt/ganglia/lib /opt/ganglia/bin/gmond

Carlo

PS. DragonFlyBSD will still be affected, and MacOS X was probably, luckily, not.

--- apr-1.3.9/configure Mon Sep 21 14:59:34 2009
+++ apr-1.3.9/configure Wed Dec 2 01:45:45 2009
@@ -5762,6 +5762,10 @@
 ac_cv_o_nonblock_inherited=yes
 fi
+ if test -z $ac_cv_func_kqueue; then
+test x$silent != xyes && echo "setting ac_cv_func_kqueue to \"no\""
+ac_cv_func_kqueue=no
+ fi
 ;;
 *-netbsd*)
@@ -5792,6 +5796,10 @@
 ac_cv_o_nonblock_inherited=yes
 fi
+ if test -z $ac_cv_func_kqueue; then
+test x$silent != xyes && echo "setting ac_cv_func_kqueue to \"no\""
+ac_cv_func_kqueue=no
+ fi
 ;;
 *-freebsd*)
@@ -5838,15 +5846,12 @@
 fi
 fi
-# prevent use of KQueue before FreeBSD 4.8
-if test $os_version -lt 48; then
-
+# prevent use of KQueue
 if test -z $ac_cv_func_kqueue; then
 test x$silent != xyes && echo "setting ac_cv_func_kqueue to \"no\""
 ac_cv_func_kqueue=no
 fi
-fi
 ;;
 *-k*bsd*-gnu)
Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing
On Wed, Dec 02, 2009 at 10:36:02AM +, Daniel Pocock wrote:
> Carlo Marcelo Arenas Belon wrote:
> > but that of course requires a patched version of apr (including bootstrapping) and is probably not an option, unless we go back to the dark ages of including all dependencies statically.
>
> Maybe we can fix apr too

Changing APR to use poll instead of kqueue is a way to fix it, but to be able to use that fix we would need to go back to having our own apr tree.

> I found a description of the same issue in a Google search on the subject:
>
> http://www.google.com/search?q=kqueue+fork+process+bad+file+descriptor;
> http://www.mail-archive.com/freebsd-hack...@freebsd.org/msg69516.html
>
> Can you try re-enabling kqueue and patching apr to use rfork()?

That doesn't work, and it now fails on sending the metrics, because of course this time the parent process closes that socket and the child cannot use it after that.

The only viable solution I see is to delay the creation of all the sockets until daemonized, as was being done originally. If you really need the parent to report back on issues with that, then you are going to have to keep the parent around and send the status back from the child, until it gets into the main loop, through a unix socket or similar, which as you suggested originally was another option.

Carlo
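The last option in the message above (create the sockets only after forking, and have the child report its start-up status back to a waiting parent) can be sketched as follows. This is a minimal Python illustration of the pattern, not gmond code: gmond is written in C on top of APR, the message suggests a unix socket while this sketch uses a plain pipe, and `start_daemon` and its status strings are hypothetical names. The child here exits right after reporting, where a real daemon would detach and enter its main loop.

```python
import os
import socket


def start_daemon(port, host="127.0.0.1"):
    """Fork, bind in the child, and return (pid, status) in the parent.

    status is "OK" or "ERR <reason>", as written by the child.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: create the listening socket only now, after the fork.
        os.close(read_fd)
        try:
            listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            listener.bind((host, port))
            listener.listen(5)
            os.write(write_fd, b"OK")
        except OSError as err:
            os.write(write_fd, ("ERR %s" % err).encode())
            os._exit(1)
        os.close(write_fd)
        # A real daemon would call setsid(), reopen stdio and loop here.
        os._exit(0)
    # Parent: wait for the child's verdict, then pass it to the caller.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as status_pipe:
        status = status_pipe.read().decode()
    os.waitpid(pid, 0)   # only because this sketch's child exits at once
    return pid, status
```

With this arrangement, no descriptor created before the fork needs to survive it, so the kqueue inheritance problem discussed in this thread would not arise, and an init script can still learn from the parent's exit status that a port was already taken.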
Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing
On Wed, Dec 02, 2009 at 11:17:26AM +, Daniel Pocock wrote:
> Carlo Marcelo Arenas Belon wrote:
> > On Wed, Dec 02, 2009 at 10:36:02AM +, Daniel Pocock wrote:
> > > Can you try re-enabling kqueue and patching apr to use rfork()?
> >
> > Doesn't work, and fails now on sending of the metrics, because of course this time the parent process closes that socket and the child cannot use it after that. The only viable solution I see is to delay the creation of all the sockets until daemonized as it was being done originally.
>
> The problem with that is that if another process is already listening on one of the ports wanted by gmond, then the listener set up will fail, but if the problem is only detected after daemonizing, then the caller doesn't know about the failure.

But that is something that could be fixed at the caller level, by just checking whether the port is already bound to something before calling gmond. I agree that it is not elegant, but it is better than the current situation, where you can't start gmond at all.

> > If you really need to avoid having the parent report back on issues on that then you are going to keep the parent around and send the status back from the child until getting into the main loop through a unix socket or similar instead as you suggested originally was another option.
>
> That is not as easy to implement in apr as the apr_proc_detach() call.

Frankly, I don't much like all the abstractions that apr_* provides, because they make simple things like this more complicated (especially because of the unintended side effects), but since apr_proc_detach is just calling fork and reopening the 3 std filehandles, it shouldn't be that difficult to work around.

> apr_proc_fork() is described as the only call in apr that is not portable. apr_proc_create() could be used to invoke another gmond process, but I'm not sure that apr guarantees to preserve the file descriptors and memory allocations across that call.

apr_proc_fork() is not called by apr_proc_detach() AFAIK; indeed, I was surprised to see it even existed when I noticed that apr_proc_detach calls fork() directly.

> Maybe the problem has something to do with the way detach recycles stdin/stdout/stderr? As a quick test, could you try modifying gmond.c so that it calls fork() directly rather than calling apr_proc_detach()?

fork() doesn't work because the kqueue filehandle is not inherited. Using rfork() instead doesn't work either, because all filehandles are closed by doing exit(0) in the parent, so it fails in the same way that apr_proc_detach() does when changed to use rfork().

Carlo
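The caller-level check mentioned above (test whether the port is already bound before starting gmond) could be approximated like this. It is a minimal Python sketch with a hypothetical `port_in_use` helper, not part of ganglia or its init scripts, and it is inherently racy: another process can still grab the port between the check and the moment gmond binds it. It also does not model sockets that share a port through options such as SO_REUSEADDR, which may matter for multicast receive channels.

```python
import errno
import socket


def port_in_use(port, host="0.0.0.0", kind=socket.SOCK_STREAM):
    """Return True if binding (host, port) fails with EADDRINUSE.

    Pass kind=socket.SOCK_DGRAM to probe a UDP port instead of TCP.
    """
    probe = socket.socket(socket.AF_INET, kind)
    try:
        probe.bind((host, port))
    except OSError as err:
        if err.errno == errno.EADDRINUSE:
            return True
        raise                     # e.g. permission problems: not our case
    finally:
        probe.close()
    return False


if __name__ == "__main__":
    # 8649 is the port used by the gmond examples in this archive.
    for label, kind in (("tcp", socket.SOCK_STREAM), ("udp", socket.SOCK_DGRAM)):
        state = "busy" if port_in_use(8649, kind=kind) else "free"
        print("8649/%s is %s" % (label, state))
```

An init script could refuse to start gmond, with a clear message, when either probe reports the port as busy, which covers the failure that would otherwise only be detected after daemonizing.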
Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing
On Wed, Dec 02, 2009 at 11:48:51AM +, Daniel Pocock wrote: fork() doesn't work because the kqueue filehandle is not inherited; using rfork() instead doesn't either because all filehandles are closed by doing exit(0) in the parent and so fails in the same way that changing apr_proc_detach() does when changed to use rfork() instead. I'm not a BSD expert, do you know if there is any ioctl or something that can be used to tell BSD to keep the file descriptors for the child process? not a BSD expert either, but I would think that would be very unlikely. I would suggest reverting r2025 in trunk and start looking for an alternative solution, but would be probably just easier to revert r2043 for 3.1 as well to solve the release blocker, with the possibility of adding some logic to the init script to try to help with the test case you were trying to prevent by the original feature. Carlo PS. apache httpd must have a solution as they don't seem to have kqueue disabled, but that solution is probably just to delay the port binding as was done originally (except that they manage better the failures) -- Join us December 9, 2009 for the Red Hat Virtual Experience, a free event focused on virtualization and cloud computing. Attend in-depth sessions from your desk. Your couch. Anywhere. http://p.sf.net/sfu/redhat-sfdev2dev ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
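The init-script idea mentioned above (check whether the port is already bound before calling gmond) could be sketched roughly as follows. This is only a sketch under assumptions: it presumes a POSIX sh and awk (on older Solaris, nawk) and a netstat that marks listening TCP sockets with LISTEN; the function name port_is_listening is hypothetical and not part of ganglia.

```shell
#!/bin/sh
# Hypothetical helper, not part of ganglia: reads `netstat -an` style output
# on stdin and succeeds if a LISTEN line has a field ending in port $1.
# Accepts both "addr:port" (Linux) and "addr.port" (BSD, Solaris) forms.
port_is_listening() {
    awk -v port="$1" '
        /LISTEN/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ ("[.:]" port "$")) found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# Intended use in an init script before launching gmond (8649 is the TCP
# port mentioned in this thread):
#   if netstat -an | port_is_listening 8649; then
#       echo "port 8649 already bound, not starting gmond" >&2
#       exit 1
#   fi
```

This only covers the TCP listener; UDP sockets have no LISTEN state, so a conflict on the UDP channel would need a separate check.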
Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing
On Mon, Nov 30, 2009 at 01:29:34PM +, Daniel Pocock wrote: Carlo Marcelo Arenas Belon wrote: Your call, eventhough a fix for this feature will be probably preferred as there is nothing special about the BSD for them to be affected and it might be that the problem is therefore more generic. It may be that this bug is revealing a more serious issue in the way initialisation is done, so I would prefer to know the real cause rather than just revert the change that forces the problem to show itself. agree and as I said before the reason why I didn't just revert it from trunk or 3.1 as a fix even if it seems to resolve the problem. At least a revert would be needed for 3.1 as this accounts for a regression but haven't done so either waiting for you to first revert it on trunk and then decide on how to proceed from there depending on how critical this feature was for the release. I agree that it is a recession, but reverting it may cause the real culprit to remain hidden. I'd rather hold the release while we look more closely. not sure if I understand what you meant here, since it would be obvious to me that 3.1.5 can't be released if a fix (even if it is just reverting the change) is committed. are you saying you want to hold of on deciding to release or not 3.1.5 or to see what will be in 3.1.6?, if the later I would suggest also pulling some other fixes and of course that would also require for us to agree on a bootstrapping environment for this release at least. The change has been working on Linux, Solaris and Cygwin. 
Other than just doing a manual bisect (using git instead of svn here would had been useful) to find where the problem was introduced and validate that reverting it corrects the problem haven't done much analysis of it, but the fact that it broke in such a strange way (was indeed expecting the culprit to be somewhere else, specially considering all recent changes in the networking and the fact that it seemed originally to be triggered by a TCP request) probably points to a bigger issue which just happens to have not been visible on the configurations used to test Linux, Solaris and Cygwin, specially considering how pervasive it was (broke all BSD I had access to test, at least) Can you provide output from strace/truss and also a stack trace from the point where it is in the infinite loop? filed BUG246 with the trace information (collected from OpenBSD 4.5 amd64) using ktrace, but you got me there. from the way the problem represents itself isn't really obvious were the offending code is and is difficult to debug as well since it dissapears when in debug mode or not running as a daemon, which is the reason why I haven't been able to capture a backtrace yet either. There is a good reason for moving the daemonize code the way I did - an alternative would be to daemonize, but make the original process hang around until the daemon process has entered the main loop. OK, and assume it is probably related to the cases were gmond suddenly dies at startup without notification but some clarification on what was the problem you were trying to solve would be probably usefull too. Carlo -- Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day trial. Simplify your report design, integration and deployment - and focus on what you do best, core application coding. Discover what's new with Crystal Reports now. 
Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing
On Tue, Nov 24, 2009 at 06:03:51PM -0800, Bernard Li wrote: Please help us test on as many OS/archs as possible, as this would go GA quite immediately ;-) FreeBSD is not able to return any XML data through TCP/8649 (tested with FreeBSD 8.0 amd64). DragonFlyBSD fails to build, but a 3.2 version of ganglia which includes fixes for that fails with the same TCP issue as FreeBSD, so this issue might be affecting other BSDs as well. Carlo
Re: [Ganglia-general] multicast source address on Solaris
On Wed, Nov 25, 2009 at 10:57:37AM +, River Tarnell wrote: Carlo Marcelo Arenas Belon: have you added mcast_if to both the udp_send_channel and udp_recv_channel configurations? i only had it for send_channel. i've added it to recv_channel as well, but it seems to make no difference. you need to have it added for both (unless you are routing multicast packets between both interfaces) or you won't be able to see your own metrics. thanks, the attached draft patch should correct the problem, please disregard all instructions about adding routes as I noticed later that I had misunderstood your report and was therefore assuming the problem was you couldn't send the multicast traffic through the interface and not that it was just going with an incorrect source IP. Carlo [attachment: bzip2-compressed patch, binary content omitted]
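For reference, a gmond.conf fragment with mcast_if set on both channels might look like the following. The interface name and multicast group are the ones quoted in this thread; the port, ttl and bind values are assumed defaults and should be checked against the local configuration.

```
udp_send_channel {
  mcast_join = 239.2.11.71
  mcast_if = bge102000   /* interface whose address should be the source */
  port = 8649            /* assumed default */
  ttl = 1                /* assumed */
}

udp_recv_channel {
  mcast_join = 239.2.11.71
  mcast_if = bge102000   /* same interface, needed to see your own metrics */
  port = 8649            /* assumed default */
  bind = 239.2.11.71     /* assumed */
}
```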
Re: [Ganglia-general] multicast source address on Solaris
On Wed, Nov 25, 2009 at 07:45:46AM +, River Tarnell wrote: Carlo Marcelo Arenas Belon: FYI gmond in Solaris 10 zones has problems as tracked by BUG100 : there are no zones configured on the system. these are IP Multipathing interfaces. OK, and the interfaces you are trying to use are physical interfaces? what do you get with : # dladm show-link are you expecting to send the multicast traffic through a specific VLAN? by default, gmond (or Solaris?) seems to choose the first available interface for the source address, which is 172.16.0.129. AFAIK Solaris does by adding a route to 224.0/4 through that interface as shown by `netstat -rn` i see no route for 224/4 on this system:
Routing Table: IPv4
  Destination      Gateway          Flags  Ref  Use       Interface
  ---------------  ---------------  -----  ---  --------  -----------
  default          91.198.174.193   UG     1    92511
  10.24.1.0        10.24.1.24       U      1    6471      bge301000
  10.24.1.0        10.24.1.25       U      1    2294      bge301001
  10.24.1.0        10.24.1.24       U      1    0         bge301000:1
  10.24.1.0        10.24.1.25       U      1    0         bge301000:2
  10.24.1.0        10.24.1.25       U      1    0         bge301001:1
  10.24.1.0        10.24.1.25       U      1    0         bge301001:2
  10.24.1.0        10.24.1.25       U      1    0         bge301001:3
  91.198.174.192   91.198.174.204   U      1    7843      bge102000
  91.198.174.192   91.198.174.208   U      1    893       bge102001
  91.198.174.192   91.198.174.204   U      1    0         bge102000:1
  91.198.174.192   91.198.174.208   U      1    0         bge102000:2
  91.198.174.192   91.198.174.208   U      1    0         bge102001:1
  91.198.174.192   91.198.174.204   U      1    0         bge102001:2
  172.16.0.128     172.16.0.129     U      1    1003      nge0
  172.16.1.0       172.16.1.1       U      1    1020      nge1
  172.16.4.0       172.16.4.1       U      1    742       clprivnet0
  239.2.11.71      91.198.174.204   UGH    1    0
  127.0.0.1        127.0.0.1        UH     1    91055435  lo0
curiously, the route is present on our other (single-interface) Solaris systems.
# route delete -interface 224.0/4 -gateway 172.16.0.129 # route add -interface 224.0/4 -gateway 91.198.174.204 damiana# route delete -interface 224.0/4 -gateway 172.16.0.129 delete net 224.0/4: gateway 172.16.0.129: not in table damiana# route add -interface 224.0/4 -gateway 91.198.174.204 add net 224.0/4: gateway 91.198.174.204 after restarting gmond, it's still sending packets from 172.16.0.129. does it show that is subscribed then in that interface and snoop see the packets coming out from that interface as well? # netstat -g have you added mcast_if to both the udp_send_channel and udp_recv_channel configurations?, if it still doesn't work, could you get the output from the first seconds of : # gmond -d9 and a truss of the same? Carlo -- Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day trial. Simplify your report design, integration and deployment - and focus on what you do best, core application coding. Discover what's new with Crystal Reports now. http://p.sf.net/sfu/bobj-july ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] Ganglia 3.1.5 beta ready for final testing
On Tue, Nov 24, 2009 at 09:05:12PM -0800, Bernard Li wrote: I should clarify, what I meant to say was: The resulting tarball should build fine on Solaris without Python support I don't believe this is a regression from previous release(s) not really; 3.1.4 builds fine with python support in Solaris 10 as shown in the thread you linked to : http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg05098.html the new bootstrapping might also affect Solaris 8 or older releases, which is why I don't think that expediting this release will make much sense, specially considering that there are almost no benefits IMHO. Carlo -- Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day trial. Simplify your report design, integration and deployment - and focus on what you do best, core application coding. Discover what's new with Crystal Reports now. http://p.sf.net/sfu/bobj-july ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] multicast source address on Solaris
On Wed, Nov 25, 2009 at 06:25:37AM +, River Tarnell wrote: i'm running gmond 3.1.5 on a Solaris 10 system with several interfaces: FYI gmond in Solaris 10 zones has problems as tracked by BUG100 : http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=100 by default, gmond (or Solaris?) seems to choose the first available interface for the source address, which is 172.16.0.129. AFAIK Solaris does by adding a route to 224.0/4 through that interface as shown by `netstat -rn` however, i would prefer it to use 91.198.174.204 on bge102000. i first tried adding a route for the multicast destination: damiana# route -p add -host 239.2.11.71 91.198.174.204 but that made no difference, so i tried this instead: damiana# route -p add -host 239.2.11.71 91.198.174.204 -interface what happens if you do instead : # route delete -interface 224.0/4 -gateway 172.16.0.129 # route add -interface 224.0/4 -gateway 91.198.174.204 which still didn't work. i tried adding 'mcast_if = bge102000' to udp_send_channel. this made no difference either. this is a bug, most likely but might be expected if bge102000 is a zone interface instead of real hardware as explained on BUG100. Carlo -- Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day trial. Simplify your report design, integration and deployment - and focus on what you do best, core application coding. Discover what's new with Crystal Reports now. http://p.sf.net/sfu/bobj-july ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.4 beta ready for testing
On Mon, Nov 02, 2009 at 03:05:32PM -0800, Bernard Li wrote: Can you please test this tarball bootstrapped on Fedora 9. It works, but would invalidate all testing that was done for 3.1.3 and the original 3.1.4. If it works I will replace the original tarball with this: http://ganglia.info/testing/bootstrapped_on_fedora9/ganglia-3.1.4.tar.gz -1 Changing the release package in the middle of a release is a bad idea; indeed changing it without bumping the release version goes against our release procedures, as it could result in different binary packages and was the reason why the unofficial package I provided was published far from the ganglia servers to hopefully avoid any confusion and frustration if it was found later that someone finds a bug which happens to be only reproducible in the other version. There is also the risk of introducing a bug (like the one in 3.1.2 from bootstrapping in SuSE with automake 1.9.6 which prevented users that had the 32bit libraries for apr installed on 64bit systems to get a working build) and so as much as I am excited about finally moving to some more modern versions of autotools, this make only sense as part of 3.1.5, and which will hopefully also allow for enough time to remove all needed hacks and finally cleanup the bootstrapping code. Carlo -- Come build with us! The BlackBerry(R) Developer Conference in SF, CA is the only developer event you need to attend this year. Jumpstart your developing skills, take BlackBerry mobile applications to market and stay ahead of the curve. Join us from November 9 - 12, 2009. Register now! http://p.sf.net/sfu/devconference ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.4 beta ready for testing
On Mon, Nov 02, 2009 at 11:09:40PM +, Daniel Pocock wrote: I note Paul is using gcc, whereas I'm building and testing with Sun Studio on the OpenCSW build farm - Sun's compiler is now a free download, and it is used to build all the CSW libraries (including those used by Ganglia), so this is now the easiest solution to support - that, and Solaris 8 support, led me to tweak the configure.in stuff for Solaris - maybe it needs more tweaking to support gcc - would anyone like to comment on the preferred gcc build environment to be supported? IMHO any gcc should work, and indeed gcc was the originally supported compiler for ganglia in Solaris (Sun Studio was added later in 3.1.1 when it was made freely available with OpenSolaris). while working on libmetrics (as can be seen in the corresponding metrics.c file) the following versions of gcc were used (most of them using SUNWtoo and other SUNW provided tools as part of the toolchain when possible) : Solaris 7 x86 (32-bit) with gcc-2.8.1 (this one used GNU binutils AFAIK) Solaris 8 (64-bit) with gcc-3.3.1 Solaris 9 (64-bit) with gcc-3.4.4 Solaris 10 SPARC (64-bit) and x86 (32-bit and 64-bit) with SUNWgcc On the issue of the gcc environment, we basically need a second version of scripts/build-solaris.sh for gcc - this raises questions like should the libraries (apr, confuse) be built with gcc too? Which ld, ar, etc? This is IMHO a packager call after all we don't provide binaries (well we do but almost no one uses them) because as you pointed out the decision on which toolchain to use needs to be made at the distribution or system engineering level and so we are left to support them all the best we can. 
In cases were there is some overlap (like in the case of the CSW packages, where the package maintainers are also upstream contributors) or when it helps to simplify maintenance on a specific platform (like the CentOS 4 RPMs or the Makefile.WiX recipes for Cygwin) then it makes sense to have some additional code to help with it and also some more testing or confidence about the resulting binaries working as expected, but that shouldn't be ever considered as the only supported solution IMHO. Carlo -- Come build with us! The BlackBerry(R) Developer Conference in SF, CA is the only developer event you need to attend this year. Jumpstart your developing skills, take BlackBerry mobile applications to market and stay ahead of the curve. Join us from November 9 - 12, 2009. Register now! http://p.sf.net/sfu/devconference ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] Ganglia 3.1.4 beta ready for testing
On Tue, Oct 27, 2009 at 09:52:52AM +, Paul Sobey wrote: /usr/include/sys/feature_tests.h:336:2: error: #error Compiler or options invalid; UNIX 03 and POSIX.1-2001 applications require the use of c99 make[2]: *** [getopt1.o] Error 1 Googling leads me to try compiling with CFLAGS=-std=gnu99 per: http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=215 this is a bug in the autoconf from CentOS 4, which is used to build the release packages; you can therefore also work around the issue by rebootstrapping the package or making your own with a better version of the autotools. for simplicity I've uploaded an unofficial release package for 3.1.4 bootstrapped on fedora rawhide in : http://sajino.sajinet.com.pe/ganglia/ganglia-3.1.4.tar.gz If I do that, compilation fails building against Python 2.6.2 (built with same toolchain): once you use -std=gnu99 it is no longer the same toolchain, so building python with the same standard support should solve your problem. Carlo
Re: [Ganglia-general] Ganglia 3.1.4 beta ready for testing
On Thu, Oct 29, 2009 at 04:44:59PM +, Paul Sobey wrote: I note from the Makefile Daniel posted: # Depends: some issues exist getting the Python support working on Solaris, # Ganglia's configure.in needs to be further enhanced for this to work I think this is a CSW specific problem, as I had no problem getting python support compiled in Solaris 10u7 x86 using SUNWPython-devel, SUNWgcc, SUNWlexpt and compiled versions of confuse and apr.
$ PATH=$PATH:/usr/sfw/bin:/usr/ccs/bin
$ ./configure CC="gcc -std=gnu99" --prefix=/usr/local --with-libapr=/usr/local/apr/bin/apr-1-config --with-libconfuse=/usr/local
$ make
Daniel, could you elaborate? Carlo
Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.4 beta ready for testing
On Thu, Oct 29, 2009 at 08:42:05PM +, Daniel Pocock wrote: Carlo Marcelo Arenas Belon wrote: On Thu, Oct 29, 2009 at 04:44:59PM +, Paul Sobey wrote: I note from the Makefile Daniel posted: # Depends: some issues exist getting the Python support working on Solaris, # Ganglia's configure.in needs to be further enhanced for this to work Daniel, could you elaborate? Although I have described the Python module in the CSW Makefile, it is not something I have properly tested. OK and I haven't done any testing either, other than making sure it builds and that a mod_example like module can be loaded, but my question was more about the need to change configure.in to support python modules which you were referring about in the Makefile as Paul noted. I am still working through some core agent problems (e.g. see the discussion on csw-maintainers about building a 64 bit version of everything: I've noticed that when running a 32 bit binary on some 64 bit machines with lot's of RAM, some kstat calls lead to a seg fault) care to provide a link to the thread or any bug reports?, earlier releases for 3.0 required 64bit binaries as they were reading kernel memory directly to gather the statistics, but after those metrics were migrated to kstat that shouldn't be an issue anymore, and I am running some 32-bit 3.0 agent with solaris sparc with significant amount of memory as well, so there might be a regression to track here. Carlo -- Come build with us! The BlackBerry(R) Developer Conference in SF, CA is the only developer event you need to attend this year. Jumpstart your developing skills, take BlackBerry mobile applications to market and stay ahead of the curve. Join us from November 9 - 12, 2009. Register now! http://p.sf.net/sfu/devconference ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] Ganglia 3.1.4 beta ready for testing
On Thu, Oct 29, 2009 at 01:10:14PM -0700, Bernard Li wrote: On Thu, Oct 29, 2009 at 12:01 PM, Carlo Marcelo Arenas Belon care...@sajinet.com.pe wrote: this is a bug on the autoconf from CentOS 4 which is used to build the release packages, therefore you can also workaround the issue by rebootstrapping the package or making your own with a better version of the autotools. for simplicity I'd uploaded an unofficial release package for 3.1.4 bootstrapped on fedora rawhide in : http://sajino.sajinet.com.pe/ganglia/ganglia-3.1.4.tar.gz Do you have a link for the bug, and are you aware whether there are updates for CentOS 4 to fix the issue? I am not aware of a CentOS or RHEL bug report, but considering that EL4 is in maintenance mode there won't be a fix anyway (autoconf 2.59 was released in 2003 and the last update to the package was in 2004). I guess I could start building on CentOS 5, provided that the autoconf does not have this bug. CentOS 5 also uses autoconf 2.59, so it wouldn't help with this problem, but it might allow us to remove all the kludges that were added to work around the libtool 1.5.6 bugs which were preventing DragonFlyBSD support. Ideally, which platform is used to bootstrap shouldn't be relevant, and IMHO we should instead be aiming for the latest versions of the autotools (either installed by hand or provided as part of the distribution if it is more development focused); on Linux that usually means Fedora, Gentoo or Debian. Carlo
Re: [Ganglia-general] Ganglia 3.1.4 beta ready for testing
On Mon, Oct 26, 2009 at 04:51:33PM -0700, Bernard Li wrote: Ganglia 3.1.4 is ready for testing at: http://ganglia.info/testing/ DragonFlyBSD fails to build (tested with 2.4.0 32bit). not a regression (a system header problem which also affects 3.1.2) and there are some trivial unrelated changes in trunk which could help with that. Carlo -- Come build with us! The BlackBerry(R) Developer Conference in SF, CA is the only developer event you need to attend this year. Jumpstart your developing skills, take BlackBerry mobile applications to market and stay ahead of the curve. Join us from November 9 - 12, 2009. Register now! http://p.sf.net/sfu/devconference ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] in which file do gmetad writes information regarding total cpu load?can it be fetched or read?
On Mon, Sep 07, 2009 at 12:19:26AM +0530, pankaj dorlikar wrote: each metric (cpu_load included) is stored by gmetad in an rrd file which can be manipulated further using rrdtool. there is one directory per monitored node, plus one additional directory per summarization domain, all inside the directory that rrd_rootdir in gmetad.conf points to. Carlo
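As a rough illustration of the layout described above, the sketch below builds the path of one metric's rrd file so it can be handed to rrdtool. The rootdir/cluster/host/metric.rrd nesting, the example root directory and the cluster, host and metric names are assumptions about a typical install rather than something stated in this thread; rrd_path is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: build the path of one metric's rrd file.
#   $1 = rrd_rootdir from gmetad.conf, $2 = cluster, $3 = host, $4 = metric
# The <rootdir>/<cluster>/<host>/<metric>.rrd layout is an assumption
# about a typical gmetad install; verify it against your own rrd_rootdir.
rrd_path() {
    printf '%s/%s/%s/%s.rrd\n' "$1" "$2" "$3" "$4"
}

# Example (paths and names are placeholders; requires rrdtool):
#   rrdtool fetch "$(rrd_path /var/lib/ganglia/rrds mycluster node1 load_one)" \
#       AVERAGE --start -1h
```

Summarized data would live in the extra per-domain directory mentioned above (commonly named __SummaryInfo__, though that name should be confirmed locally).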
Re: [Ganglia-general] ganglia shows cluster down
On Tue, Jul 28, 2009 at 10:11:33AM +0200, Igor Rosenberg wrote: I'd been witnessing very long updates when adding/restarting nodes, and wondered why sometimes it took like 15 minutes for the infrastructure to reflect the changes. it should be closer to 20 min as mentioned in the release notes (important notes) http://ganglia.wiki.sourceforge.net/ganglia_release_notes I'm using the ganglia-3.1.2 release (on Debian 4.0 machines). Did the (possible) fix you mention reach that version? there is no fix AFAIK; if you are using unicast, setting send_metadata_interval is the only available workaround. in either case, and unless the protocol is changed in a probably incompatible way (hence not to be seen in 3.1.x), restarting gmond in the right order (and all of them per cluster) is the only available way to quickly update infrequently polled metrics in the 3.1 branch. Carlo
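For unicast setups, the send_metadata_interval workaround mentioned above is a one-line change in the globals section of gmond.conf. The value of 30 seconds below is only an example; pick the longest recovery delay you can tolerate.

```
globals {
  /* example value in seconds; 0, the default, sends metadata only at
     startup and on request */
  send_metadata_interval = 30
}
```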
Re: [Ganglia-general] embedded metric vs. libganglia?
On Wed, May 20, 2009 at 02:41:49PM -0700, Christopher Smith wrote: The Java one is actually pretty nice now, but I just didn't see why the C one would even be implemented in the first place given that the main project *already has* what is needed. My guess is that it was implemented as a way to isolate the users from API/ABI changes in libganglia since there is currently no public stable interface for it. practically speaking though the interface is pretty much not changing all that much anyway (3.0.7 to 3.1.0 was the only one I can recall). So to clarify: I shouldn't have any problems using libganglia to add metrics to my apps right? as far as you are OK on linking with libganglia and correcting any changes that might be needed if the API changes IMHO, and which might as well apply to the alternatives. Carlo -- Register Now for Creativity and Technology (CaT), June 3rd, NYC. CaT is a gathering of tech-side developers brand creativity professionals. Meet the minds behind Google Creative Lab, Visual Complexity, Processing, iPhoneDevCamp asthey present alongside digital heavyweights like Barbarian Group, R/GA, Big Spaceship. http://www.creativitycat.com ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general
Re: [Ganglia-general] Ganglia 3.1.2 -- Modules fail to load intermittently, Hosts disappear from the web interface, XML errors that can't be found
On Thu, May 14, 2009 at 11:47:41AM -0500, Adam Tygart wrote: I have been having a heck of a time diagnosing this problem. I suspect there are several problems here, which OS and architecture? I recently updated to ganglia-3.1.2 from 3.0.7. 3.1 and 3.0 are not compatible and can't be on the same cluster, so for this upgrade to be successful you should have done :
1) upgrade your gmetad/web to 3.1.2
2) upgrade all gmond to 3.1.2, cluster by cluster in batches
more details to be found in : http://ganglia.wiki.sourceforge.net/ganglia_release_notes Since then I have been plagued with (what looked like) data errors, mis-reporting swap usage was the easiest to see. could you elaborate here?, is the value that gmond is collecting on each node incorrect?, is the aggregated value in gmetad incorrect?, which one of the swap metrics is incorrect?
# uname -a
Linux dell 2.6.28-gentoo-r5 #1 SMP Thu Apr 23 21:35:08 PDT 2009 x86_64 Intel(R) Core(TM)2 CPU 6320 @ 1.86GHz GenuineIntel GNU/Linux
# gmond --version
gmond 3.1.2
# telnet 127.0.0.1 8649 | grep swap
<METRIC NAME="swap_total" VAL="4008176" TYPE="float" UNITS="KB" TN="60" TMAX="1200" DMAX="0" SLOPE="zero">
<EXTRA_ELEMENT NAME="DESC" VAL="Total amount of swap space displayed in KBs"/>
Connection closed by foreign host.
<METRIC NAME="swap_free" VAL="4008176" TYPE="float" UNITS="KB" TN="60" TMAX="180" DMAX="0" SLOPE="both">
<EXTRA_ELEMENT NAME="DESC" VAL="Amount of available swap memory"/>
# free | grep Swap
Swap: 4008176 0 4008176
This seems to be caused by some reporting modules failing to load. They fail silently, I don't see logs about it anywhere, and when I turn debugging on I still don't see anything. AFAIK if a module fails to load because of an error it will just prevent gmond from starting at all (sometimes silently) as detailed in the Known Issues. if the module is not loaded but is still referred to by the configuration for collecting, it will also be very noisy about it :
# /etc/init.d/gmond start
* Starting GANGLIA gmond: ...
Cannot locate internal module structure 'mem_module' in file (null): /usr/sbin/gmond: undefined symbol: mem_module Possibly an incorrect module language designation [(null)]. [ ok ] # tail /var/log/syslog | grep gmond May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'mem_total'. Possible that the module has not been loaded. May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'swap_total'. Possible that the module has not been loaded. May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'mem_free'. Possible that the module has not been loaded. May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'mem_shared'. Possible that the module has not been loaded. May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'mem_buffers'. Possible that the module has not been loaded. May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'mem_cached'. Possible that the module has not been loaded. May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'swap_free'. Possible that the module has not been loaded. what makes you think the module is not being loaded?, and that is being silent about that?, does it show in? : # lsof -p `pidof gmond` | grep ganglia Usually it is one of the modules, but I have had two occasionally happen at the same time. modmem.so and modnet.so are the two to most commonly fail. what is observed when they fail to load? I have restarted with a new gmond configuration, changing only the configuration of multicast to unicast, and this problem persists. this might have introduced another problem, for unicast to work somehow reliably you need to add a value for send_metadata_interval. I have wiped my old rrd data. I have tried everything I know that could even remotely be to blame for this problem. 
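To compare what gmond actually reports with what the node sees, the telnet check used earlier in this message can be narrowed down to just metric names and values. The sketch below assumes gmond's XML puts each METRIC element on its own line with double-quoted NAME and VAL attributes in that order; metric_values is a hypothetical helper name, and the pattern argument must not contain regex metacharacters or slashes.

```shell
#!/bin/sh
# Hypothetical helper: read gmond XML on stdin and print "name value" for
# every METRIC element whose NAME contains the substring given as $1.
# Assumes one <METRIC NAME="..." VAL="..." ...> element per line.
metric_values() {
    sed -n 's/.*<METRIC NAME="\([^"]*'"$1"'[^"]*\)" VAL="\([^"]*\)".*/\1 \2/p'
}

# Example against a local gmond (port 8649 as used in this thread):
#   telnet 127.0.0.1 8649 | metric_values swap
```

If a metric such as swap_total is missing from this output on the collector while it is present on the node itself, that points at the metadata problem discussed here rather than at a module failing to load.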
The question I have is this: is this a known bug? some are, like the unicast send_metadata_interval or the cpu_count inconsistency as shown by the Important Notes, some others might not be Is there something else I should try? rollback to 3.0, specially if you don't need the modules but want a more stable setup. Can I force a module to be loaded? no, but a module should never fail to load silently AFAIK When the modules do load, hosts report to gmond, and gmeta grabs that data and logs it. My webserver then serves up the data through the ganglia interface. The problem I am having here is that I get intermittent xml errors, mostly saying that there is a missing on line $SomeLineNumber (always changes). Happens every 15 minutes or so. I cannot reproduce any problems with the xml, however. I ran xmllint on the xml 1 per second for an hour with no errors, during which time the web interface
Re: [Ganglia-general] Ganglia 3.1.2 -- Modules fail to load intermittently, Hosts disappear from the web interface, XML errors that can't be found
On Fri, May 15, 2009 at 08:42:33AM -0500, Adam Tygart wrote:
> On Fri, May 15, 2009 at 04:32, Carlo Marcelo Arenas Belon care...@sajinet.com.pe wrote:
> > On Thu, May 14, 2009 at 11:47:41AM -0500, Adam Tygart wrote:
> > > Since then I have been plagued with (what looked like) data errors, mis-reporting swap usage was the easiest to see.
> >
> > could you elaborate here?, is the value that gmond is collecting on each node incorrect?, is the aggregated in gmetad incorrect?, which one of the swap metrics is incorrect?
>
> Aggregate swap data being incorrect is the easiest to see. Here is the graph from a mis-reporting host (it doesn't always even send this information):
> http://imgur.com/io8gu.png
> Here is the resulting aggregate graph:
> http://imgur.com/trato.png
> The beginning of this graph is showing the correct data, I simply restarted gmond (on all non-webserver hosts), and the resulting swap usage was from one of them failing to send the correct data.

OK, so the metric value is not incorrect; it is not being reported at all, which is why you have dips in your graph that fix themselves after several minutes.

This is sadly a known issue, caused by the way gmond registers metrics dynamically and the fact that some of those metrics aren't refreshed that frequently, as described in the Release Notes (which mention the very visible CPU count issue as an example). For more details, look at the discussion in:
http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg04275.html

Even though I agree it is a bug, it doesn't have a solution yet, and it is not seen unless a gmond (any of them) is restarted. A workaround is available: if you have to restart a gmond, first restart its collector (the one that gmetad is looking at, and that the rest are pointing to when using unicast), and restart ALL other gmond in the cluster after that.

> > > The question I have is this: is this a known bug?
> >
> > some are, like the unicast send_metadata_interval or the cpu_count inconsistency as shown by the Important Notes, some others might not be
>
> I haven't been able to find the Important Notes document, is there a link to this somewhere?

Sadly it is buried at the bottom of the Release Notes now:
http://ganglia.wiki.sourceforge.net/ganglia_release_notes
and yes, I agree it should be moved to a better place as well.

> Is the cpu_count inconsistency the piece I mentioned about hosts disappearing from the web interface?

Most likely the host disappearing from the web interface is because of send_metadata_interval and you trying to restart the gmond to fix it. If it is not, then we have a new bug ;)

> > > Is there something else I should try?
> >
> > rollback to 3.0, specially if you don't need the modules but want a more stable setup.
>
> This being Gentoo, I have no easy way of rolling back, as the 3.0.x builds have been removed from their tree.

OK. IMHO having ganglia 3.0 in their tree as well, in a different slot, might be a good idea, but sadly I haven't filed it as a bug yet, nor can I provide a working ebuild in a public overlay as a solution. Of course, you can still build your own binaries/packages if needed.

3.0 is still under development, with 3.0.8 going to be released sometime soon and future releases focusing mainly on stability and compatibility with 3.1, as well as on supporting all other architectures that are not yet working in 3.1.

> The whole reason I upgraded was because I wanted to make use of the python module support. I was previously using gmetric for monitoring things like PBS job count and temperature on my nodes. After a week or two of those scripts running, the load average on the systems started to climb. After a month, the load average increase caused by gmetrics was 2-4 per host. A full 10% of my cluster's CPU utilization was caused by gmetrics alone (all system cpu).

Most likely that is the spawn/fork cost, combined with the fact that they were run too frequently. 3.1 modules might be a good solution for that, but if the metric collection is expensive anyway (and I would assume it is, as I have never seen that much consumption from my own gmetric scripts, which are executed every second), then you are not going to solve the problem by just moving that expensive operation into gmond.

Carlo
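The send_metadata_interval advice in this thread is easy to audit on each node. What follows is only a minimal sketch, assuming the setting appears on one line as `send_metadata_interval = N` and that the configuration lives in /etc/ganglia/gmond.conf; the function name is made up for illustration and is not part of Ganglia.

```shell
# Print the send_metadata_interval value found in a gmond.conf,
# or "unset" when the file does not set it.
# Hypothetical helper: the name and the one-line-per-setting
# assumption are illustrative only.
metadata_interval() {
    conf=$1
    value=$(sed -n 's/^[[:space:]]*send_metadata_interval[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$conf" | head -n 1)
    echo "${value:-unset}"
}

# Usage (path is an assumption):
#   metadata_interval /etc/ganglia/gmond.conf
```

A node in a unicast cluster that reports "unset" is a candidate for the disappearing-host behaviour discussed above; which value to choose is the "minimum you can tolerate" trade-off from the release notes.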
Re: [Ganglia-general] Unable to write root epilog / missing metric information
On Fri, May 08, 2009 at 12:15:19PM -0700, Jeff Orr wrote:
> Carlo Marcelo Arenas Belon wrote:
> > On Thu, May 07, 2009 at 03:42:44PM -0700, Jeff Orr wrote:
> > > Upon the advice of Carlo, I updated the machine to Ganglia 3.0.7-1 with RRDtool rrdtool-1.2.30-1.rh9.rf (the furthest I could take FC5 without horrible dependency chasing). No dice. The CGI page renders correctly for about 1 minute after gmetad stop+start. Then blank page plus Root epilog error again.
> >
> > which CGI page?, any interesting messages in the error log for your webserver?
>
> This is the page provided by ganglia-web... /ganglia/index.php version 3.0.7-1

And all cluster-specific pages work fine then? Is it only the grid one that is broken? Is there any error in your web (apache) logs?

> outputs server_thread() received request /?filter=summary from 127.0.0.1 -

You said it works fine for a while and then breaks, right? When it breaks, can the same query be used to reproduce the issue?

$ echo "/?filter=summary" | netcat localhost 8652

> > could you check the ouput of the gmetad request that is failing to see why/where it is getting aborted?, could it be there is some invalid character in the XML?,
>
> Unlikely, but unknown at this point. Firefox renders the XML all right, but that doesn't rule out illegal UTF-8 characters.

To validate the XML, send the output of the previous netcat command to a file and run:

$ xmllint output.xml

It should return 0 (echo $?) if the XML is valid; if not, it will pinpoint the source of the problem and hopefully give us enough information to get a workaround and a fix.

Carlo
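xmllint remains the authoritative check suggested above. As a rough complement, and only a sketch under the assumption that stray control or non-ASCII bytes are what breaks the parser, the captured output can be scanned for suspicious lines. The function name is hypothetical, and legitimate UTF-8 text outside ASCII would be flagged too.

```shell
# List the line numbers of a captured gmetad XML dump that contain
# bytes outside printable ASCII and whitespace.
# Heuristic only; it does not validate XML.
suspect_lines() {
    LC_ALL=C grep -n '[^[:print:][:space:]]' "$1" | cut -d: -f1
}

# Usage, assuming the dump was saved as output.xml:
#   suspect_lines output.xml
```

Any line number printed here is a good place to start reading when xmllint reports an error nearby.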
Re: [Ganglia-general] Help: libconfuse problem
On Fri, May 08, 2009 at 09:08:10PM +0800, Lee Amy wrote:
> Anyone have good idea on such problems?

Use the binaries from:
http://sajino.sajinet.com.pe/ganglia/epel/EL-5/i386/
which are ready-to-use packages rebuilt from Fedora for Red Hat/CentOS 5 on x86.

Carlo
Re: [Ganglia-general] 3.1.2 compile fails (3.1.1 compile o.k.)
On Fri, May 08, 2009 at 10:30:31AM -0400, Justin R. Davis wrote:
> OK, I finally figured out how to get 3.1.2 to compile. If I use the aclocal.m4 file from 3.1.1 in my 3.1.2 unmodified source, I am able to compile and run ganglia successfully.

I was afraid this might be the problem based on your original report, since, as I mentioned originally, there were no changes between 3.1.1 and 3.1.2 in the build code that could explain this. 3.1.2 was, however, bootstrapped with newer/different versions of the autotools than the ones normally used for releases (those provided by CentOS 4), and that is where this aclocal.m4 comes from.

Another workaround (and the one I was testing on CentOS 5 x86_64) was to remove the 32-bit versions of the libraries which conflicted:

# rpm -ev expat-devel.i386
# rpm -ev apr-devel.i386

> This made me wonder...Isn't this file supposed to be generated? So I deleted the aclocal.m4 file that is currently distributed with 3.1.2 and tried building again...and voila! it works fine. When completed, I diff'ed the aclocal.m4 file which was created and it is now identical to the 3.1.1 version.

For this to work you need to have automake installed, and it will basically force a rebootstrapping of the package using it (hence it also requires libtool and the other packages needed for bootstrapping, like autoconf). Instead of removing aclocal.m4, you can explicitly rebootstrap by running:

# autoreconf

> So if there are any developers out there...you might want to look into this...

Noted, by adding it to the release notes:
http://ganglia.wiki.sourceforge.net/ganglia_release_notes
It won't be a problem for 3.1.3; promise.

Carlo
Re: [Ganglia-general] 3.1.2 compile fails (3.1.1 compile o.k.)
On Thu, May 07, 2009 at 10:23:22AM -0400, Justin R. Davis wrote:
> When I try and compile 3.1.2 I run into a problem linking with the expat library.

Do you have both a 32-bit expat (/usr/lib) and a 64-bit expat (/usr/lib64)? Which distribution/version?

> Ganglia's libtool want to use the version in /usr/lib although I believe it should be using the version in /usr/lib64

That is the problem. Ganglia's configure is sadly not that smart, as it only started using external library dependencies with 3.1.0, and for whatever reason your setup is getting it confused.

> If I try and compile 3.1.1 on the same computer, using the same configure arguments, it builds fine:

This is therefore a regression, at least for your configuration. To investigate it, the output of your config.log might be needed, so feel free to send it directly (as it might be too big for the list), or attach it to a bug report with all other relevant information.

Since AFAIK there shouldn't have been any changes between 3.1.1 and 3.1.2 that could trigger this, it would also be interesting to see what the working 3.1.1 log shows. If you could extract the relevant information for this thread, even better, as that would help other people who later hit the same problem.

> Any idea how this issue can be fixed in the 3.1.2 version?

As a workaround you might be able to use:

./configure --libdir=/usr/lib64

Carlo
Re: [Ganglia-general] Help: libconfuse problem
On Thu, May 07, 2009 at 10:52:35PM +0800, Lee Amy wrote:
> /usr/bin/ld: /usr/local/lib/libconfuse.a(confuse.o): relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC
> /usr/local/lib/libconfuse.a: could not read symbols: Bad value

You are trying to build a non-static version of ganglia linked against a statically compiled libconfuse. You could either:

1) rebuild your libconfuse as a dynamic library (off by default)

   confuse $ ./configure --enable-shared

   EPEL packages are dynamic and AFAIK could be used in your setup instead of a custom-built library as well:
   http://download.fedora.redhat.com/pub/epel/5/i386/repoview/libconfuse.html

2) build ganglia statically (this would also build all modules statically):

   ganglia $ ./configure --enable-static-build

Carlo
Re: [Ganglia-general] Unable to write root epilog / missing metric information
On Thu, May 07, 2009 at 10:43:46AM -0700, Jeff Orr wrote:
> Carlo Marcelo Arenas Belon wrote:
> > I'd suggest you upgrade to 3.0.7 compiled against a newer version of rrdtool to see if the problem goes away as well for you and if not we might need to reopen that bug and do more investigation on the source of it
> >
> > in any case you want to upgrade your gmetad, since it has a vulnerability as explained in :
> > http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=223
> > so don't forget to also apply the following patch on top of it :
> > http://bugzilla.ganglia.info/cgi-bin/bugzilla/attachment.cgi?id=189
>
> So, should I patch 3.0.7 and upgrade, or is the upgrade to 3.0.7 sufficient?

Upgrade to 3.0.7 and apply the patch. The fix will be included in 3.0.8 when it is released and is also included (sorta) in 3.1.2.

> > 3.0 and 3.1 are not compatible at the network/configuration layer and so you can't upgrade gmond to 3.1 if you are still using 3.0 in the same cluster and if you update all the gmond in a cluster to 3.1 you have to use a new configuration as described in the release notes.
> >
> > since gmond wasn't your problem here, I'd suggest you better rollback this update for now, and focus in the gmetad problem.
>
> It was my understanding that 3.1.x and 3.0.x could communicate, i.e. a 3.0.x cluster could talk to a master running 3.1.x. I distinctly remember it on the main page for configuring 3.1. Oh well...

Yes, a 3.1 gmetad can talk to a 3.0 gmond or a 3.1 gmond, as long as they are not part of the same cluster. My suggestion was just meant to simplify your setup and reduce the number of changes, so that you could get a better grip on the problem.

> I'll try 3.0.7 with new RRDtool and see how it goes.

3.0 links RRDtool statically, so you don't really need an RRDtool package installed, only the static library (built from source and installed) on the machine where you are building your new gmetad.

Carlo
Re: [Ganglia-general] Unable to write root epilog / missing metric information
On Thu, May 07, 2009 at 03:42:44PM -0700, Jeff Orr wrote:
> Upon the advice of Carlo, I updated the machine to Ganglia 3.0.7-1 with RRDtool rrdtool-1.2.30-1.rh9.rf (the furthest I could take FC5 without horrible dependency chasing). No dice. The CGI page renders correctly for about 1 minute after gmetad stop+start. Then blank page plus Root epilog error again.

Which CGI page? Are there any interesting messages in the error log of your webserver?

Could you check the output of the gmetad request that is failing, to see why/where it is getting aborted? Could it be that there is some invalid character in the XML? Do you have anything other than the web frontend talking to this gmetad? If the frontend is stopped, do any of the following commands reproduce the problem?

$ telnet localhost 8651
$ echo / | netcat localhost 8652

Carlo
Re: [Ganglia-general] Unable to write root epilog / missing metric information
On Wed, May 06, 2009 at 03:38:34PM -0700, Jeff Orr wrote:
> We were using 3.0.4.fc5, so I tried updating to 3.1.2 to see if it fixes the problem.

This was reported before as a bug in 3.0, which magically went away somewhere in the last versions of 3.0 (it might also have been because of using newer versions of rrdtool):
http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=42

I'd suggest you upgrade to 3.0.7 compiled against a newer version of rrdtool to see if the problem goes away for you as well. If not, we might need to reopen that bug and investigate its source further.

In any case you want to upgrade your gmetad, since it has a vulnerability, as explained in:
http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=223
so don't forget to also apply the following patch on top of it:
http://bugzilla.ganglia.info/cgi-bin/bugzilla/attachment.cgi?id=189

> Now in addition, we are getting these messages on gmond startup

3.0 and 3.1 are not compatible at the network/configuration layer, so you can't upgrade a gmond to 3.1 while you are still using 3.0 in the same cluster, and if you update all the gmond in a cluster to 3.1 you have to use a new configuration, as described in the release notes.

Since gmond wasn't your problem here, I'd suggest you roll back this update for now and focus on the gmetad problem.

Carlo
Re: [Ganglia-general] Is there any documentantion about libganglia ??
On Tue, May 05, 2009 at 11:45:15AM -0300, ricardo figueiredo wrote:
> I would like know if there is some documentation about libganglia.

No, and it doesn't have a stable ABI yet either.

Carlo
Re: [Ganglia-general] gmond won't start at boot
On Sun, May 03, 2009 at 11:43:33PM -0300, David Chinellato wrote:
> should, of course. I looked at /var/log/daemon.log and found the following error message related to gmond:
> /usr/sbin/gmond[1956]: Error creating multicast server mcast_join=239.2.11.71 port=8649 mcast_if=NULL family='inet4'. Exiting.

The problem here is that gmond is failing to start because it can't bind to the network, since it was started before the network was up. Adding this to the INIT INFO section of your init script:

# Required-Start: $network

should add the dependency that upstart needs in order to know when to execute it.

Carlo

PS. Untested, as I have no jaunty available, but that is the syntax that IMHO is needed in hardy. For more information check with Ubuntu, as this is an Ubuntu-specific problem.
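Whether an installed init script already declares that dependency can be checked before editing it. A minimal sketch, assuming the script carries an LSB-style INIT INFO comment header and that the path is something like /etc/init.d/gmond; the function name is invented for this example.

```shell
# Succeed (exit status 0) when an init script lists $network on its
# "# Required-Start:" header line, fail otherwise.
# Hypothetical helper, not shipped with Ganglia or Ubuntu.
has_network_dep() {
    grep -Eq '^#[[:space:]]*Required-Start:.*\$network' "$1"
}

# Usage (path is an assumption):
#   has_network_dep /etc/init.d/gmond || echo "add \$network to Required-Start"
```

This only inspects the header text; whether the boot system honours it is the distribution-specific part mentioned in the PS above.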
Re: [Ganglia-general] gmond won't start at boot
On Mon, May 04, 2009 at 03:50:12AM +, Carlo Marcelo Arenas Belon wrote:
> PS. untested as I have no jaunty available, but that is the syntax that IMHO is needed in hardy, for more information check with Ubuntu, as this is an Ubuntu specific problem

I spun up a jaunty amd64 VM and rebuilt the Debian experimental package for 3.1.2 to see if the problem you reported was also happening with it. It seems to be working just as expected, at least for gmond (using DHCP in the network though).

If interested, binary packages are available from:
http://sajino.sajinet.com.pe/ganglia/jaunty/amd64/
and the package for Debian experimental (linked from our release page AFAIK) is at:
http://packages.debian.org/source/experimental/ganglia

Carlo
Re: [Ganglia-general] gmetad : Permission denied to rrd_rootdir
On Mon, Apr 13, 2009 at 04:00:05PM -0400, Jorge Medina wrote:
> rrd_rootdir /home/developer/opt/ganglia/var/rrds
> and I am getting the following error:
> Going to run as user nobody
> Please make sure that /home/developer/opt/ganglia/var/rrds exists: Permission denied
> Who should own the directory and/or what permissions need to be available in that directory?

It has to be owned by nobody, and that user has to be able to access it (+x) and write to it (+w), so that it can create a directory hierarchy for your clusters/nodes/metrics.

Remember that to be able to access a directory you also have to be able to access all the directories above it, starting from /, and that will basically require the home directory for developer to be publicly accessible in most cases.

> Currently it is owned by my user developer, I tried giving rwx permission to everyone, but I still get the same message.

If you want to run gmetad as user developer, then you have to start gmetad as that user and disable setuid in /etc/ganglia/gmetad.conf:

setuid off

Carlo
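The point about parent directories is the usual cause of this error and can be checked mechanically. The following is a minimal sketch, assuming an absolute path and a POSIX `ls -ld` whose tenth output character is the "others" execute bit; the function name is made up, and the check ignores ACLs and group membership, so treat its output as a hint.

```shell
# Walk from the given directory up to / and print every directory
# that users other than the owner and group cannot traverse
# (no "x" or sticky "t" in the others field).
# Hypothetical helper for illustration only.
blocked_dirs() {
    dir=$1
    while [ "$dir" != "/" ] && [ "$dir" != "." ]; do
        others_x=$(ls -ld "$dir" | cut -c10)
        case $others_x in
            x|t) ;;               # traversable by others
            *) echo "$dir" ;;     # would block a user such as nobody
        esac
        dir=$(dirname "$dir")
    done
}

# Usage with the path from this thread:
#   blocked_dirs /home/developer/opt/ganglia/var/rrds
```

If /home/developer shows up in the output, giving rwx on the rrds directory itself will not help, which matches the behaviour reported above.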
Re: [Ganglia-general] leopard launchd plist
On Mon, Apr 06, 2009 at 01:43:40PM -0400, Evans, Ryan E wrote:
> If anyone is interested I have working plist files for starting both gmond and gmetad using launchd under OSX leopard (10.5.6).

Do you have a working build for Leopard? Which version of ganglia are you using? AFAIK 10.5 was broken because there was no longer a public header for kvm, as shown in:
http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=168

> New to the list and not sure if this has been covered already.

Not a MacOSX person myself, but I presume they are just test files, and if that is the case they could be added to the contrib directory with some instructions.

Carlo
Re: [Ganglia-general] Building Ganglia on Solaris 10 (amd64)
On Mon, Apr 06, 2009 at 06:02:45PM -0400, Jorge Medina wrote:
> At this point, I get the following errors:
> gcc -DHAVE_CONFIG_H -I. -I. -I.. -m64 -O3 -std=c99 -I/home/sdkadmin/opt/ganglia/apr/include/apr-1 -I.. -I../../lib -I../../include -m64 -O3 -std=c99 -I/home/sdkadmin/opt/ganglia/apr/include/apr-1 -Wall -DHAVE_STRERROR -MT metrics.lo -MD -MP -MF .deps/metrics.Tpo -c metrics.c -fPIC -DPIC -o .libs/metrics.o
> In file included from /usr/include/procfs.h:26, from metrics.c:62:
> /usr/include/sys/procfs.h:111: error: field `pr_action' has incomplete type
> /usr/include/sys/procfs.h:112: error: syntax error before stack_t
> /usr/include/sys/procfs.h:130: error: syntax error before '}' token
> /usr/include/sys/procfs.h:164: error: syntax error before lwpstatus_t

This is a bug in the Solaris headers, triggered by -std=c99 while building libmetrics. As a workaround you can build metrics.c without that flag and then use the result to generate the static library that will be linked with gmond later.

Carlo
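One way to apply that workaround is to re-run the failing compile command by hand with the flag removed. The helper below is only a sketch of the flag-stripping step; its name is hypothetical, and how the resulting flags are passed back to the build (by hand or through make variables) is left open because the thread does not say.

```shell
# Remove every -std=c99 occurrence from a flag string and tidy up
# the leftover whitespace. Illustrative helper, not part of the
# Ganglia build system.
strip_std_c99() {
    printf '%s\n' "$1" |
        sed -e 's/-std=c99//g' -e 's/  */ /g' -e 's/^ //' -e 's/ $//'
}

# Usage:
#   strip_std_c99 "-m64 -O3 -std=c99 -Wall"
```

The compile line quoted above carries the flag twice, which is why the substitution is global.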
Re: [Ganglia-general] multiple gmond daemons on one host
On Fri, Jan 09, 2009 at 08:00:13PM -0500, Jesse Alvarez wrote:
> Good question. My understanding is that you need one gmond as a collector per cluster.

One, or more if you need redundancy, but that gmond is in no way special (other than the fact that it listens on TCP 8649 by default), so there is no need to have a collector server running that gmond when any other gmond already running on your nodes could do it.

Carlo
Re: [Ganglia-general] Setting up Ganglia
On Fri, Nov 28, 2008 at 10:59:41AM +0200, Johann Spies wrote:
> /usr/local/sbin/gmond -t /etc/ganglia/gmond.conf

I presume you restarted gmond after changing the configuration.

> sudo gstat -a
> CLUSTER INFORMATION
>        Name: unspecified
>       Hosts: 1
> Gexec Hosts: 0
>  Dead Hosts: 0
>   Localtime: Fri Nov 28 10:55:13 2008
>
> CLUSTER HOSTS
> Hostname LOAD CPU Gexec
>  CPUs (Procs/Total) [ 1, 5, 15min] [ User, Nice, System, Idle, Wio]
>
> head001.sun.ac.za
>  4 (1/ 293) [ 0.94, 0.56, 0.76] [ 24.9, 0.0, 0.7, 73.2, 1.3] OFF
>
> With no information from comp001-comp021 although gmond is running on each one of them.

If iptables is off on all of them and they are all (including head001) in the same network segment, then you have a network problem.

# tcpdump host 239.2.11.71

should show that you are getting multicast messages from all the nodes, but it probably only shows packets from head001 instead.

> The reason for my previous mail was that the person described a solution in his/her situation using monocasting and not multicasting. I suspect multicasting is not working on my system.

To change to unicast, change the configuration from:

/* Feel free to specify as many udp_send_channels as you like. Gmond
   used to only support having a single channel */
udp_send_channel {
  mcast_join = 239.2.11.71
  port = 8649
  ttl = 1
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
}

into (assuming all your nodes can resolve that name; otherwise use the IP):

udp_send_channel {
  host = head001.sun.ac.za
  port = 8649
}

udp_recv_channel {
  port = 8649
}

Then restart all your gmond with the new configuration (I presume iptables is disabled on all the other nodes as well).

Carlo
Re: [Ganglia-general] gmetric fails when disk is unwriteable?
On Wed, Nov 26, 2008 at 08:40:55AM -0700, Brad Nicholes wrote:
> On 11/26/2008 at 3:45 AM, in message [EMAIL PROTECTED], Martin Knoblauch [EMAIL PROTECTED] wrote:
> > From: Brad Nicholes [EMAIL PROTECTED]
> > > On 11/25/2008 at 10:14 AM, in message [EMAIL PROTECTED], Ofer Inbar wrote:
> > > > Brad Nicholes wrote:
> > > > > It needs a temp directory to get around some issues with libconfuse. Libconfuse doesn't actually support wildcard paths or files. A libconfuse include statement must have a full path to the file that it is going include. So gmond makes up for this problem by creating a temp file, resolving all of the file paths and names and then writing them as separate includes in a temp file. Then it tells libconfuse to include the temp file directly. Without the ability to resolve the wildcard paths and write them to a temp file, the wildcarding feature of gmond wouldn't work. To solve the problem that you are describing, we would have to actually add wildcard capability to libconfuse.
> > > >
> > > > Might this be cleaner workaround that would work for gmond as well?
> > > > - override libconfuse's include function as you're already doing
> > > > - resolve file paths and names as you're already doing
> > > > - instead of writing that to a temp file and telling libconfuse to include that file, just tell libconfuse to include each individual file (the same filenames you're now writing to the temp file)
> > >
> > > No, libconfuse doesn't work that way. The include handler can only manipulate the file path that it is handed. So the result of the handler has to be a single absolute file path. There isn't any way to take a single file path as input into the handler and return multiple file paths back to libconfuse. The only way to do it was to write all of the individual file paths to a file and then hand libconfuse back a single file path to the new include file.
> >
> > the question is: can't the handler be rewritten to do the conversions in memory, without needing to write a temp file? This would make the process more robust. You never know when a disk is full, or goes RO.
>
> No, I tried doing that already but was unsuccessful. Libconfuse is limited in what you can do in this area.

The API libconfuse exports is limited to handling single-file includes (as documented), so it shouldn't be a surprise that it can't handle a wildcard include.

> The problem is that when libconfuse wants to read in the include file, it is in the middle of the lexer and needs to continue. A handler can't just read the file and hand it back to libconfuse through some other cfg_* call.

An alternative would be to preprocess the configuration file into a buffer in memory, resolving all includes, and then call libconfuse to parse and process the buffer instead. This would also have the nice side effect of preventing gmond/gmetric from segfaulting when there is no gmond.conf (hence using the embedded configuration) and there are files in the include path (documented in the release notes since 3.1 as the requirement to have a gmond.conf if using modpython).

> This may be a design flaw in libconfuse but it is the way it works now and we have to live with it.

Since AFAIK no libconfuse developer was ever notified of this flaw, it might as well be that our implementation is abusing their API. I will check with them and report back with any suggestions.

Carlo
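The wildcard expansion Brad describes (one include line per matching file, written somewhere libconfuse can read it as a single include) can be pictured with a few lines of shell. This is only an illustration of the idea, not gmond's actual C handler: the function name is invented, the `include ('…')` output format is assumed from typical gmond.conf files, and paths containing spaces are not handled.

```shell
# Expand a wildcard pattern into one include statement per existing
# file, the way the temp file described above is said to be filled.
# Illustrative sketch only; gmond does this in C inside libganglia.
expand_includes() {
    pattern=$1
    # Left unquoted on purpose so the shell performs the glob expansion.
    for f in $pattern; do
        [ -f "$f" ] && printf "include ('%s')\n" "$f"
    done
    return 0
}

# Usage with the conf.d directory mentioned in this thread:
#   expand_includes '/etc/ganglia/conf.d/*.conf'
```

Writing this output to a file is the step that fails on a full or read-only disk; keeping it in a variable corresponds to the in-memory preprocessing alternative suggested above.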
Re: [Ganglia-general] gmetric fails when disk is unwriteable?
On Tue, Nov 25, 2008 at 04:33:05PM -0700, Brad Nicholes wrote:
> The result was that if the wildcard produced more than 10 included files (which it easily does even in our default configuration), libconfuse choked because it thought it had hit the maximum nesting level

Our RPMs for ganglia install only 3 files in /etc/ganglia/conf.d; Gentoo has 2 and Fedora 10 (just released) has 4. I agree that 10 is rather low and that it will soon become a problem as more modules are deployed, but at least in this case it seems that one problem was traded for another.

Carlo
Re: [Ganglia-general] gmetric fails when disk is unwriteable?
On Mon, Nov 24, 2008 at 04:55:42PM -0700, Brad Nicholes wrote:
> On 11/24/2008 at 3:47 PM, in message [EMAIL PROTECTED], Ofer Inbar [EMAIL PROTECTED] wrote:
> > I tried feeding one of my custom metrics by hand:
> > [root ~]$ gmetric --name net_smtp_fin_wait2_out --value 0 --type uint8 --units 'connections'
> > /etc/ganglia/gmond.conf:94: failed to determine the temp dir
> > Parse error for '/etc/ganglia/gmond.conf'
>
> It needs a temp directory to get around some issues with libconfuse.

gmond does; gmetric needs nothing more than to know which channel to use (hence nothing in the includes), and it is blocked by this restriction only because it uses libganglia to read gmond's configuration through libgmond.

> To solve the problem that you are describing, we would have to actually add wildcard capability to libconfuse.

libconfuse is instructed to use our implementation for includes, and that implementation uses a temporary file, so this is fixable in our code. A fix for the problem reported by Ofer only needs our handler modified so that failures to create the temporary files used to handle includes are not treated as fatal, as in:

Committed revision 1922

Carlo
Re: [Ganglia-general] gmetric fails when disk is unwriteable?
On Fri, Nov 21, 2008 at 11:33:05PM -0500, Ofer Inbar wrote:
> What's the dependency that causes gmetric to require that the filesystem the CWD is on be writeable?

As explained by Brad, it is not the CWD that needs to be writeable but a TMPDIR (which for root can also be the current directory), and that is detected by APR. Recent Linux (since around kernel 2.4.16) requires a ramdrive mounted in /dev/shm, so one way to work around this problem is to define:

TMPDIR=/dev/shm

The 3.0 gmetric is not affected and so could also be used as an alternative.

Carlo

PS. SysVinit workaround for gmond:

Committed revision 1923
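A minimal sketch of the TMPDIR workaround described above, assuming a POSIX shell. `pick_tmpdir` is a hypothetical helper, not part of ganglia or APR; it prints the first of the directories it is given that exists and is writeable, so a wrapper script can fall back to /dev/shm only when the usual temp directory is unusable.

```shell
# Hypothetical helper: print the first candidate directory that exists
# and is writeable; return non-zero if none qualifies.
pick_tmpdir() {
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Assumed usage (the gmetric flags are the ones quoted in this thread):
#   TMPDIR=$(pick_tmpdir /tmp /dev/shm) \
#       gmetric --name net_smtp_fin_wait2_out --value 0 --type uint8 --units connections
```

Note that `[ -w ]` only checks permissions as the shell sees them; a filesystem that has silently gone read-only may still need an actual write test.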
Re: [Ganglia-general] Gmond Stops Sending Metric after Ntpd Adjusted System Time
On Fri, Nov 14, 2008 at 05:03:19PM +0800, ada wrote:
> After ntpd adjusts the system time from 8pm to 12am, gmond stops sending any metric out. With the help of tcpdump, I don't see gmond sending any udp message out.

You mean your node clock was off and, when ntpd started, it was adjusted forward 4 hours, backward 8 hours or backward 20 hours?

> I guess there is some mechanism in gmond that assure gmond sending information certain time after last sending.

gmond's main job is to schedule metric collections, and for that it needs a monotonically increasing time source.

> gmond will find that last sending time is in the future. In this way, it will wait several hours to send out information again.

This is most likely true, and a bug if that is the case (I haven't seen a gmond recover from this though, so it might be even worse). In any case, keeping the time in sync across your whole cluster is a prerequisite for a good ganglia setup, as the local time of the nodes is trusted in several places.

Carlo
Re: [Ganglia-general] High system load when gmond is running
On Thu, Nov 13, 2008 at 03:08:46PM -0800, [EMAIL PROTECTED] wrote:
> I looked into it further and it looks like my problem isn't gmond its gmetad.

gmetad is an IO-intensive application and will therefore raise your load, because of blocked processes, if your disk can't keep up with it. The directory where your RRDs are stored (usually /var/lib/ganglia) should be able to handle high IOPS for all the updates to the RRD files of all the nodes you are collecting data from. In the worst case (since the RRDs are usually small and systems usually have more memory than they need) you could probably run it on a ramdrive.

Carlo
Re: [Ganglia-general] Help with HP-UX please
On Fri, Nov 14, 2008 at 07:59:48AM +, Lockwood, David wrote:
> I am looking for a how-to that is specific for compiling gmond for HP-UX 11.11 on PA-RISC please

The latest release of ganglia (3.1.1) won't work on HP-UX, but the older release from the maintenance branch (3.0.7) should compile cleanly, at least with gcc, by doing:

$ ./configure
$ make
$ make install

Sadly, AFAIK none of the developers has access to an HP-UX system, so providing a binary will most likely be impossible. If you find any issues and are willing to help troubleshoot them, it might be possible to get them resolved though.

Carlo
Re: [Ganglia-general] Help with HP-UX please
On Fri, Nov 14, 2008 at 09:01:18AM +, Lockwood, David wrote:
> Will a 3.0.7 gmond process be able to provide data for a 3.1.0 gmetad process?

Yes, as long as you keep it talking only with other 3.0 gmond (if using multicast, you will need to use a different multicast IP for your 3.0 gmond and a different data source in your gmetad to poll them). A 3.1 gmetad is able to read the XML generated by a 3.0 gmond, but the XDR protocol that gmond instances use to talk to each other is incompatible between 3.0 and 3.1.

> I had issues with this on the Windows servers that I monitor and had to make sure that the versions were the same.

The same applies to Windows, but considering that 3.1 has far better support for Windows metrics than 3.0, it is probably a good idea to upgrade them to 3.1 anyway.

Carlo
Re: [Ganglia-general] BOF at LISA?
On Fri, Nov 07, 2008 at 03:46:41PM -0800, Bernard Li wrote:
> On Fri, Nov 7, 2008 at 3:38 PM, Ofer Inbar [EMAIL PROTECTED] wrote:
> > I don't know that most Ganglia users who are attending LISA are also on this email list - I would expect the number who aren't on this list is greater than the number who are.
>
> If you have a good idea on how to reach out to them in the next few days, I'm all ears.

From what I recall, BOFs are pretty easy to set up; you can just follow the instructions from:

http://www.usenix.org/events/lisa08/bofs.html

Most of the people interested in monitoring might already be lined up for the BOFs in the Hampton room on Wednesday anyway, and the BOF schedule is published to all participants at registration as well.

Carlo
Re: [Ganglia-general] Gmond losing data?
On Thu, Oct 30, 2008 at 12:18:07PM -0400, Lars Kellogg-Stedman wrote:
> Hello all, I'm experiencing some odd problems with Ganglia (3.1.1, under CentOS 5). Sometimes, gmond stops collecting data from remote hosts.

If you are using unicast and you restarted your collector (the gmond that gmetad is pulling from), then you are running into a known issue with the way metadata is updated, mentioned in the release notes:

http://ganglia.wiki.sourceforge.net/ganglia_release_notes

The suggested configuration is to set send_metadata_interval to a non-zero value.

> Gmond merrily ignores the data. It seems that restarting the *local* gmond does not correct the problem, but restarting the *remote* (sending) gmond does make things start working again.

Because the metadata is resent.

Carlo
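To check a sender for the setting recommended above, something like the following could be used. This is only a sketch: `metadata_interval` is a hypothetical helper, not a ganglia tool, and it assumes the setting sits on its own line in gmond.conf and that any trailing comment on that line contains no digits. It prints the configured value, or 0 when the setting is absent, which is treated here as "metadata is only sent at startup".

```shell
# Hypothetical helper: print the send_metadata_interval found in a
# gmond.conf read from stdin, or 0 if it is not set.
#
# The setting being looked for would appear in the globals section, e.g.:
#   globals {
#     send_metadata_interval = 30   (30 is an arbitrary example, in seconds)
#   }
metadata_interval() {
    awk -F= '
        /^[ \t]*send_metadata_interval[ \t]*=/ {
            gsub(/[^0-9]/, "", $2)   # keep only the digits of the value
            value = $2
        }
        END { print (value == "" ? 0 : value + 0) }
    '
}

# Assumed usage: metadata_interval < /etc/ganglia/gmond.conf
```

A result of 0 on a unicast sender suggests it will not resend its metadata after the collector restarts.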
Re: [Ganglia-general] Monitoring NFS share disk usage
On Tue, Oct 28, 2008 at 05:30:25PM +1100, Adam Mitchell wrote:
> #!/bin/bash
> VALUE=$(df /home/ | grep /home |awk '{print $3 }')
> gmetric --name disk_nfs_used --value $VALUE --type uint32 --units Bytes

Not relevant to your problem, but the units here should be KB.

> gmond is running on the head node. However, there doesn't seem to be any rrd's being produced.

The RRDs are generated by gmetad, which in turn reads the metrics from your gmond. For that to work in your setup you need to configure (/etc/gmond.conf) the gmond on your head node with the same cluster name (shiva) and collector as your work nodes, and then confirm that your gmetad is configured (/etc/gmetad.conf) to pull the status for your cluster (shiva), including your new metric. A telnet to port 8649 on your collector (any gmond if using multicast, or the one your gmetad points to if using unicast) should dump an XML description of your cluster and include the new metric you just created with gmetric inside the host definition of your head node.

> I have added the flowing lines to the cluster_view.tpl
> <IMG HEIGHT=147 WIDTH=395 ALT="{cluster} DISK" SRC="./graph.php?c=shiva&amp;h=shiva.edag.cluster&amp;v=233.904&amp;m=disk_nfs_used&amp;r=hour&amp;z=medium&amp;jr=&amp;js=&amp;st=1225161877&amp;vl=GB"></TD>

v, st, z and r are better pulled from the environment, as you will otherwise be hardcoding some of the values for your graph. Since you are trying to import a metric graph into a cluster view, that might not work correctly anyway, so changes to graph.php might be needed too.

Carlo
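The units remark above can be made concrete with a slightly more defensive version of the quoted script. This is a sketch under assumptions, not a tested production script: `used_kb` is a hypothetical helper, and it relies on `df -Pk`, which reports 1024-byte blocks with one line per filesystem, so long NFS device names do not wrap and the `grep` is no longer needed.

```shell
# Hypothetical helper: print the "Used" column (1024-byte blocks) of the
# first filesystem in `df -Pk` style output read from stdin.
used_kb() {
    awk 'NR == 2 { print $3 }'
}

# Assumed usage on the head node:
#   VALUE=$(df -Pk /home | used_kb)
#   gmetric --name disk_nfs_used --value "$VALUE" --type uint32 --units KB
```

Since the value is in KB, a uint32 tops out at roughly 4 TB of used space; whether that is enough depends on the size of the share.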
Re: [Ganglia-general] Ganglia support in AIX
On Mon, Oct 20, 2008 at 09:17:08AM -0400, Sai p Seshasayee wrote:
> We have been using Ganglia to monitor our cluster on Linux and now we are planning to use Ganglia to monitor our AIX clusters. I have visited the Ganglia website and found no rpms for AIX (version 3.1.1).

3.1.1 runs on AIX but has no way yet to use DSO metrics (it could probably run python metrics though), and because of that 3.1.1 is almost equivalent to 3.0.7 (except that some bugs were fixed, others were introduced, and the disk metrics use a different unit).

> I am just interested to know if Ganglia is supported in AIX and whether you are planning to build rpms for AIX?

As pointed out before, Michael has RPMs for the 3.0 version and is probably working on getting some for 3.1 as well, but you could probably use README.AIX and other documentation to generate your own if needed.

Carlo
Re: [Ganglia-general] Deploying a gmetric collection
On Sun, Oct 19, 2008 at 02:46:55PM -0700, Ed Greenberg wrote:
> I looked at the source and see that the GAUGE/COUNTER choice is a Ganglia 3.1 feature. It's not present in the 3.0 code. So I'm going to give 3.1.1 a try.

Beware that 3.0 and 3.1 are NOT compatible at the XDR level, and therefore you can't mix 3.0 and 3.1 gmond in the same cluster (as defined by the multicast or unicast address used for a collector). For more details on how to upgrade, check the release notes at:

http://ganglia.wiki.sourceforge.net/ganglia_release_notes

Carlo
Re: [Ganglia-general] adding custom graphs to ganglia-web view
On Mon, Sep 15, 2008 at 02:29:37PM -0400, Ofer Inbar wrote:
> I can add the names of my custom graphs to get_context.php, and then they show up in the menu at the top of the cluster view that lets me select what graph to show for each node at the bottom.

Why not add them to $optional_graphs in web/conf.php instead? I agree with you that in the end it might make more sense to treat all the reporting graphs the same, regardless of whether they are custom or part of the core, just like metrics are treated now in 3.1.

> However, it'd also be really useful to:
> 1. See the custom graphs relevant to a cluster show up in the cluster view at the top, summarizing or totaling the whole cluster

Agreed, but you will have to be able to configure which reports you want per cluster (as you will most likely need different reports in each cluster). Otherwise you can just change the template (web/templates/default/cluster_view.tpl) to show the list of reports you are interested in on all cluster pages (in that case it is probably a good idea to create your own template name and refer to it in conf.php).

> 2. See those graphs in the grid view, in each cluster's summary

If it is the same list for all clusters, this can be done in the template (web/templates/default/meta_view.tpl). If you want a different list per cluster then you could probably use the same configuration described above, but it could get difficult to handle and to understand, as space per cluster is limited in the grid view and it is IMHO useful to have all clusters reporting the same metric vertically so you can see global trends easily.

> 3. For each node, have it default to showing the graphs we deem most relevant to its cluster, in the first section of the node view

OK, but you will also need to know how many reports to show to keep the layout useful, as again the layout will need to adjust dynamically.

> Are there easy ways to do this with the existing ganglia-web package?

No, because there is no per-cluster configuration where this information could be stored yet, but as I mentioned before, tweaking the templates could help.

> Is anyone working on making this possible?

In trunk there have been changes to the grid view to show the four reports from the cluster view in the same order (plus adding zoom), as you can see from BUG184:

http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=184

As part of that implementation there has been some discussion about doing something similar to what you proposed, but no code has been produced AFAIK (CC'ing all probably interested parties in case code was produced but not yet made public).

> Note: I'll probably send some of my custom metric packages graphs for inclusion in contrib, but I'll be much more motivated to polish them up for public consumption if I know people could use them easily.

ganglia development is always open, so forward your patches and suggestions through bugzilla or the ganglia-developers list, which is where this could be accomplished.

Carlo

PS. Moving the thread to ganglia-developers to follow up with development suggestions and code. Replying might require subscribing to that list first.
Re: [Ganglia-general] gmond stops collecting client metrics
On Thu, Sep 11, 2008 at 01:02:10PM -0500, Martin Hicks wrote:
> Hi, ganglia-3.1.1 on x86_64 linux. I just upgraded my RPMS to 3.1.1 this morning to see if there were any fixes for this problem. I've put up configs, XML and a PNG here: http://bork.org/~mort/ganglia/ The node where this picture is taken has two blades reporting to it. When gmond restarts on this node then the blades no longer seem to successfully report metrics.

This is a known problem in 3.1 when using unicast, as you do in your setup (the last bullet point in Important Notes):

http://ganglia.wiki.sourceforge.net/ganglia_release_notes

To work around it you have to set send_metadata_interval to a non-zero value on your nodes.

> I went back to gmond-3.0.7 and the problem disappears.

3.0 never had this problem, as there is no metadata handling there.

Carlo
Re: [Ganglia-general] Help: nobody user problem
On Mon, Sep 08, 2008 at 04:17:20PM -0700, Bernard Li wrote:
> On Mon, Sep 8, 2008 at 4:12 PM, Lee Amy [EMAIL PROTECTED] wrote:
> > Thank you all. My cluster uses Cent OS 4. And your description is really clear, thank you very much!
>
> In the future you probably want to hit 'reply-all' when replying so it goes to the mailing-list as well. The email you sent only got to me.

Also, for some more interactive help, you might want to use the IRC channel #ganglia on freenode as described in:

http://ganglia.info/?page_id=68

Bernard, one of the core developers, hangs out there frequently, along with other users/developers of ganglia who might be able to help as well.

Carlo
Re: [Ganglia-general] gmond 3.1 not reporting host data
On Mon, Sep 08, 2008 at 01:42:16PM -0500, Ryan Robertson wrote:
> I too am having trouble getting the gmond collector report data of itself.

I presume from the subject that you are referring to some other report of ganglia 3.1 not being able to get its own data, but the behaviour described in the body is from 3.0. Could you provide a reference to the original report, as it might be an unrelated problem? Running `gmond -d10` should generate a log of what is going on that could help trace the problem, but from the configuration shown below it might be just an unintended misconfiguration.

> I've tried mulitple variations on the gmond.conf, but can't seem to find a combination that works. This is on power5 AIX 6.1 running ganglia-gmond-3.0.7-1.
>
> /* Feel free to specify as many udp_send_channels as you like. Gmond used to only support having a single channel */
> udp_send_channel {
>   host = 127.0.0.1
>   port = 8649
>   ttl = 1
> }
>
> /* You can specify as many udp_recv_channels as you like as well. */
> udp_recv_channel {
>   mcast_join = 239.2.11.71
>   port = 8649
>   // bind = 239.2.11.71
>   bind = 10.50.54.31
> }
>
> /* You can specify as many tcp_accept_channels as you like to share an xml description of the state of the cluster */
> tcp_accept_channel {
>   port = 8649
> }

You have to use the same mode (multicast or unicast) for udp_send_channel and udp_recv_channel. We do allow mismatching sets to be configured, which results in problems like the one you are observing (that is a bug), but mainly because the flexibility of the configuration allows for some unusual settings that we would otherwise not be able to predict (like having additional unicast messages sent somewhere other than a gmond for reporting). The following should work in your case:

* plain multicast configuration as used by default (you need multicast support working for your system and enabled/routed correctly)

  udp_send_channel {
    mcast_join = 239.2.11.71
    port = 8649
    ttl = 1
  }
  udp_recv_channel {
    mcast_join = 239.2.11.71
    port = 8649
  }

* plain unicast configuration through localhost (not what you want in the long run, as it will use localhost as the node name)

  udp_send_channel {
    host = 127.0.0.1
    port = 8649
  }
  udp_recv_channel {
    port = 8649
  }

* plain unicast configuration through a working interface (assuming that 10.50.54.31 is configured on one of your interfaces)

  udp_send_channel {
    host = 10.50.54.31
    port = 8649
  }
  udp_recv_channel {
    port = 8649
  }

Carlo
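Since gmond accepts mismatching channel modes without complaint, a quick check of a gmond.conf can save some debugging. The following is only a rough sketch: `check_channel_modes` is a hypothetical helper, not something shipped with ganglia, and it assumes a single udp_send_channel and a single udp_recv_channel, with the words `mcast_join` and `udp_*_channel` not appearing in misleading places such as comments right before a channel block. It prints "consistent" when both channels use multicast or both use unicast, and "mismatch" otherwise.

```shell
# Hypothetical helper: read a gmond.conf on stdin and report whether the
# send and receive channels agree on multicast (mcast_join) vs unicast.
check_channel_modes() {
    awk '
        /udp_send_channel/ { block = "send" }
        /udp_recv_channel/ { block = "recv" }
        /mcast_join/ && block != "" { mcast[block] = 1 }
        index($0, "}") { block = "" }   # a closing brace ends the block
        END {
            if (mcast["send"] == mcast["recv"]) print "consistent"
            else print "mismatch"
        }
    '
}

# Assumed usage: check_channel_modes < /etc/gmond.conf
```

Applied to the configuration quoted in the message above (unicast send to 127.0.0.1, multicast receive on 239.2.11.71), this would be expected to print "mismatch".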
Re: [Ganglia-general] [ANNOUNCEMENT] Official release of Ganglia 3.1.1
On Tue, Sep 09, 2008 at 01:53:43PM -0400, Ofer Inbar wrote:
> Some questions regarding upgrading:

I was going to realign the release_page to make that a little more obvious, but it was a little late to make any major changes to it, which is why it is not explicitly there.

> 1. Can gmond 3.1.1 nodes coexist compatibly in the same cluster with gmond 3.1.0 nodes?
> 2. Can a gmetad 3.1.1 use gmond 3.1.0 nodes as data sources? Can a gmetad 3.1.0 use gmond 3.1.1 nodes as data sources?

Yes, 3.1.0 and 3.1.1 are completely compatible, and if you are using 3.1.0 you are encouraged to move to 3.1.1 ASAP (the 3.1.0 packages from both Gentoo and Fedora already include some of the critical fixes in 3.1.1, for example).

Carlo
Re: [Ganglia-general] gmond 3.1 not reporting host data
On Tue, Sep 09, 2008 at 01:07:10PM -0500, Ryan Robertson wrote:
> My goal was to have multiple nodes reporting to a central location (10.50.54.31) also running gmond and reporting info on itself as well.

Then you need all gmond configured with the same cluster name and set up to use unicast with 10.50.54.31 as the collector. The only one that needs a tcp_accept_channel, so that you can query it on TCP/8649 (which is what gmetad will do), is the collector.

> To accomplish this, wouldn't I configure the clients that will be sending data something to this effect:
>
> /* Feel free to specify as many udp_send_channels as you like. Gmond used to only support having a single channel */
> udp_send_channel {
>   host = 10.50.54.31
>   port = 8649
>   ttl = 1
>   mcast_if = en2
> }

No, you are mixing unicast and multicast in that configuration (which probably shouldn't be allowed to work but does anyway, as explained before). You need instead:

  udp_send_channel {
    host = 10.50.54.31
    port = 8649
  }

> The 10.50.54.31 was the original gmond.conf posted.

Because this one has to collect all the data sent to it by the other nodes, it also needs:

  udp_recv_channel {
    port = 8649
  }

and because it is the only one that has the information for all the nodes and will be queried by gmetad, it should also have:

  tcp_accept_channel {
    port = 8649
    timeout = -1
  }

> I'm assuming the gmetad server (10.50.54.48) reaches out to the nodes defined in gmetad.conf for information.

In a unicast configuration it should only point to the collector gmond, because that is the only one collecting all metrics from the cluster, with something like:

  data_source ganglia 10.50.54.31

where ganglia hopefully describes the use of your cluster and matches what you configured in your gmond.conf as the cluster name with something like:

  cluster {
    name = ganglia
  }

> Is there a method in which they only are aware of their own information?

Not sure what you mean here, but if you use each node as its own collector then each will only have its own information. By definition each will then also be its own independent cluster (even if the cluster name is the same on all of them), and you will have no way to collect them in a single view in the frontend, because gmetad doesn't aggregate data from multiple sources.

> Lastly, I have a few nodes where HACMP is in place using IP Aliasing on a single interface. In these cases, i need to bind gmond.conf to a particular IP.

If you are using unicast you won't need to do that. If you are using multicast you will need to stay with 3.0.7 or wait for 3.1.2 to be able to use mcast_if, or change your routing table to ensure that multicast is routed through the virtual interface. Also remember that you can't mix 3.0 and 3.1 gmond in the same cluster, so you have to use the same version on all nodes (including the collector). Look at the previous reply with examples and at gmond.conf(5) for more details on configuring ganglia.

Carlo
Re: [Ganglia-general] Help: How to add hosts?
On Sun, Sep 07, 2008 at 01:22:05PM +0800, Lee Amy wrote:
> How to add hosts to Ganglia system? Foe example, I have 5 nodes, one master and four slaves, what if I want to see their system performance, how could I add hosts? In which configuration file?

If you install gmond on all 5 nodes, plus gmetad and the web application on the master node, and start them all, you should be able to see their system performance. It is important that you familiarize yourself with how ganglia works and how to configure each piece, and for that the following wiki could help (even if it is a little more oriented towards running ganglia on IBM hardware):

http://www.ibm.com/developerworks/wikis/display/WikiPtype/ganglia

Be sure also to read all the README files and man pages that came with the software, so that you can tune it to your environment.

Carlo
Re: [Ganglia-general] Help: APR problem
On Sun, Sep 07, 2008 at 12:02:59PM +0800, Lee Amy wrote:
> Hi, I'm a newbie in Ganglia, I've installed apr correctly, the location is /usr/local/apr by default. But when I run make to compile the Ganglia, it shows following error messages and terminated.
>
> In file included from scoreboard.c:7:
> gm_scoreboard.h:4:23: apr_pools.h: No such file or directory
> scoreboard.c:8:22: apr_hash.h: No such file or directory
> scoreboard.c:9:25: apr_strings.h: No such file or directory
> make[2]: *** [scoreboard.lo] Error 1
> make[2]: Leaving directory `/root/tmp/ganglia-3.1.0/lib'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/root/tmp/ganglia-3.1.0'
> make: *** [all] Error 2
>
> But I check the /usr/local/apr/include directory, the directory contains the necessary header files. Could you tell me how to fix this problem?

Before running `make` you have to configure your source so that it is able to find all the dependencies it needs, and when you do that you have to tell it where you installed APR by using something like:

$ ./configure --with-libapr=/usr/local/apr/bin/apr-1-config

You might also need to give the location of other libraries, like libconfuse or expat, if they were not already installed.

Carlo

> Thank you very much~
> Best Regards,
> Amy Lee
Re: [Ganglia-general] gmetad bug when gmond host hangs
On Tue, Sep 02, 2008 at 12:54:07PM -0400, Ofer Inbar wrote:

Brad Nicholes [EMAIL PROTECTED] wrote: Thanks Carlo, this is some good feedback. I know that both Bernard and Cos have reported having issues with this bug. Could either (or both) of you independently confirm that this patch fixes the problem?

To reproduce this bug, I'd need a host in a state where it accepts TCP connections but then leaves them hung, which is not something I want to do on any of my production hosts,

It shouldn't be a problem at all if your failover sources are set up correctly anyway. You don't need to crash the machine, just stop the gmond process by running something like:

# kill -STOP `pidof gmond`

To fix it after you are done, you can do:

# kill -CONT `pidof gmond`

You will need a patched gmetad though, but it doesn't need to be the same one you have in production, even if I'd expect you to roll it out there quickly if this problem is really a showstopper for your 3.1 production deployment, as Brad seemed to think.

If anyone out there on the list has a way to set up a Ganglia testing cluster and then deliberately put one of the data sources in this state, wanna test out this patch?
That is what I did, but I have to admit that my test environment was tiny, as I only used 1 linux box (my gentoo linux workstation) and 1 windows box (a windows vista box where I build my windows ganglia binaries) configured together in one single cluster running 3.1. The failover source wasn't set up correctly though, as I don't have a way to synchronize the clocks between the two, they are in different VLANs, and my little linksys switch can't do multicast routing.

Brad is probably looking for someone else to come up with a more realistic, production-like test, but if no one can do that, I might be able to configure it by moving around some cables and trying to set up a more realistic failover scenario (running linux on the windows box), even if that probably defeats the independent confirmation part of the testing request.

Carlo
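The state being reproduced with `kill -STOP` (a host that completes the TCP handshake but never sends any data) can also be imitated on one machine without touching a real gmond. The following is a minimal sketch, not gmetad's code: `probe` and `make_hung_listener` are hypothetical names, and the listener relies on the operating system accepting connections into the backlog of a socket that never calls `accept()`.

```python
import socket


def make_hung_listener():
    """Listen on a free local port but never accept() or send anything.

    The kernel still completes the TCP handshake for queued connections,
    which is roughly what a SIGSTOPped daemon looks like from outside.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    return srv


def probe(host, port, timeout=0.5):
    """Classify a source as 'down', 'hung', 'empty', or 'ok'."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "down"  # no TCP connectivity at all
    with conn:
        conn.settimeout(timeout)
        try:
            data = conn.recv(4096)
        except socket.timeout:
            return "hung"  # connected, but no data arrived within the timeout
        except OSError:
            return "down"
    return "ok" if data else "empty"
```

Calling `probe("127.0.0.1", make_hung_listener().getsockname()[1])` is expected to report the hung case, while a port with nothing listening reports the down case. A connect-only check cannot tell the two apart, which is the gap discussed in this thread.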
Re: [Ganglia-general] gmetad bug when gmond host hangs
On Sat, Aug 30, 2008 at 10:09:02AM -0600, Brad Nicholes wrote: On 8/30/2008 at 12:25 AM, in message [EMAIL PROTECTED], Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote: On Fri, Aug 29, 2008 at 02:40:00PM -0400, Ofer Inbar wrote:

Should this have made it into 3.1, or 3.1.1? It doesn't look like it.

There is a fix in trunk now with r1738 and, unless something goes wrong with it, it will most likely be released with 3.1.2 and 3.0.8.

The proposed backport patch for 3.1.x has been updated in the BUG and officially requested for inclusion in 3.1. Beware that it includes 1 extra unrelated change that is needed to prevent future conflicts when backporting (otherwise mostly irrelevant, especially if you are making your own package), but it also includes additional changes that simplify the logic and avoid a possible logic failure which could result in gmetad crashing, so using this newer version is encouraged: http://bugzilla.ganglia.info/cgi-bin/bugzilla/attachment.cgi?id=165action=view

3.1.1 is already in testing and, since this bug is not a showstopper for that specific release, I'd be surprised if the release manager decides it should be backported to it, but that shouldn't prevent you from patching your own package with the proposed patch if you don't want to wait.

If we are confident enough that the patch for this bug solves the problem

I am fairly confident that the patch resolves the problem reported in BUG92, which matches the description of the problem from Cos and is easy to replicate by just sending a SIGSTOP to gmond; otherwise I wouldn't have proposed it here, in bugzilla and in the STATUS file.

then I have no problem scrapping 3.1.1, tagging and rolling 3.1.2 and restarting the testing period. The whole point of the testing period is to flush out problems like this and then determine if the fix is important enough to retag and retest.
Agree, but doing so will delay releasing the next version of 3.1 and also, indirectly, the next release for 3.0 (as I won't start that until 3.1 is out, to avoid confusion and overstressing our limited testing resources). 3.1.1 includes a fix for a showstopper bug (gmetad crash in hierarchical configuration) and a fix for a high-severity bug (instability with tcpconn.py). Both have already been backported in the fedora and gentoo distribution packages, but not in debian, in our own packages, or for anyone using 3.1 from sources (all other architectures where there are no packages available yet), so those users are waiting for a 3.1.1 release.

We need some feedback on how serious this problem is

For 3.0, hitting it requires that you also have another problem, such as:

* gmond hangs (either because of another gmond bug or because of a misconfiguration of gmond that has too many or too slow consumers for the number of tcp_accept_channel defined and their configured timeout values, which would most likely affect all the sources as well)
* the system hangs (either a kernel level deadlock, some security/network guy playing games on the cluster, or bad hardware)
* a gmond becomes unresponsive because it is collecting data from too many nodes or getting swapped out (it should be moved to another system or segmented further)

3.1 has the same sources of problems, except that a misconfigured gmond is more likely to fail because of the increase in the XML payload, and then only if ganglia is misconfigured: your gmond should be queried over the local LAN by a local gmetad when possible, with hierarchical gmetads extending the setup over WANs. If it absolutely needs to be queried over a WAN, a timeout should be configured for the tcp_accept_channel, with -1 (which means no timeout) sometimes recommended for extreme cases.

how confident we are in the fix.
The fix was designed as the minimally intrusive change that accomplishes the desired objective without reverting to the pre-BUG27 scenario, and I expect it to evolve further in the near future, as there are still open issues around that code that will need to be addressed. This can get a little technical and is better suited to the ganglia-developers list, but it is added here to complete the explanation in this context; future discussion should move to the ganglia-developers list:

* it is not implemented for gmetad-python yet
* it is not consistently implemented, as not all failures will skip a source; some will instead default to scanning all sources, which was the problem that BUG27 was meant to fix, as that could generate dips.
* it overloads the last_good_index identifier even though that doesn't really match its original intention, as we haven't yet confirmed that the source is working.
* I suspect the dead identifier is misused (which explains the never ending failure messages from gmetad) and therefore some refactoring around that code might be needed (which could include adding some other features)
* I suspect
Re: [Ganglia-general] gmetad bug when gmond host hangs
On Sat, Aug 30, 2008 at 11:00:40AM -0400, Ofer Inbar wrote: Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote:

| --- Additional Comment #2 From Timothy Witham 2008-06-05 10:38
| My patch still loses if we are talking to a gmond affected by Bug#38.
| In that case, we receive incomplete data, but since it is some data,
| we keep talking to that host every time. Maybe we should just talk to
| a random host every time. Better to fix Bug#38 though...

That would have the effect of gmetad getting incomplete, old data from that gmond, but that seems to be a different problem.

yes, and that is why it has a different bug.

What bug is that?

The part with the bug number was snipped out of this email, but it is reintroduced below.

I'm not talking about fixing gmond, I'm talking about having gmetad sense when gmond from one member of the cluster is giving old incomplete data, and trying another cluster member to see if it can get newer data. I didn't see a bug for that, I just saw a note in the timeout poll bug mentioning that a solution for it wouldn't handle that situation, and saying that's okay, that *should* be separate.

Fair, and you are right that I should probably have put a comment there with the reference to the enhancement bug as well, but since I put it in the email I thought it wasn't needed, as most interested parties were hopefully reading this thread or CC'd on the new bug. It has been updated now anyway.

Is there indeed a bug, or a plan by someone, to get gmetad to do this?

Basically, it is the enhancement request to do some sort of intelligent load balancing between sources, which is now being tracked in BUG208: http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=208

To solve it in gmetad, we'd want gmetad to have some way of judging whether the data it's getting from a gmond is fresh and current, which is not the same as judging whether it actually *got* the data from the gmond.
the problematic code was introduced as a fix for BUG27 and was indeed trying to detect if gmetad was able to use the source or not by looking at the obvious lack of TCP connectivity. BUG92 showed that the heuristic was incomplete because it didn't cover gmond/systems that are hung but still respond to a TCP three way handshake

That was the *original* discussion on this thread, and that is indeed what the timeout poll bug is trying to address. What I'm saying is in reference to Timothy Witham's last note, which is a rather different matter.

I was trying to explain how the code evolved and how this problem was introduced, as well as the proposed solutions that will hopefully come forward in the next 2 years of the development cycle ;)

Previous to BUG27 (around 2.5.7 or the first releases of 3.0) gmetad failed over through all sources sequentially until it got valid data. The problem, of course (as explained in that bug), is that when the first source for your datasource was down there was an increased delay that resulted in gaps, as the time needed to get data from that datasource grew in proportion to the number of sources that were down at the beginning of the datasource (which can get even worse when some security guy decides to install a firewall that drops packets when misconfigured).

Timothy's solution to BUG92 somehow reintroduces that failover mechanism by just choosing the source randomly, but it still doesn't address the health check issue, so it sometimes selects the bad source and therefore results in more gaps.
BUG208 will hopefully address all the current issues by implementing a real load balancing solution for sources, which also does health checking of some sort (still to be designed) to ensure that bad sources aren't used and are marked down, or in some cases to correct the result (as when accounting for time drifts between sources).

So I'm confused by your response - not that anything you say is confusing, I'm just confused by how it relates to what you responded to?

Don't worry, I am used to having people confused by my responses, and I sometimes get myself into long email threads repeating the same concept again and again in different ways just because of that. But as Cassandra said recently, the trick is in reading and re-reading the message until it finally clicks or, as you did, asking for some clarification, at least until I can figure out how to use punctuation correctly, because learning English from BASIC left me with some strange aversion to semicolons and to using dashes inside words.

Carlo
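The failover behaviour described in this thread can be sketched as follows. This is an illustration only, not gmetad's actual C code: `poll_datasource` and `fetch` are hypothetical names, and treating a read timeout the same as a refused connection is the behaviour being discussed for BUG92, expressed here by catching `OSError` (which covers both).

```python
def poll_datasource(sources, fetch, last_good_index=0):
    """Try each source once, starting with the one that worked last time.

    sources: list of source addresses for one data_source.
    fetch:   callable returning the data for a source, or raising OSError
             (connection refused, timeout, ...) when the source is unusable.
    Returns (data, index_used), or (None, last_good_index) if all failed.
    """
    count = len(sources)
    for step in range(count):
        index = (last_good_index + step) % count
        try:
            return fetch(sources[index]), index
        except OSError:
            # Skip this source and move on to the next one in the list.
            continue
    return None, last_good_index
```

Starting from the last good source avoids paying the delay for dead sources at the head of the list on every poll, which is the gap problem attributed to the pre-BUG27 sequential scan. It does nothing about a source that answers with stale or incomplete data, which is the health check question left to BUG208.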
Re: [Ganglia-general] gmetrics not showing up in cluster-context metric list
On Fri, Aug 29, 2008 at 07:43:08PM -0700, Brad Fino wrote:

Gmetric data is being reported. If I go to the specific nodes context, any info piped through gmetric is collected and reported correctly.

Good, your data is still there, and you can get it reported by manually changing the m parameter in the url to whatever metric you are interested in.

However, in the cluster_metrics drop down list I can only get Ganglia's default metrics displayed.

This list is generated dynamically from the first host of the cluster, on the assumption that a metric present there is present on all other nodes and that all nodes have the same set of metrics. This usually works fine, even if it is definitely buggy and will be a source of surprises now that 3.1 encourages custom metrics, which could be missing from the list, or be in the list and selectable even if the node doesn't have them (generating a broken link).

Using 3.0.7 ... this was working for months until I had to reboot the server and then when gmond and gmetad came back up, my custom gmetrics in the drop down disappeared.

Check why the first host in your cluster output (telnet 127.0.0.1 8651 on your gmetad) doesn't have all the metrics defined (maybe gmetric is not being called there anymore after the reboot), or restart your gmond until the first host has them all.

Any ideas ?

don't reboot your clusters ;)

Carlo
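The check suggested above (look at the first host in the gmetad XML output and see which metrics it lacks) can be sketched in Python instead of reading the telnet output by eye. This is a rough sketch under assumptions: `fetch_xml` and `missing_from_first_host` are hypothetical names, "first host" is taken to mean the first HOST element inside each CLUSTER element in document order, and port 8651 on 127.0.0.1 is the one mentioned in the reply.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_xml(host="127.0.0.1", port=8651, timeout=5.0):
    """Read the whole XML dump that gmetad writes on its XML port."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            chunk = conn.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)


def missing_from_first_host(xml_bytes):
    """Map cluster name -> (first host name, metrics other hosts have but it lacks)."""
    root = ET.fromstring(xml_bytes)
    report = {}
    for cluster in root.iter("CLUSTER"):
        hosts = cluster.findall("HOST")
        if not hosts:
            continue
        first = {m.get("NAME") for m in hosts[0].findall("METRIC")}
        others = set()
        for host in hosts[1:]:
            others.update(m.get("NAME") for m in host.findall("METRIC"))
        report[cluster.get("NAME")] = (hosts[0].get("NAME"), sorted(others - first))
    return report
```

On a gmetad host, `missing_from_first_host(fetch_xml())` would list, per cluster, the metrics that some other node reports but the first one does not; a custom gmetric showing up there would be consistent with the drop down symptom described above.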
Re: [Ganglia-general] ganglia on Solaris
On Thu, Aug 28, 2008 at 11:57:34AM -0700, W S wrote:

Hi, Does anyone know where Ganglia is reading its data from (especially Disk I/O stats) on Solaris? Is it somewhere under /proc or just iostat output?

It reads it from the Solaris kernel through a public reporting interface known as kstat: http://developers.sun.com/solaris/articles/kstatc.html

Reading kernel statistics through /proc is only done on Linux (as that is its public reporting interface), and there is no implementation of IO metrics as part of the core metrics, even if there are several addons using gmetric or 3.1 ganglia modules (in C or python) that have been developed to do so.

Carlo
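As a rough idea of what a gmetric based addon looks like, the sketch below only builds and runs the `gmetric` command line; where the value comes from (kstat, iostat, or anything else) is deliberately left out. The helper names and the `disk_reads` metric in the usage comment are made up for illustration, while `--name`, `--value`, `--type` and `--units` are regular gmetric options.

```python
import subprocess


def gmetric_argv(name, value, metric_type="uint32", units=""):
    """Build the argument list for one gmetric call."""
    argv = ["gmetric", "--name", name, "--value", str(value), "--type", metric_type]
    if units:
        argv += ["--units", units]
    return argv


def publish(name, value, metric_type="uint32", units=""):
    """Send one sample; raises CalledProcessError if gmetric reports failure."""
    subprocess.run(gmetric_argv(name, value, metric_type, units), check=True)


# Usage sketch, with a hypothetical metric name and value:
#   publish("disk_reads", 42, units="ops")
```

A script like this is typically run periodically (for example from cron) on each node that should report the metric.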
Re: [Ganglia-general] show hostnames only?
On Tue, Aug 26, 2008 at 10:33:48PM -0400, Jesse Becker wrote: On Mon, Aug 25, 2008 at 08:09, Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote:

It has been proposed for backporting to 3.1.x--just need one more person to sign off on it.

It was proposed by me for backporting on Aug 10th and has always had my vote (only 2 are needed), so if you were OK with your original implementation it should have been ready for backport.

Actually, I made several updates to the 3.1.x STATUS file in r1716, which you then removed in r1721.

Jesse,

Let me clarify my statement then with a clear timeline to avoid misunderstandings:

Jun 23rd: Raised the problem as part of testing a backport proposal that reverted the title/subtitle of the graphs to the style used in 3.0.x to work around very long subtitles that were clobbered
Jun 24th: You develop r1460 and commit it to trunk
Jun 24th: I propose some changes and provide a patch with enhancements, including a backport for 3.1.0
Jun 24th: Bernard says the patch isn't ready and you agree to work on a fix based on some private discussion that wasn't summarized (*)
Jun 27th: I provide a fix for the only known open issue after a couple of days of debate that showed the patch's intention was misunderstood
Jul 15th: 3.1.0 is released for testing without this patch
Jul 30th: 3.1.0 officially released
Aug 10th: I add a backport proposal so it will hopefully be discussed and released with 3.1.1, because even if there are any issues (none are known or communicated yet, except for the minor formatting issue with a fix proposed since Jun 27th) it is better than nothing
Aug 15th: Brad proposes a freeze and all proposals (including this one) are punted (with your explicit approval) to 3.1.2
Aug 22nd: You vote for this proposal for backport (too late for 3.1.1 though)
Aug 25th: I commit enhancements to it that hopefully cover all issues (they do cover all known issues at least, because they avoid the only known issue that the Jun 27th proposal was meant to fix and sync the modular graph code with what we have in 3.1 to ease backporting) and restart voting for the enhanced solution instead, so it will hopefully be part of 3.1.2
Aug 25th: 3.1.1 is released for testing

That means that as soon as 3.1.1 is released (which should be in a couple of weeks) and 3.1.2 is open for development, you should be able to commit this backport if you agree with the currently proposed solution, as there is no need to wait for another vote (2 is all that is needed, and mine is already there and was always there).

Of course, feel free to improve upon it and restart the backporting process if needed, or to split the proposal into different pieces (one that already has two votes and another that has only mine for now and builds upon it), or even to throw it all away and come up with a more elegant solution; after all it is your code, the frontend is your area of expertise, and you clearly understand what issue needs to be solved with it.

Count on me to review whatever is proposed in the end so that this fix can get to our users in some future release, as IMHO a bugfix or useful enhancement like this one (which has also been in our wishlist since before 3.1.0) shouldn't be kept from them just because of procedural issues.

And before someone else misunderstands that last sentence as some kind of political statement or rant, let me clarify that I don't mean to argue about procedure here or its validity, just that I don't think it should be used as an argument or justification for not delivering solutions to the ganglia community, as your original comment seemed to imply.
Carlo

(*) http://www.mail-archive.com/[EMAIL PROTECTED]/msg04355.html
[Ganglia-general] using an arbitrary UUID instead of FQDN for storing metrics (was Re: show hostnames only?)
On Tue, Aug 26, 2008 at 09:56:21PM -0400, Ofer Inbar wrote:

I see some discussion of moving away from using hostnames as directory names to store RRDs in.

I think that is a misunderstanding of the proposed enhancement to make ganglia a little more resilient to DNS failures or mismatched multihoming interfaces, which was the driver behind the original wishlist item.

I hope very much that doesn't happen!

Agree, as the current KISS scheme is scalable (to a certain extent), easy to manage and intuitive, and a key differentiator from other solutions that weren't as lucky as us to have Matt doing the original design. If it ever happens, it should most likely be considered only as an option, while the current scheme will be supported forever.

The possibility of implementing a different scheme than the current one, which uses an on-disk tree of clusters/nodes, is available (even if probably not yet complete and missing any needed frontend logic to support it) in our development version (trunk or 3.2.x) through the python gmetad rewrite and its modular architecture.

It is very very useful to be able to ls the rrd directories to see what hosts gmetad sees, which ones it thinks are in which cluster; to run finds or ls's to see the datestamps on RRD files and immediately connect them to hosts; and to be able to rename hosts and rename their corresponding directories easily.

Agree, and from what I've seen it scales well to even thousands of nodes if enough IO is provided, either through very fast disks, multiple mount points, RAIDs, SAN or even hacky alternatives like a file based loopback device or ramfs.

Some people might find it more useful (even if it will be prohibitively more expensive) to store the metrics in some sort of DBMS and use an SQL interface instead of simple filesystem commands, or will have problems in their setup that prevent them from having a unique FQDN, generating conflicts, and so this is just an option for them.
Don't worry, the core principles for ganglia of Simplicity, Efficiency and Correctness are still valid; after all, this all started as a way to monitor HPC clusters on a global scale and we were still aiming for world domination last time I checked ;).

Carlo
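The kind of inspection described above (listing the rrd directories to see which hosts gmetad knows about, in which cluster, and how fresh their files are) can be sketched in Python as well. This is a minimal sketch under assumptions: it expects the `<root>/<cluster>/<host>/<metric>.rrd` layout discussed in this thread, skips the `__SummaryInfo__` summary directories that gmetad is understood to create, and `hosts_by_cluster` is a hypothetical helper name. The root path in the usage comment is a common default, not something stated in the thread.

```python
import os

SUMMARY_DIR = "__SummaryInfo__"  # assumed name of gmetad's per-level summary directory


def hosts_by_cluster(rrd_root):
    """Map cluster -> {host: newest .rrd modification time, or None if no files}."""
    result = {}
    for cluster in sorted(os.listdir(rrd_root)):
        cluster_dir = os.path.join(rrd_root, cluster)
        if cluster == SUMMARY_DIR or not os.path.isdir(cluster_dir):
            continue
        hosts = {}
        for host in sorted(os.listdir(cluster_dir)):
            host_dir = os.path.join(cluster_dir, host)
            if host == SUMMARY_DIR or not os.path.isdir(host_dir):
                continue
            mtimes = [
                os.path.getmtime(os.path.join(host_dir, name))
                for name in os.listdir(host_dir)
                if name.endswith(".rrd")
            ]
            hosts[host] = max(mtimes, default=None)
        result[cluster] = hosts
    return result


# Usage sketch (the path is an assumption; check rrd_rootdir in gmetad.conf):
#   for cluster, hosts in hosts_by_cluster("/var/lib/ganglia/rrds").items():
#       print(cluster, sorted(hosts))
```

A host whose newest timestamp is far in the past is a quick hint that gmetad has stopped receiving data for it.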