Re: [Ganglia-general] [Ganglia-developers] [ANNOUNCEMENT] Ganglia meetup Tue Oct 21 in San Francisco (Quantcast HQ)

2014-10-21 Thread Carlo Marcelo Arenas Belon
Also forgot to mention we have a code for FREE parking thanks to
our friends at zirx [1], so if you are driving to SF for this
meeting (like I am planning to do) all you really need to do is install
their app, set the right address for Quantcast on the map, and tap
the price to enter your code: GANGLIA. Someone will then be waiting
for you at the door to take your car to a safe place.

see you all on the other side, and let's have fun

Carlo

[1] http://zirx.com/

PS. you need an iPhone or Android phone to use their app though

--
___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general


Re: [Ganglia-general] Having to restart gmond on sender nodes if a collector node restarts

2012-05-04 Thread Carlo Marcelo Arenas Belon
if you are using unicast with ganglia 3.1.0 or higher, you MUST set
send_metadata_interval to the longest gap (in seconds) you can tolerate
before the senders recover on their own after a collector restart, the
way 3.0 (and lower) used to do automatically.

this feature is documented at the bottom (the fifth bullet point under
Important Notes) of the following (no longer maintained) page:

  http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_release_notes
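
as a minimal sketch (the value 60 is only an example, taken from what other
people on the list report using, not a recommendation), the setting goes in
the globals section of gmond.conf on each sender:

```
globals {
  # keep whatever other globals settings you already have
  send_metadata_interval = 60  # seconds between metadata resends; example value
}
```

gmond has to be restarted on the senders for the change to take effect.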

Carlo

On Fri, May 04, 2012 at 02:30:31PM -0700, Gilles Devaux wrote:
 Have you tried setting
 
 send_metadata_interval = Something
 
 in the gmond.conf (globals section) ?
 
 On Fri, May 4, 2012 at 1:58 PM, Pronko, Eric pron...@upmc.edu wrote:
  I'd be interested in learning anything you hear back if you don't mind.
 
  It's not the worst thing but if it goes unnoticed it could be problematic.
 
  Thank you
  
  From: Jochen Hein [joc...@jochen.org]
  Sent: Friday, May 04, 2012 4:23 PM
  To: Pronko, Eric
  Cc: ganglia-general@lists.sourceforge.net
  Subject: Re: [Ganglia-general] Having to restart gmond on sender nodes if a 
  collector node restarts
 
  Pronko, Eric pron...@upmc.edu writes:
 
  I have run into a situation where I restart a GMOND collector node,
  and then all sending node data isn't received until I also restart
  GMOND on those nodes. Is this intentional - is there any way to work
  around this so that I can safely reboot a collector node without
  having to restart GMOND on all senders?
 
  I have that issue too. We are running gmond 3.1.7 on AIX (the packages
  of Michael Perzl) in unicast mode. I've already talked to Michael about
  it, but he has no idea. I'm going to try running gmond under
  strace/truss and maybe we can see something.
 
  Jochen
 
  --
  The only problem with troubleshooting is that the trouble shoots back.
 
 


Re: [Ganglia-general] Having to restart gmond on sender nodes if a collector node restarts

2012-05-04 Thread Carlo Marcelo Arenas Belon
On Fri, May 04, 2012 at 04:47:22PM -0700, Steven A. DuChene wrote:
 OK, but what is a reasonable value to set this to?

there is none, which is why there is no default.

ideally you shouldn't need it, just like it wasn't needed before 3.1, but
the protocol had to change to support metadata information, and this
setting was the only workaround available once unicast configurations
were reported broken by that change.

once a better solution is found (which will likely break the gmond protocol
again) it shouldn't be needed, but meanwhile telling gmond to periodically
resend the metadata information for metrics is the only way to deal with a
collector that went down and, once back, starts receiving (and ignoring)
metric data because it has no matching metadata to use.

 30 seconds? 2minutes? 10minutes?

it depends on how much bandwidth/contexts you are willing to sacrifice to
get the data you need after a collector failure, and therefore on
which kind of setup you are running ganglia in.

in HPC (where ganglia started) this might be a showstopper and prevent
some people from upgrading past 3.0, but in IT-like environments I've seen
people use 60 and hope the extra work gets lost in the noise.

I'd recommend you do your own measurements and decide based on:

* number of nodes you have
* number of metrics per node
* how much metadata there is for each metric
* how much CPU/bandwidth you can spare on each node for gmond
* how many collectors you have and how many gmonds they are all serving (peak)
* how much CPU/bandwidth each collector can use
* how sensitive you are to data holes, compared with just restarting gmond instead
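
as a rough back-of-the-envelope sketch of that measurement (every name here
and the cost model itself are assumptions, not anything gmond provides: it
assumes one metadata packet per metric per interval, sent to every collector,
and it ignores UDP/IP header overhead and CPU cost):

```python
def metadata_resend_cost(nodes, metrics_per_node, metadata_bytes,
                         interval_s, collectors=1):
    """Estimate the steady-state traffic caused by resending metadata.

    Hypothetical model: every node resends one metadata packet of
    `metadata_bytes` per metric every `interval_s` seconds to each collector.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be > 0 (no periodic resend otherwise)")
    # Metadata packets per second that a single node sends to one collector.
    packets_per_s = metrics_per_node / interval_s
    node_bps = packets_per_s * metadata_bytes * 8 * collectors
    collector_bps = nodes * packets_per_s * metadata_bytes * 8
    return {
        "node_bits_per_s": node_bps,            # sent by each node
        "collector_bits_per_s": collector_bps,  # received by each collector
        "worst_case_gap_s": interval_s,         # data hole after a collector restart
    }
```

calling it for a few candidate intervals, for example
metadata_resend_cost(500, 40, 200, 60) with made-up cluster numbers, shows
the trade-off between traffic and the length of the data hole.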

Carlo



Re: [Ganglia-general] [Ganglia-developers] 3.3.5 released today

2012-04-13 Thread Carlo Marcelo Arenas Belon
On Thu, Apr 05, 2012 at 12:52:21AM +0200, Daniel Pocock wrote:
 
 A number of bugs were found during the testing of 3.3.5 and discussed on
 the mailing lists.

could a list of these bugs be published somewhere with the release,
so that everyone knows what to expect when upgrading? (most people are
probably still using a patched 3.1.7, as that is what most
distributions provide)

off the top of my head there are:

* 2 memory leaks (one probably only in deaf mode)
* gmetad hierarchical mode is broken

  In other words, anyone who is using 3.3.1 or 3.3.0 should not get any
 new bugs from upgrading to 3.3.5

considering that 3.3.5 doubled the in-memory size of each metric, it is
likely to make the memory leak problems worse though

Carlo



Re: [Ganglia-general] Ganglia for Windows

2012-03-28 Thread Carlo Marcelo Arenas Belon
On Wed, Mar 28, 2012 at 12:49:00PM +0100, Burton, Steven wrote:
 
 Should I try those binaries or should I build a more recent version and
 if so, what version?

3.0 and 3.1 are not compatible, so you can either:

1) build new binaries for 3.1.7 or newer and deploy them on windows
2) downgrade your server to 3.0.7 (note that you need a couple of patches
   on top of it for security)

Carlo



Re: [Ganglia-general] tcpconn.py and netstat

2012-02-28 Thread Carlo Marcelo Arenas Belon
On Mon, Feb 27, 2012 at 11:12:39AM -0500, Chris Burroughs wrote:
 Currently tcpconn.py uses netstat to get its socket stats.  This gives
 lots of detail but is far too slow for much production use (running
 netstat can take many minutes).

tcpconn.py was originally written as an advanced python module example,
as it shows how to write multithreaded modules and how to poll metric
information from an external source.

it was found useful enough that some people enabled it where it made
sense for their environments (definitely not in HPC or high traffic),
but since it is a module it can be replaced with something that fits
your environment better, like the proposal you had.

 /proc/net/sockstat gives less
 information but has no performance problems.  There was a suggestion
 previously to use the ss command, but (1) it's less common (at least not
 part of the default on RHEL5) and (2) it also lacks the high fidelity
 details.
 
 Is there any other reason to prefer ss over cat?  Should this replace
 tcpconn, or be a new module?

most likely (for the reasons explained above) it would be a new module,
and if you are concerned about performance, most likely in C
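
to illustrate how little parsing /proc/net/sockstat needs, here is a minimal
Python sketch (Python only for brevity, it is not the C module suggested
above). it assumes the usual Linux layout of one "PROTO: name value name
value ..." line per protocol; the function names and the "proto_name" key
scheme are made up for this example, and it is not wired into the gmond
module interface:

```python
def parse_sockstat(text):
    """Turn the contents of /proc/net/sockstat into a flat dict of ints."""
    stats = {}
    for line in text.splitlines():
        proto, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not a "PROTO: ..." line
        fields = rest.split()
        # Fields alternate between a counter name and its value.
        for name, value in zip(fields[0::2], fields[1::2]):
            stats[f"{proto.strip().lower()}_{name}"] = int(value)
    return stats


def read_sockstat(path="/proc/net/sockstat"):
    """Read and parse the kernel's socket summary (Linux only)."""
    with open(path) as handle:
        return parse_sockstat(handle.read())
```

on a Linux box, read_sockstat() returns entries such as tcp_inuse or tcp_tw
in a single cheap file read, which is the whole point compared with netstat.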

Carlo  



Re: [Ganglia-general] gmond too many open files

2012-02-26 Thread Carlo Marcelo Arenas Belon
On Fri, Feb 24, 2012 at 04:30:09PM -0800, M. Leong Lists wrote:
 
 Is this a bug in the app not closing those files

most likely a module, but to pinpoint which one you would need (assuming
you are running linux) the output of:

  $ sudo lsof -n -p `pidof gmond`
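
if lsof is not installed, roughly the same information can be read from
/proc on linux. the sketch below is only an illustration (the function names
are made up, and it needs to run as root or as the user gmond runs as); it
groups the descriptors by what they point to, so a leak shows up as one
entry with a large count:

```python
import os
from collections import Counter


def fd_targets(pid):
    """Return what each open file descriptor of `pid` points to (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for fd in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, fd)))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return targets


def summarize(targets, top=10):
    """Count descriptors per target, most frequent first."""
    # Sockets and pipes look like "socket:[12345]"; drop the inode so they group.
    kinds = [t.split(":[")[0] if ":[" in t else t for t in targets]
    return Counter(kinds).most_common(top)
```

for example summarize(fd_targets(1234)), where 1234 stands for the pid
printed by pidof gmond.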

Carlo



Re: [Ganglia-general] new Web interface ETA?

2011-05-10 Thread Carlo Marcelo Arenas Belon
On Tue, May 10, 2011 at 08:28:53AM -0500, Alex Dean wrote:
 
 Is anyone using lighttpd?  I'm sure we can come up with some configuration 
 instructions for it as well if desired.

yes, most users of lighttpd would run PHP through the FastCGI interface,
but that shouldn't make much difference to the way the web application is
configured.

having simple access to the tip of development in a packageable form
(snapshots), together with instructions on how to obtain it, would most
likely help to increase testing and stabilize it though.

along those lines, would this be a good candidate for a snapshot?

  https://github.com/vvuksan/ganglia-misc/tarball/master

and where is the documentation for it being collected?

Carlo



Re: [Ganglia-general] gmetric and groups

2011-04-21 Thread Carlo Marcelo Arenas Belon
On Thu, Apr 21, 2011 at 11:09:12AM -0400, Jesse Becker wrote:
 On Thu, Apr 21, 2011 at 10:11, Rushton Martin jmrush...@qinetiq.com wrote:
  Is there any way to define which groups a gmetric collected statistic is
  put into? I cannot see any way defined on the man page, and the mail
 
 So far as I know, this isn't supported by gmetric.  Groups were added
 after the core gmetric code was written, and that ability hasn't been
 added back to gmetric (yet).

gmetric in trunk has that functionality. it hasn't been backported yet, but
it is available and could easily be added (indeed there is even a backport
proposal for it that was voted to be merged but hasn't been yet), so getting
a ~3.1.7 gmetric with this functionality is very straightforward.

Carlo



Re: [Ganglia-general] How to locallize the web-frontend?

2011-03-22 Thread Carlo Marcelo Arenas Belon
On Tue, Mar 22, 2011 at 01:43:02PM +0800,  wrote:
 Hey,
 I am Chinese and I like ganglia very much.
 So I want to translate the web-frontend from English to Chinese. How can I do that?

That would be awesome indeed, but I am afraid that the PHP code used in the
frontend is not really structured for doing localization work in a
non-programmatic way (like using GNU gettext and message catalogs).

If all you need is a Chinese version of the frontend and you have some basic
PHP skills, then it should be straightforward to translate the text in the
frontend by editing the template and PHP files, but the result would be very
difficult to integrate back into ganglia for future changes.

if you have some more advanced PHP skills, then adapting the code to be L10N
friendly would be a better approach, though it might also require some deeper
changes in gmond/gmetad, for which C/Python skills would be needed.

what is your background, and which would you be more interested in doing?

Carlo



Re: [Ganglia-general] Master and Clients not visible at the same time

2010-07-05 Thread Carlo Marcelo Arenas Belon
On Mon, Jul 05, 2010 at 10:49:22AM +0200, Wim De Geeter wrote:
 Any one an idea why ??

because the master and the nodes are in 2 different clusters, for any
of the following reasons :

0) they are in different network segments and running by default in
   multicast, so packets between them are getting dropped at your
   switch
1) you are using different multicast addresses on their configuration
2) you have some other kind of firewall that is blocking packets either
   way.

if you really want them all in the same cluster, you probably want to
change the configuration in the nodes/master to use unicast and point
them to the master.

the manual page for gmond.conf, the README and some of the links on the
wiki are good sources for how to do that, but it is usually as simple
as changing udp_send_channel and udp_recv_channel as shown below and
restarting your gmond.

  udp_send_channel {
    host = master
    port = 8649
  }

  udp_recv_channel {
    port = 8649
  }

Carlo



Re: [Ganglia-general] newbie install of 3.1.7

2010-06-26 Thread Carlo Marcelo Arenas Belon
On Tue, Jun 22, 2010 at 11:16:54AM -0700, Deb Heller-Evans wrote:
 
 In our set up, I am configuring gmond for unicast communication, and have set 
 up gmond.conf on the nodes to have the following:

  52 udp_recv_channel {
 --  53   host = 198.129.76.131
  54   port = 8649
  55 }
  56 
 
 BUT, when starting gmond on the node, gmond complains:
 
 [108#] service gmond start
 Starting GANGLIA gmond: /etc/ganglia/gmond.conf:53: no such option 'host'
 Parse error for '/etc/ganglia/gmond.conf'
 [FAILED]
 
 I'm a little puzzled by this.  Could someone point me in the right direction?

man gmond.conf would show you that there is no host option for
udp_recv_channel; the option you are probably looking for is bind, which
tells ganglia to bind the unicast listener to a specific IP.
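
a minimal sketch, assuming 198.129.76.131 is an address configured on the
node itself (if it is actually your collector, it belongs in host under
udp_send_channel instead):

```
udp_recv_channel {
  bind = 198.129.76.131  # local IP to listen on; an assumption, see above
  port = 8649
}
```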

Carlo



Re: [Ganglia-general] Ganglia + Windows - Compilation Problems?

2010-06-26 Thread Carlo Marcelo Arenas Belon
On Wed, Jun 23, 2010 at 01:40:52PM -0500, Douglas Wagner wrote:
 
 So I build libconfuse on Cygwin on my local XP development box and it gets
 stuck into /usr/local/* (lib, include, etc.).

is it libconfuse 2.7 compiled as a static library with no NLS support, as
suggested in README.WIN? is this using cygwin 1.5 on 32bit windows, or are
you using 1.7?

 Come back around (according to the README.WIN and tell ganglia to compile
 --with-libconfuse=/usr/local and it blows up telling me it can't find
 libconfuse.

config.log would explain why, but I hope it is not that you are trying to
build it for 64bit windows.

 Linking everything into /usr/lib doesn't help either.  I've
 seen docs on this but assumed it was supposed to be fixed in 3.1.2.

not sure what you are referring to here, but are you trying to build 3.1.7?
I noticed that README.WIN does not mention the need to override
sysconfdir (which is irrelevant for cygwin anyway) and was not completely
updated when the libpcre dependency (which also changed name recently in
cygwin) was added for that release, but it used to work at least with 3.1.4
from what I remember, and therefore probably also with 3.1.2.

the following seemed to work for me on an updated windows vista laptop I
had access to, with the latest cygwin (mostly using instructions
from README.WIN, but against its recommendation of sticking with 1.5,
which therefore requires some patching):

  # build and install libconfuse without NLS support
  $ tar -xvzf confuse-2.7.tar.gz
  $ cd confuse-2.7
  $ ./configure --disable-nls
  $ make
  $ make install
  $ cd ..
  $ tar -xvzf ganglia-3.1.7.tar.gz
  $ cd ganglia-3.1.7
  # patch for cygwin 1.7: include cygwin/in.h before every rpc/rpc.h
  $ find . -type f -name "*.h" -a ! -name config.h \
      -exec fgrep -l "rpc/rpc.h" {} \; | \
      xargs -n1 perl -pi -e \
      's;#include <rpc/rpc.h>;#include <cygwin/in.h>\n#include <rpc/rpc.h>;g'
  $ ./configure GANGLIA_ACK_SYSCONFDIR=1 --with-libconfuse=/usr/local \
      --enable-static-build
  $ make
  $ cd ..
  # collect the binaries to deploy
  $ mkdir dist
  $ cp -a ganglia-3.1.7/gmond/gmond.exe dist/
  $ cp -a ganglia-3.1.7/gmetric/gmetric.exe dist/
  $ cp -a ganglia-3.1.7/gstat/gstat.exe dist/
  # clean up
  $ cd confuse-2.7
  $ make uninstall
  $ cd ..
  $ rm -rf confuse* ganglia*

the binaries in dist will need to be installed on the other nodes, probably
together with the cygwin DLLs they were built against if cygwin
won't be installed independently (cygwin1.dll, cygapr-1-0.dll, cygexpat-1.dll,
cygpcre-0.dll, and libpython2.6.dll).

the following dependencies were installed as prerequisites on the system 
that was used for building this package (listed with `cygcheck.exe -c -d`) :

  diffutils        2.9-1
  expat            2.0.1-1
  libexpat1        2.0.1-1
  libexpat1-devel  2.0.1-1
  gcc              3.4.4-999
  gcc-core         3.4.4-999
  gcc-g++          3.4.4-999
  gcc-mingw-core   20050522-1
  gcc-mingw-g++    20050522-1
  libgcc1          4.3.4-3
  libapr1          1.4.2-1
  libapr1-devel    1.4.2-1
  make             3.81-2
  libpcre-devel    8.02-1
  libpcre0         8.02-1
  python           2.6.5-2
  sharutils        4.8-1
  sunrpc           4.0-3
Carlo



Re: [Ganglia-general] ganglia monitors does not work for some of the clusters

2010-06-26 Thread Carlo Marcelo Arenas Belon
On Thu, Jun 24, 2010 at 07:37:25AM +0200, Raimund Eimann wrote:
 
 I have exactly the same issue with version 3.1.7. When I restart gmond on
 the affected nodes, their graphs work for some time (1-2 days typically). I
 use CentOS 5.{4,5} on my nodes. Usually the problem does not affect a
 cluster as a whole, but only a large number of nodes in the cluster (for
 insance, for 14 out of 17 nodes nothing gets displayed).

are you using multicast or unicast? does setting send_metadata_interval to
60 or some other non-zero value help?

Carlo



Re: [Ganglia-general] Gmond udp_send_channel using the wrong network (seems hostname related)

2010-06-26 Thread Carlo Marcelo Arenas Belon
On Thu, Jun 24, 2010 at 10:21:53AM +, Ronny wrote:
 
 I am facing the problem, that my gmond udp_send_channels sends via the wrong 
 network interface on a multi homed linux machine.

there is some information on multihomed setups in the README which could
help.

 The machines have a front NIC and an backend NIC. Both IPs from the NICs get 
 resolved by the name service, but the primary IP's dns name is the system's 
 hostname (with an IP address out of 62.48.x.x)
 
 In my clients gmond.conf I have set:
 
 udp_send_channel {
   bind_hostname = yes # Highly recommended, soon to be default.
# This option tells gmond to use a source address
# that resolves to the machine's hostname.  Without
# this, the metrics may appear to come from any
# interface and the DNS names associated with
# those IPs will be used to create the RRDs.
   host = 10.0.11.16
   port = 8649
   ttl = 1
 
 }
 
 whereby 10.0.11.16 is the backend network.
 
 But this gmond seems to ignore to use 10.0.11.16 and sends via the
 primary IP address 62.48.x.x to the udp_receive_channel located on
 another host. A firewall between send_channel and receiver channel
 machines using 62.48.x.x is blocking that traffic. I can't currently
 open the firewall.

that is what bind_hostname is meant to do AFAIK; maybe you would like to use
bind = 10.0.11.16 instead (host should point to your collector if using
unicast, so host and bind should most of the time be different IPs in
10.0.11.x, unlike in this example)
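
a sketch of what that could look like, assuming 10.0.11.16 is this machine's
own backend address and using 10.0.11.1 as a purely hypothetical collector
address. bind_hostname is left out here because, as far as I know, the
gmond.conf man page describes it as mutually exclusive with bind:

```
udp_send_channel {
  bind = 10.0.11.16  # local backend IP to send from
  host = 10.0.11.1   # hypothetical collector on the backend network
  port = 8649
  ttl = 1
}
```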

Carlo



Re: [Ganglia-general] Gmond udp_send_channel using the wrong network (seems hostname related)

2010-06-26 Thread Carlo Marcelo Arenas Belon
On Sat, Jun 26, 2010 at 03:29:17PM -0400, Vladimir Vuksan wrote:

 More than 4 years ago I reported a bug regarding gmond not honoring
 mcast_If setting 
 
   http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=94

mcast_if should be working fine in 3.0 since 3.0.5, could you confirm
that? you should now be able to force multicast traffic to go through
a specific interface by adding mcast_if to the corresponding
udp_send_channel section.

it was broken again in 3.1 though, and while it was fixed for
3.1.2 as shown by BUG140, you would need 3.1.7 for a full fix and for the
set of directives meant to help control every part of this functionality,
including the IP that is used as the source (which is what
bind and bind_hostname are for), independently of the interface or IPv4
routing.
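
for reference, a minimal multicast sketch with mcast_if set. the interface
name eth1 and the group address are only placeholders, so substitute your
own:

```
udp_send_channel {
  mcast_join = 239.2.11.71  # example multicast group
  mcast_if = eth1           # hypothetical interface name
  port = 8649
  ttl = 1
}
```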

 We resolved it by adding a route. It would seem that in unicast mode
 this should require no changes. Can you send us what your routing table
 looks like ?

unicast could use a different IP as the source if instructed to do so
by explicitly binding to it, or to the resolvable hostname as the
originally reported configuration seems to do.

agreed though that the documentation is a little thin around all of this
(there is also some complementary explanation in the README), especially
with 3.1.7, which now has several overriding settings that affect it
(routing, mcast_if, and bind/bind_hostname)

Carlo



Re: [Ganglia-general] mcast_if in ganglia 3.1.2 using the wrong source address

2010-01-09 Thread Carlo Marcelo Arenas Belon
On Sat, Jan 09, 2010 at 01:50:42PM +0100, Stefan Ott wrote:
 
 I'm not sure whether this has been reported yet but since I couldn't
 find a report, I assume it hasn't.

it has been reported [1] and a fix is available and committed in trunk
in r2121 [2] and should hopefully be part of 3.1.6 when released.

 I'm using ganglia 3.1.2 and I have the following issue: if I set
 mcast_if in my udp_send_channel section, the packets *do* use the
 right interface, but the wrong source address (they use the address
 from the other interface). If I manually add a route to 239.2.11.72
 (my multicast group) it works.

this is also poorly documented (there is an obsolete note in the README
under the "How should I configure multihomed machines" section).

if you are using Linux, one way to work around the issue is to add
a static route to the multicast IP or network through the interface you
configured in mcast_if.

if you are using Solaris [3], you need to either apply a patch to your
ganglia, or wait for 3.1.6 and then use the bind or bind_hostname
parameters for udp_send_channel or rely on the fixed mcast_if parameter,
which is still not backported.

if you are using something else and still have the issue, it would be a
good idea to try either workaround/solution and report back, so we can
modify the fix further to accommodate your case as well (more changes
will be needed in any case to document this issue correctly).

3.0 is suspected not to be affected, but if anyone can confirm/deny that
then we could also apply this fix for 3.0.8 and close BUG94 [4] if it is
still open.

Carlo

[1] http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=140
[2] http://ganglia.svn.sourceforge.net/viewvc/ganglia?view=rev&revision=2121
[3] http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg05230.html
[4] http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=94



Re: [Ganglia-general] Extending the format of gmetad.conf

2010-01-04 Thread Carlo Marcelo Arenas Belon
On Mon, Dec 28, 2009 at 10:16:28PM +, Daniel Pocock wrote:
 
 I'm looking at extending the gmetad.conf format, while still making sure 
 that it can read the existing config files.

adding a new configuration option would be the easiest way to avoid
a backward-incompatible change, which would otherwise force this feature
to be 3.2+ only.

 My goal is to allow different sets of RRAs for different sources, while 
 making sure the existing file format remains valid.

why do you want to have this? what is the use case for having different
metric storage frequencies per cluster, and why can't it be done by running
independent gmetads instead?

if you are talking about different metric storage frequencies per metric,
as seems to be implied later (and which is a feature long on the wishlist),
then wouldn't it be safe to assume you want that storage for that metric
regardless of source? if that is the case it will simplify the implementation
and will only require something like RRAs_template as shown in d, and not need
a, b, or c at all (or at least not as part of the first implementation).

currently the polling interval in data_source is optional, so in the long
run the template to apply could be made optional in the same way, but that
would complicate the configuration parser for, IMHO, no really good reason.
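
for context, this is roughly what the existing syntax looks like in
gmetad.conf today. cluster names, hosts and RRA values are placeholders
chosen for illustration, and nothing here shows the proposed template
syntax:

```
# polling interval omitted: gmetad falls back to its default
data_source "cluster a" node1:8649 node2:8649

# polling interval given explicitly (in seconds)
data_source "cluster b" 60 node3:8649

# a single global set of round-robin archives, applied to every metric
RRAs "RRA:AVERAGE:0.5:1:244" "RRA:AVERAGE:0.5:24:244"
```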

using a script is definitely interesting because of the flexibility it allows
for, but as mentioned before it is a problem because of the additional forking
required, and also because it keeps part of the logic outside
gmetad.

Carlo



Re: [Ganglia-general] Extending the format of gmetad.conf

2010-01-04 Thread Carlo Marcelo Arenas Belon
On Mon, Jan 04, 2010 at 06:55:36AM -0500, Jesse Becker wrote:
 On Mon, Jan 4, 2010 at 03:46, Carlo Marcelo Arenas Belon
 care...@sajinet.com.pe wrote:
  My goal is to allow different sets of RRAs for different sources, while
  making sure the existing file format remains valid.
 
  why do you want to have this? what is the use case for having different
  metric storage frequencies per cluster and why can't be done by having
  instead independent gmetad?
 
 I can think of reasons why you'd want different frequencies for the
 same metric, mostly having to do with required data retention policies
 and lack of resources (disk space).  It could be done with different
 gmetad processes, but that gets complicated for a simple cluster
 (multiple gmetad polling the same gmond, the same data is displayed in
 two different locations).

Of course I can think of reasons why that might be something you would want
to have, and that is why I said it might be needed in the long run if those
reasons are genuine, but I would be surprised if there is a reason good enough
to do that from the very beginning when using multiple gmetads would solve it
for now IMHO.

The point is that the syntactic sugar to make that work would be far more
complicated and difficult to do in a 3.1 compatible way than just adding
templates, and therefore I tend to believe that it would make more
sense as a 3.2 feature, while having different RRAs independently of the
data source has been on the wishlist since even before 3.1.0 was released
and would be something you would instead want backported ASAP (probably
even to 3.0 if there is demand)

  if you are talking about different metric storage frequencies per metric
  as it seems to be implied later (and which is a feature long in the
  wishlist) then wouldn't it be safe to assume you want that storage for
  that metric regardless of source?, if that is the case it will simplify
  the implementation and will only require something like RRAs_template as
  shown in d and not need a, b, or c at all (or at least not as part of the
  first implementation).
 
  currently in data_source the polling interval is optional and so the same
  could be done with the template to apply in the long run, but complicating
  the configuration parser, for IMHO no really good reason.
 
  using a script is definitely interesting because of the flexibility it
  allows for, but as mentioned before a problem because of the additional
  forking required and also problematic because it will keep part of the
  logic outside gmetad.
 
 Perhaps I'm misunderstanding how using a separate script would work,
 but there would only be a fork storm during initial RRD creation,
 correct?

it depends on what the script does, but that is correct in the case that the
script is only returning the RRAs back to gmetad as you suggested.

the disadvantage (as mentioned above) of not being able to know from just
reading gmetad.conf which RRAs apply in each case still applies, and
would probably imply that the best way to do this will be to make gmetad
modular (just like gmond) and then allow it to write its own configuration
or use one by default that could be used as a starting point just like
`gmond -t` allows for.

  I had assumed that the current behavior of keep existing
 RRD file would remain.  Thus, the only time we would really have to
 worry about forking off hundreds/thousands of processes would be
 when a new cluster is created, or when the RRD files are all removed
 for some reason.  Under normal operating circumstances, the RRD files
 already exist, so there's no need to run the creation script.

or when gmetad is restarted and has to figure out again which RRAs apply in
each case for the updates, unless gmetad.conf has all that information
somehow in a static way (by using, for example, the modular solution instead
of just a script).

Carlo



Re: [Ganglia-general] Debugging problems with Ganglia

2009-12-18 Thread Carlo Marcelo Arenas Belon
On Thu, Dec 17, 2009 at 01:58:31PM -0500, Douglas Wade Needham wrote:
 
 Here are the details:
   Version:  3.1.2
   Host:     VM running Debian Etch (1 VCPU, 512 MB) w/ 2.6.26-2-xen-amd64 kernel
   HW Host:  Dual AMD Opteron 244's running at 1.8GHz

Assume here Host really means the virtual server that is allocated for this
task and therefore not to be confused with HW Host below which is also
running Debian as a virtual server host, and all of them are stock debian
etch, correct?

I'd seen severe IO starvation issues (including panics) using the stock
debian guest kernel which only resolved themselves after upgrading it to
2.6.32 (here guest refers to the kernel used in the virtual machine),
not to the kernel in the server that hosts all the VM machines and that
will be more difficult to upgrade.

 When we have a program connect across our 1Gbps network connection to
 this gmetad, we end up with very gappy data, if the hosts don't just
 get marked as down and the RRDs stop updating.  I have already started
 pressuring those who would approve moving our RRDs to a memory fs,
 but in the meantime... :(

IMHO this is your only solution considering the IO issues that VMs have
and the way gmetad scales and uses RRD (which is also very IO unfriendly),
even though it has been mentioned that rrdcached could help somehow.

 I have been running straces which have indicated that we occasionally
 have threads which block on the futex() call for 10+ seconds, and
 occasionally for as long as 500+ seconds.  To limit the impact of the
 strace (which itself can cause the same problem), I even had to do:
 
   strace -f -tt -T -s 160 -e trace=process,futex,signal -o gmetad.strace.out2 -p 12618

interesting, and I suspect IO related most likely but you would be probably
able to get a better picture using instead a pthread specific tracer like
mutrace (warning, fairly new code and only packaged for Fedora 12 AFAIK) :

  http://git.0pointer.de/?p=mutrace.git

 But in doing this, I have come up with the following questions:
 
 1) Is there any difference between '-d 1' and '-d 10'?  Or between
'debug 1' and 'debug 10' in the config file?
 
In looking through the code, it does not seem to be the case.  I
would just like confirmation.

not for gmetad AFAIK, but there are several arbitrary uses of debug_level
which usually mean you want to use the highest level possible most of the
time anyway.

 2) Am I seeing correctly that we have the following pthread_mutex
definitions?
 
- server_socket_mutex
- server_interactive_mutex
- Allocated mutex for root summary.
- Allocated mutex for each grid partial-summary (1 per data source)
- Allocated mutex for each cluster partial summary (1+ per data source)

there is also an rrd_mutex for updating the RRD, and I would recommend keeping
away from the multiple summary mutexes if you want to keep your sanity.

 3) Would there be any interests in patches against 3.1.2 to watch
calls to pthread_mutex_lock() and pthread_mutex_unlock() to display
when a call took more than a certain amount of time to return, or
if a lock was held for longer than a certain time??

definitely interesting, preferably if it is enabled at compile time
to avoid any added performance degradation and race conditions of its own,
but probably OK too if only enabled at run time through debug mode.
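
To illustrate the kind of instrumentation being discussed, here is a minimal sketch in Python. It is not a patch for gmetad, which is written in C and calls pthread_mutex_lock() directly; the TimedLock class, its thresholds and the warnings list are all hypothetical names chosen for the example.

```python
import threading
import time


class TimedLock:
    """Wrap a lock and record slow acquisitions and long holds.

    Hypothetical stand-in for instrumenting pthread_mutex_lock() and
    pthread_mutex_unlock(); names and thresholds are assumptions.
    """

    def __init__(self, name, max_wait=1.0, max_hold=1.0):
        self.name = name
        self.max_wait = max_wait   # seconds tolerated waiting for the lock
        self.max_hold = max_hold   # seconds tolerated holding the lock
        self.warnings = []         # collected here instead of a debug log
        self._lock = threading.Lock()
        self._acquired_at = None

    def __enter__(self):
        start = time.monotonic()
        self._lock.acquire()
        now = time.monotonic()
        self._acquired_at = now
        waited = now - start
        if waited > self.max_wait:
            self.warnings.append("%s: waited %.3fs" % (self.name, waited))
        return self

    def __exit__(self, exc_type, exc, tb):
        # Measure the hold time before releasing, while we still own the lock.
        held = time.monotonic() - self._acquired_at
        if held > self.max_hold:
            self.warnings.append("%s: held %.3fs" % (self.name, held))
        self._lock.release()
        return False
```

Used as `with TimedLock("rrd_mutex", max_hold=0.5) as lock: ...`, the wrapper adds two clock reads per critical section, which is the sort of overhead a compile-time switch would avoid in normal builds.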

beware though that trunk (where the patch would need to be applied first)
and 3.1 (where 3.1.2 comes from) might not be in sync on this code, which
has seen several changes lately.

would be interesting also to see how a patched 3.0.7 (or the 3.0 branch HEAD)
would perform in this case as an alternative.

there is also a python version of gmetad in trunk which might help with
what you are doing.

 This last one comes, as given my suspicions on thread starvation, I am
 going to have to instrument a gmetad a bit more to look at the mutexes
 and how long we are in critical sections.

beware gmetad code is a little rusty so report back if you see anything
else that doesn't look quite right.

Carlo

PS. this thread might be better fitted for ganglia-developers.



Re: [Ganglia-general] Ganglia trouble (sorry)

2009-12-16 Thread Carlo Marcelo Arenas Belon
On Wed, Dec 16, 2009 at 08:37:44AM +0100, Tommy Schneider wrote:
 
 Also, for RPMS... the only RPMs i found (for EL5 Linux) are for ganglia 3.0.7

have you tried using the EPEL packages for 3.0.7 instead?, why would you need
to move to 3.1.2 manually?, 3.0 is still a supported branch/version anyway.

Carlo



Re: [Ganglia-general] Hadoop and ganglia

2009-12-10 Thread Carlo Marcelo Arenas Belon
On Wed, Dec 09, 2009 at 11:33:59PM -0500, John Martyniak wrote:

 So on the over page, none of the graphs have any data.  The number of  
 hosts is correct, but the number of CPUs is 0.  If pick cpu_report as  
 the metric, it results in a broken image for each node detail graphs,  
 and the overview graphs show no data.

you have all gmond reporting their metrics too? (AKA mute = no)?

 If I drill down to look at specific data about each node, the graphs  
 show information but not all graphs, and some show broken images.

the web frontend for 3.0 usually assumes that the core metrics are
being reported and if they are not by configuration then those
reports should be broken (as expected, because you explicitly
told gmond not to generate that data).

Carlo



Re: [Ganglia-general] Hadoop and ganglia

2009-12-10 Thread Carlo Marcelo Arenas Belon
On Thu, Dec 10, 2009 at 08:52:56AM -0500, John Martyniak wrote:

 Yes I have mute=yes at this time, so as to filter out the localhost,

by localhost you mean the web/gmetad machine?, then you might as
well just shut down that gmond and point gmetad to use
one of the gmond in the cluster of hadoop boxes as a
data source instead

 since it is just a monitoring machine, and not really part of the  
 cluster.

if you want to have also reports for the core metrics (like cpu
utilization and count) in the machines that are part of the cluster
then all of them should be mute = no and have reasonable thresholds
and collection rates as the ones provided by the default configuration.

Carlo



Re: [Ganglia-general] Problems with Ganglia web interface

2009-12-10 Thread Carlo Marcelo Arenas Belon
On Thu, Dec 10, 2009 at 04:17:18PM +0100, Samuel Gimeno wrote:
 All Xml of all gmond and gmetad are well formed, all echos OK.
 
 Did you say something about apparmor problems? What I can make to fix it? I
 think that that can be the problem all the other things I tried are good...

no idea as I don't use OpenSUSE but google suggested you try :

  http://developer.novell.com/wiki/index.php/Apparmor_FAQ#How_do_I_enable.2Fdisable_AppArmor.3F

Carlo



Re: [Ganglia-general] gmond 3.1.2 becomes deaf in Solaris SPARC

2009-12-10 Thread Carlo Marcelo Arenas Belon
On Thu, Dec 10, 2009 at 10:53:50AM -0500, Jorge Medina wrote:
 
 This time my gmond stayed awake much longer, but eventually went deaf (after 
 15 hours). 

then you have another problem (maybe in addition);

could you see if by chance that is no longer the case with the following
package :

  http://sajino.sajinet.com.pe/ganglia/ganglia-3.1.5.2147.tar.gz

this one should also have fixed the issue you had previously while trying
to build ganglia (BUG215) [1]

also, unless you have a very good reason not to, you are most likely
better off building ganglia as a 32bit application (with 32bit dependent
libraries), as those builds are known to be tested more.

either way, the behaviour you see is most likely a bug and it will be a
good idea to track it down, report it and get it fixed in the long run.

Carlo

[1] http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=215



Re: [Ganglia-general] Problems with Ganglia web interface

2009-12-09 Thread Carlo Marcelo Arenas Belon
On Wed, Dec 09, 2009 at 05:47:39PM +0100, Samuel Gimeno wrote:
 
 I compiled from source in an OpenSUSE 11.1, all gone well but I have some
 problems with web interface, alternative it shows some errors messages:
 Ganglia cannot find a data source. Is gmond running?Cannot find any
 metrics for selected cluster janus, exiting. Check ganglia XML tree
 (telnet 127.0.0.1 8651)
 
 There was an error collecting ganglia data (127.0.0.1:8651): XML error:
 Invalid document end at 53

you mean it sometimes doesn't show that message and just works?, or it always
shows an error regardless of how many times you reload or hit the "get fresh
data" button?

if the former, you most likely have a problem with some hostname or metric
that is generating invalid XML, or some other problem that makes your XML
invalid sometimes (a very slow gmetad, or very slow IO, or a bug in the version
of ganglia you compiled, which you forgot to mention); if the latter, you
probably have a problem with selinux, apparmor, iptables or some other
equivalent system (don't have opensuse, so can't confirm) which is preventing
the ganglia web application from connecting to the gmetad process on ports 8651
and 8652 (you forgot to confirm it, but I would assume from your explanation
that this is the same host where the web application is installed, as
recommended)

 And gmond is running, the ganglia XML tree show metrics and XML is correct,
 I try telnet localhost 8649 and telnet localhost 8651 and host running
 gmetad and telnet localhost 8649 gives an Xml valid and with metrics.
 Gmond and Gmetad works well in host and clients.

if the problem is sporadic you might be able to dump the contents of TCP:8651
in a loop and pass them to an xml validator (like xmllint) for hints.
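
If xmllint is not at hand, the same kind of check could be sketched in Python with only the standard library. This is an illustration rather than an official ganglia tool: read_all and is_well_formed are hypothetical helper names, the host and port are the ones mentioned above, and the check only covers well-formedness (it does not validate against the DTD).

```python
import socket
import xml.etree.ElementTree as ET


def read_all(host="127.0.0.1", port=8651, timeout=30.0):
    """Read everything gmetad writes on its XML port until it closes."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def is_well_formed(xml_bytes):
    """Return (True, None) if the document parses, else (False, error)."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError as err:
        # The message carries the line and column where parsing stopped.
        return False, str(err)
    return True, None
```

Calling `is_well_formed(read_all())` every few seconds and saving the raw bytes whenever it returns False should leave a copy of the broken document to look at afterwards.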

 I'm going crazy trying to make it works, its the second time that I install
 and then works well, only some problems of misconfiguration.

sadly opensuse doesn't have AFAIK official packages for ganglia 3 but there are
some unofficial ones you might be able to use from :

  http://software.opensuse.org/search?baseproject=openSUSE%3A11.2&p=1&q=ganglia

Carlo



Re: [Ganglia-general] multicast source address on Solaris

2009-12-09 Thread Carlo Marcelo Arenas Belon
On Wed, Dec 09, 2009 at 06:02:18PM +, River Tarnell wrote:
 Carlo Marcelo Arenas Belon:
  the attached draft patch should correct the problem
 
 thanks - after applying this patch, mcast_if works correctly.  will this
 be included in the next release?

it is already committed to trunk (what will be 3.2 sometime) with r2121 but
hasn't yet been backported to 3.1 (which is now getting prepared for 3.1.6).

Carlo



Re: [Ganglia-general] Hadoop and ganglia

2009-12-09 Thread Carlo Marcelo Arenas Belon
On Wed, Dec 09, 2009 at 04:49:43PM -0500, John Martyniak wrote:
 
 /* Feel free to specify as many udp_send_channels as you like.  Gmond
 used to only support having a single channel */
 udp_send_channel {
mcast_join = 239.2.11.71
port = 8649
ttl = 1
 }
 
 /* You can specify as many udp_recv_channels as you like as well. */
 udp_recv_channel {
mcast_join = 239.2.11.71
port = 8649
bind = 239.2.11.71
 }
 
 tcp_accept_channel {
port = 8649
 }
 
 This is the config from the gmetad.conf:
 data_source "my cluster" 10.1.1.25 cloud1

cloud1 doesn't make sense there, unless that is the hostname of one
of the servers running gmond and that you want to have as a backup of 10.1.1.25
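
To make the reading of that line explicit: after the cluster name, gmetad.conf takes an optional polling interval and then a list of hosts (optionally host:port) that are tried in order. The sketch below follows that documented format, assuming the cluster name is quoted as gmetad.conf expects when it contains spaces; parse_data_source is a hypothetical helper and not gmetad's own parser, and it does not handle IPv6 literals.

```python
import shlex


def parse_data_source(line, default_port=8649, default_interval=15):
    """Split a gmetad.conf data_source line into (name, interval, hosts)."""
    tokens = shlex.split(line)
    if len(tokens) < 3 or tokens[0] != "data_source":
        raise ValueError("not a data_source line: %r" % line)
    name, rest = tokens[1], tokens[2:]
    interval = default_interval
    if rest[0].isdigit():
        # An all-digit token right after the name is the polling interval.
        interval, rest = int(rest[0]), rest[1:]
    if not rest:
        raise ValueError("no hosts listed: %r" % line)
    hosts = []
    for item in rest:
        host, sep, port = item.partition(":")
        hosts.append((host, int(port) if sep else default_port))
    return name, interval, hosts
```

With the line quoted above this gives two candidate hosts, 10.1.1.25 and cloud1, both on the default port, which is why cloud1 only makes sense if it resolves to a machine running gmond.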

 everything else is commented out.

what else is commented out? I hope you mean only the rest of gmetad.conf,
and not everything in gmond.conf other than what you showed above,
right?

 Also this is version 3.0.7.

BTW if you want to use the default configurations you can even run without a
gmond.conf, but be sure to get the right matching configuration on your
hadoop cluster and restart everything.

some more information of interest, including a nice detailed description
from :

  http://ganglia.info/?p=88

Carlo



Re: [Ganglia-general] Problems with Ganglia web interface

2009-12-09 Thread Carlo Marcelo Arenas Belon
On Wed, Dec 09, 2009 at 11:02:14PM +0100, Samuel Gimeno wrote:
 The version compilled was 3.1.2, which command of xmllint should I use to
 see that Xml is well formed?

assuming you also have netcat (or nc) and are running this from the gmetad server :

$ netcat 127.0.0.1 8651 | xmllint --noout - && echo OK || echo BROKEN



Re: [Ganglia-general] Gaps in graphs

2009-12-07 Thread Carlo Marcelo Arenas Belon
On Sun, Dec 06, 2009 at 09:10:06AM +, Daniel Pocock wrote:
 
  The code from trunk does support the new rrdcached feature that comes
  with newer versions of RRDTool, but that code is currently still in
  development.

rrdtool 1.4 was released already, including rrdcached, what is still
in development?

 Some very big sites use that in production already - it is backported on 
 the 3.1.3/4/5 betas - it is highly recommended

doesn't seem to be a reference to it on the release notes or documentation
and the code from trunk that references to it doesn't seem to be committed
in 3.1, could you elaborate?

Carlo



Re: [Ganglia-general] Ganglia 3.1.5 beta ready for final testing

2009-12-07 Thread Carlo Marcelo Arenas Belon
On Mon, Dec 07, 2009 at 05:19:47PM +, Paul Sobey wrote:
 On Mon, 7 Dec 2009, Carlo Marcelo Arenas Belon wrote:

 It should work fine but if you are feeling brave and want to try what would
 happen also if using C99 then try (beware the name of the package was changed
 but will still unpack over the same directory, so you need to cleanup after
 building one package if looking at the other):

  http://sajino.sajinet.com.pe/ganglia/el4-ganglia-3.1.5.2145.tar.gz (!C99)
  http://sajino.sajinet.com.pe/ganglia/f12-ganglia-3.1.5.2145.tar.gz (C99)

 the package that was bootstrapped in fedora 12 (f12) would use C99 to build
 the ganglia sources while the other package (el4) should be otherwise
 equivalent to the one you tested before.

 I haven't bothered trying the el4 package since you stated it was  
 (probably) equivalent. Confirm the the f12 package doesn't compile 
 against a new python 2.6.2 giving the same error as mentioned earlier in 
 the thread.

Oops, somehow forgot to mention the workaround for that which will be to do
before calling configure the following :

  $ ac_cv_prog_cc_c99=no
  $ export ac_cv_prog_cc_c99

Could you validate the following package works in your setup :

  http://sajino.sajinet.com.pe/ganglia/ganglia-3.1.5.2147.tar.gz (*)

It works in Solaris 10u8 x86 using both SUNWgcc and Sun Studio 12u1 and
built against SUNWPython at least but it will most likely need to be
worked around to disable C99 in Solaris 8 and 9 (CC Daniel to check that)

Carlo

(*) not the official 3.1.5.2147



Re: [Ganglia-general] Ganglia 3.1.5 beta ready for final testing

2009-12-06 Thread Carlo Marcelo Arenas Belon
On Sun, Dec 06, 2009 at 05:09:08PM +, Paul Sobey wrote:
 On Saturday 05 December 2009 18:01:00 Carlo Marcelo Arenas Belon wrote:
  
  (*) this is not the official 3.1.5.2139 package but one that was patched
  additionally with relevant backports from trunk and bootstrapped in
  Centos 4 for added injury.
 
 Compiles perfectly against a new python 2.6.2 - thankyou! I thought the 
 following thread might be of use:
 
 http://bugs.python.org/issue1759169
 
 but it would seem that I was either wrong or that you've worked around it.

that is still relevant for your python binaries, as you would otherwise have
problems if using C99.  the ganglia package you used worked around it by not
building with C99 (which is what bootstrapping with CentOS 4 does)  but
still getting a fix for BUG215 so it wouldn't need to tweak CC or CFLAGS :

  http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=215

 I haven't tested it yet - will do so tomorrow.

It should work fine but if you are feeling brave and want to try what would
happen also if using C99 then try (beware the name of the package was changed
but will still unpack over the same directory, so you need to cleanup after
building one package if looking at the other):

  http://sajino.sajinet.com.pe/ganglia/el4-ganglia-3.1.5.2145.tar.gz (!C99)
  http://sajino.sajinet.com.pe/ganglia/f12-ganglia-3.1.5.2145.tar.gz (C99)

the package that was bootstrapped in fedora 12 (f12) would use C99 to build
the ganglia sources while the other package (el4) should be otherwise
equivalent to the one you tested before.

 I was under the impression that 
 the python module was needed to get per volume disk stats, is that not the 
 case?

there is a python module called multidisk.py that does that, but AFAIK it
doesn't support ZFS and is linux (might work partially in others) only though.

 We have lots of zfs volumes and I'd like to be able to graph usage of 
 each, and hopefully use rrd's trending to get some sort of prediction when 
 I'll need a bigger fileserver.

fixing multidisk.py to understand zfs (especially the difference between
zpools and file systems) would be a solution for that, unless you decide to
ignore the extra metrics which would most likely be created otherwise (if it
even works)

you can also make a script which will collect the values you are interested
in and use gmetric to generate the right metrics as an alternative (which
doesn't require python support and would work even with ganglia 3.0)
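
A rough sketch of such a script is shown below. It assumes the usage figures arrive as tab-separated "name, bytes used" lines (something like `zfs get -Hp -o name,value used` is assumed to produce that, but check on your system); the helper names and the zfs_used_ metric prefix are made up for the example, and only the long gmetric options --name, --value, --type and --units are used.

```python
import subprocess


def parse_usage(text):
    """Parse 'name<TAB>used_bytes' lines into (name, bytes) pairs."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, used = line.split("\t")[:2]
        pairs.append((name, int(used)))
    return pairs


def gmetric_args(fs_name, used_bytes, gmetric="gmetric"):
    """Build the gmetric command line for one file system."""
    # Slashes are replaced to keep the metric name simple.
    metric = "zfs_used_" + fs_name.replace("/", "_")
    return [gmetric, "--name", metric, "--value", str(used_bytes),
            "--type", "double", "--units", "bytes"]


def publish(text, run=subprocess.check_call):
    """Send one metric per file system; 'run' can be swapped for testing."""
    for fs_name, used in parse_usage(text):
        run(gmetric_args(fs_name, used))
```

Run from cron every minute or so, this creates one metric (and so one RRD) per file system, so with a couple of hundred file systems it is worth deciding first which ones really need a graph.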

 Also is it safe to use this version with a 
 collector gmond @3.1.2 or should I upgrade all? That's a slightly more 
 tiresome operation which take me a while a longer...

all versions of 3.1.x are compatible with each other, so you don't really need
to upgrade all the others, unless you want to.

Carlo



Re: [Ganglia-general] Ganglia 3.1.5 beta ready for final testing

2009-12-05 Thread Carlo Marcelo Arenas Belon
On Fri, Dec 04, 2009 at 04:12:30PM +, Paul Sobey wrote:
 On Wed, 25 Nov 2009, Carlo Marcelo Arenas Belon wrote:

 On Tue, Nov 24, 2009 at 09:05:12PM -0800, Bernard Li wrote:

 I should clarify, what I meant to say was:

 The resulting tarball should build fine on Solaris without Python support

 I don't believe this is a regression from previous release(s)

 not really; 3.1.4 builds fine with python support in Solaris 10 as shown in
 the thread you linked to :

   http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg05098.html

 the new bootstrapping might also affect Solaris 8 or older releases, which is
 why I don't think that expediting this release will make much sense,
 specially considering that there are almost no benefits IMHO.

 Apols but I still can't make 3.1.4 build with python support - and there  
 are no csw dependencies anywhere in the chain for me. I welcome  
 suggestions though - I've just realised I need the python module to watch 
 the 200 zfs filesystems on one of my thumpers!

you are a brave man, and even though you have already done your fair share
of beta testing, I hope you don't mind trying a patched snapshot of what would
hopefully become 3.1.6 later and which might solve your problem :

  http://sajino.sajinet.com.pe/ganglia/ganglia-3.1.5.2139.tar.gz (*)

it of course builds fine for me using the setup described on the linked
URL above.

also curious on how having working python support would help with zfs?

Carlo

(*) this is not the official 3.1.5.2139 package but one that was patched
additionally with relevant backports from trunk and bootstrapped in
Centos 4 for added injury.



Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing

2009-12-03 Thread Carlo Marcelo Arenas Belon
On Wed, Dec 02, 2009 at 07:41:39PM +, Daniel Pocock wrote:
 
 Therefore, the approach might need to be some combination of the 
 solutions.  E.g. a configure option that allows people to choose the new 
 behaviour or the old behaviour.

-1, this will double our supported paths for almost no gain, knowing
that at least 50% are broken, and still underscores the nature of the problem,
because changing the initialization would also affect (in a platform specific
way) things like threaded gmond modules and the resources they rely on, just
as an example.

 As we know the new behaviour works on Solaris and Linux

with the version of APR that it was tested with, which is also a moving
target.

 then the package can be built the new way on those 
 platforms by default.  On BSD, users could choose what they want by 
 setting a configure option.  If a user had an updated apr (provided such 
 update is feasible), they might compile with the new behaviour.

again, this is not a BSD specific problem (indeed I suspect that solaris
might be affected as well, specially in cases where APR was compiled to
use port_getn), because then apr_poll_* has slightly different semantics
than poll and therefore could result in platform specific failures that
might not be as obvious as the kqueue one was for the BSDs.

the problem that we were trying to solve was just to propagate correctly
the status from the gmond daemon to the caller and for a proof of concept
in that direction (as suggested before) refer to :

  http://www.mail-archive.com/ganglia-develop...@lists.sourceforge.net/msg05390.html

Carlo



Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing

2009-12-02 Thread Carlo Marcelo Arenas Belon
On Wed, Dec 02, 2009 at 01:57:44AM +, Carlo Marcelo Arenas Belon wrote:
 On Tue, Dec 01, 2009 at 10:20:32PM +, Daniel Pocock wrote:
  - Can you easily re-compile APR with a different poll implementation?  I  
  think you can change it from configure.
 
 Which option?, --enable-other-child doesn't make a difference and considering
 how many different versions of APR are installed in all affected systems I
 would be surprised this to be an APR issue.

and surprised I am, as the problem goes away if APR is forced to use poll
instead of kqueue.

but that of course requires a patched version of apr (including bootstrapping)
and is probably not an option, unless we go back to the dark ages of including
all dependencies statically.

if anyone is interested I am attaching a patch for apr-1.3.9 which could be
used to fix this problem in {Free,Net,Open}BSD and which will also require
that ganglia be linked with the patched library by doing something like (using
/opt/ganglia to avoid clashing with the system provided packages and ignoring
the fact that you would need to be root with a bourne shell to execute the
following incantation, and that is very unlikely to be a good idea anyway) :

  # mkdir -p /opt/ganglia
  # tar -xvzf apr-1.3.9.tar.gz
  # cd apr-1.3.9
  # patch -p1 < apr-1.3.9-configure-disablekqueue.patch
  # ./configure --prefix=/opt/ganglia
  # make
  # make install
  # cd ..
  # tar -xvzf ganglia-3.1.5.tar.gz
  # cd ganglia-3.1.5
  # ./configure --prefix=/opt/ganglia --with-libapr=/opt/ganglia/bin/apr-1-config
  # make
  # make install
  # LD_LIBRARY_PATH=/opt/ganglia/lib /opt/ganglia/bin/gmond

Carlo

PS. DragonFlyBSD will still be affected and MacOS X was, luckily, probably not

--- apr-1.3.9/configure	Mon Sep 21 14:59:34 2009
+++ apr-1.3.9/configure	Wed Dec  2 01:45:45 2009
@@ -5762,6 +5762,10 @@
 ac_cv_o_nonblock_inherited=yes
   fi
 
+  if test -z $ac_cv_func_kqueue; then
+test x$silent != xyes  echo   setting ac_cv_func_kqueue to \no\
+ac_cv_func_kqueue=no
+  fi
 	;;
 *-netbsd*)
 
@@ -5792,6 +5796,10 @@
 ac_cv_o_nonblock_inherited=yes
   fi
 
+  if test -z $ac_cv_func_kqueue; then
+test x$silent != xyes  echo   setting ac_cv_func_kqueue to \no\
+ac_cv_func_kqueue=no
+  fi
 	;;
 *-freebsd*)
 
@@ -5838,15 +5846,12 @@
   fi
 
 fi
-# prevent use of KQueue before FreeBSD 4.8
-if test $os_version -lt 48; then
-
+# prevent use of KQueue
   if test -z $ac_cv_func_kqueue; then
 test x$silent != xyes  echo   setting ac_cv_func_kqueue to \no\
 ac_cv_func_kqueue=no
   fi
 
-fi
 	;;
 *-k*bsd*-gnu)
 


Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing

2009-12-02 Thread Carlo Marcelo Arenas Belon
On Wed, Dec 02, 2009 at 10:36:02AM +, Daniel Pocock wrote:
 Carlo Marcelo Arenas Belon wrote:

 but that of course requires a patched version of apr (including
 bootstrapping) and is probably not an option, unless we go back
 to the dark ages of including all dependencies statically.
   
 Maybe we can fix apr too

Changing APR to use poll instead of kqueue is a way to fix it, but then
to be able to use that fix, we will need to go back to have our own
apr tree.

 I found a description of the same issue in a Google search on the subject:

 http://www.google.com/search?q=kqueue+fork+process+bad+file+descriptor;
 http://www.mail-archive.com/freebsd-hack...@freebsd.org/msg69516.html

 Can you try re-enabling kqueue and patching apr to use rfork()?

Doesn't work, and now fails on sending the metrics, because of course
this time the parent process closes that socket and the child can't use it
after that.

The only viable solution I see is to delay the creation of all the sockets
until daemonized as it was being done originally.

If you really need the parent to report back on such failures, then you
will have to keep the parent around and have the child send its status back,
through a unix socket or similar, until it gets into the main loop, which is
the other option you suggested originally.
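
A minimal sketch of that second option, in Python rather than in gmond's C and with a plain pipe standing in for the "unix socket or similar". The names start_daemon and setup are hypothetical and not part of ganglia or apr; the point is only that every socket is created after the fork, while the caller still learns whether startup worked:

```python
import os


def start_daemon(setup):
    """Fork, run setup() in the child, and report its outcome to the parent.

    Returns 0 in the parent when setup() succeeded in the child, 1 otherwise.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: sockets (and any kqueue descriptor) are only created here,
        # after the fork, so nothing has to be inherited from the parent.
        os.close(read_fd)
        try:
            setup()
            status = b"\x00"
        except Exception:
            status = b"\x01"
        os.write(write_fd, status)
        os.close(write_fd)
        # A real daemon would enter its main loop here instead of exiting.
        os._exit(0)
    # Parent: block until the child reports, then pass the result on.
    os.close(write_fd)
    status = os.read(read_fd, 1)
    os.close(read_fd)
    # The sketch's child exits right away, so reaping it is safe; a real
    # parent would simply exit with the status and leave the daemon running.
    os.waitpid(pid, 0)
    return status[0] if status else 1
```

An init script calling such a parent would see a non-zero result when, for example, the listener setup fails because the port is already taken.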

Carlo



Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing

2009-12-02 Thread Carlo Marcelo Arenas Belon
On Wed, Dec 02, 2009 at 11:17:26AM +0000, Daniel Pocock wrote:
 Carlo Marcelo Arenas Belon wrote:
 On Wed, Dec 02, 2009 at 10:36:02AM +0000, Daniel Pocock wrote:

 Can you try re-enabling kqueue and patching apr to use rfork()?

 Doesn't work, and now fails on sending the metrics, because of course
 this time the parent process closes that socket and the child can't use it
 after that.

 The only viable solution I see is to delay the creation of all the sockets
 until daemonized as it was being done originally.
   
 The problem with that is that if another process is already listening on  
 one of the ports wanted by gmond, then the listener set up will fail,  
 but if the problem is only detected after daemonizing, then the caller  
 doesn't know about the failure.

but that is something that could be fixed at the caller level by just
checking if the port is already bound to something before calling gmond.

agree that is not elegant, but is better than the current situation where
you can't start gmond at all.
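
As a rough illustration of that caller-level check (a sketch only, written in Python; port_is_free is a hypothetical helper, not something shipped with ganglia), an init script could probe the TCP port before launching gmond:

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        probe.bind((host, port))
        return True
    except OSError:
        # Typically EADDRINUSE: something else already holds the port.
        return False
    finally:
        probe.close()
```

This only covers TCP and is inherently racy, since another process can still grab the port between the check and gmond's own bind, which is part of why it is not elegant.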

 If you really need to avoid having the parent report back on issues on that
 then you are going to keep the parent around and send the status back from
 the child until getting into the main loop through a unix socket or similar
 instead as you suggested originally was another option.
   
 That is not as easy to implement in apr as the apr_proc_detach() call.   

frankly I don't much like all the abstractions that apr_* provides, because
they make simple things like this more complicated (especially because of the
unintended side effects), but since apr_proc_detach is just calling fork
and reopening the 3 std filehandles it shouldn't be that difficult to work
around.

 apr_proc_fork() is described as the only call in apr that is not  
 portable.  apr_proc_create() could be used to invoke another gmond  
 process, but I'm not sure that apr guarantees to preserve the file  
 descriptors and memory allocations across that call.

apr_proc_fork() is not called by apr_proc_detach() AFAIK; indeed I was
surprised to see it even existed when I noticed that apr_proc_detach calls
fork() directly.

 Maybe the problem has something to do with the way detach recycles  
 stdin/stdout/stderr?  As a quick test, could you try modifying gmond.c  
 so that it calls fork() directly rather than calling apr_proc_detach()?

fork() doesn't work because the kqueue filehandle is not inherited; using
rfork() instead doesn't work either, because all filehandles are closed by the
exit(0) in the parent, so it fails in the same way that apr_proc_detach()
does when changed to use rfork().

Carlo



Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing

2009-12-02 Thread Carlo Marcelo Arenas Belon
On Wed, Dec 02, 2009 at 11:48:51AM +0000, Daniel Pocock wrote:

 fork() doesn't work because the kqueue filehandle is not inherited; using
 rfork() instead doesn't either because all filehandles are closed by doing
 exit(0) in the parent and so fails in the same way that changing
 apr_proc_detach() does when changed to use rfork() instead.
   
 I'm not a BSD expert, do you know if there is any ioctl or something  
 that can be used to tell BSD to keep the file descriptors for the child  
 process?

not a BSD expert either, but I would think that would be very unlikely.

I would suggest reverting r2025 in trunk and starting to look for an
alternative solution, but it would probably be easier to just revert r2043 for
3.1 as well to solve the release blocker, with the possibility of adding some
logic to the init script to help with the case you were trying to prevent
with the original feature.

Carlo

PS. apache httpd must have a solution as they don't seem to have kqueue 
disabled, but that solution is probably just to delay the port binding
as was done originally (except that they manage better the failures)



Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing

2009-11-30 Thread Carlo Marcelo Arenas Belon
On Mon, Nov 30, 2009 at 01:29:34PM +0000, Daniel Pocock wrote:
 Carlo Marcelo Arenas Belon wrote:

 Your call, even though a fix for this feature will probably be preferred, as
 there is nothing special about the BSDs for them to be affected and it might
 be that the problem is therefore more generic.
   
 It may be that this bug is revealing a more serious issue in the way  
 initialisation is done, so I would prefer to know the real cause rather  
 than just revert the change that forces the problem to show itself.

agreed, and as I said before that is the reason why I didn't just revert it
from trunk or 3.1 as a fix, even if that seems to resolve the problem.

 At least a revert would be needed for 3.1 as this accounts for a regression
 but haven't done so either waiting for you to first revert it on trunk and
 then decide on how to proceed from there depending on how critical this
 feature was for the release.
   
 I agree that it is a regression, but reverting it may cause the real  
 culprit to remain hidden.  I'd rather hold the release while we look  
 more closely.

not sure I understand what you meant here, since it seems obvious to
me that 3.1.5 can't be released unless a fix (even if it is just reverting
the change) is committed.

are you saying you want to hold off on deciding whether to release 3.1.5, or
to see what will be in 3.1.6?  if the latter, I would suggest also pulling in
some other fixes, and of course that would also require us to agree
on a bootstrapping environment, for this release at least.

 The change has been working on Linux, Solaris and Cygwin.

 Other than just doing a manual bisect (using git instead of svn here would
 have been useful) to find where the problem was introduced and validating that
 reverting it corrects the problem, I haven't done much analysis of it, but the
 fact that it broke in such a strange way (I was indeed expecting the culprit
 to be somewhere else, especially considering all the recent changes in the
 networking and the fact that it seemed originally to be triggered by a TCP
 request) probably points to a bigger issue which just happens not to have
 been visible on the configurations used to test Linux, Solaris and Cygwin,
 especially considering how pervasive it was (it broke all the BSDs I had
 access to test, at least)
   
 Can you provide output from strace/truss and also a stack trace from the  
 point where it is in the infinite loop?

filed BUG246 with the trace information (collected from OpenBSD 4.5 amd64)
using ktrace, but you got me there.

from the way the problem presents itself it isn't really obvious where the
offending code is, and it is difficult to debug as well since it disappears
when in debug mode or when not running as a daemon, which is the reason why
I haven't been able to capture a backtrace yet either.

 There is a good reason for moving the daemonize code the way I did - an  
 alternative would be to daemonize, but make the original process hang  
 around until the daemon process has entered the main loop.

OK, and I assume it is probably related to the cases where gmond suddenly
dies at startup without notification, but some clarification on what
problem you were trying to solve would probably be useful too.

Carlo



Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.5 beta ready for final testing

2009-11-29 Thread Carlo Marcelo Arenas Belon
On Tue, Nov 24, 2009 at 06:03:51PM -0800, Bernard Li wrote:
 
 Please help us test on as many OS/archs as possible, as this would go
 GA quite immediately ;-)

FreeBSD is not able to return any XML data through TCP/8649 (tested with
FreeBSD 8.0 amd64).

DragonFlyBSD fails to build, but a 3.2 version of ganglia which includes
fixes for that fails with the same TCP issue as FreeBSD, so this
issue might be affecting other BSDs as well.
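
For anyone wanting to repeat the test on other platforms, here is a small sketch of the check in Python. fetch_gmond_xml is a hypothetical helper name; it only assumes that gmond dumps its XML state to whoever connects to its TCP accept channel (8649 here, as above) and then closes the connection:

```python
import socket


def fetch_gmond_xml(host="localhost", port=8649, timeout=5.0):
    """Connect to gmond's TCP port and return everything sent before close."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        while True:
            data = conn.recv(4096)
            if not data:
                # Peer closed the connection: the dump is complete.
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

On an affected build the connection yields no XML, so an empty string (or a socket timeout) from this function would match the failure described above.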

Carlo



Re: [Ganglia-general] multicast source address on Solaris

2009-11-28 Thread Carlo Marcelo Arenas Belon
On Wed, Nov 25, 2009 at 10:57:37AM +0000, River Tarnell wrote:
 Carlo Marcelo Arenas Belon:
  
  have you added mcast_if to both the udp_send_channel and udp_recv_channel
  configurations?
 
 i only had it for send_channel.  i've added it to recv_channel as well,
 but it seems to make no difference.

you need to have it added for both (unless you are routing multicast
packets between both interfaces) or you won't be able to see your own
metrics.

 thanks,

the attached draft patch should correct the problem; please disregard all
instructions about adding routes, as I noticed later that I had misunderstood
your report and was therefore assuming the problem was that you couldn't send
the multicast traffic through the interface, and not that it was just going
out with an incorrect source IP.
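
For reference, at the plain socket level the interface (and with it the source address) used for outgoing multicast is normally selected with the IP_MULTICAST_IF option. The Python sketch below only illustrates that mechanism; make_mcast_sender is a hypothetical name, and it is an assumption, not a description of the patch, that gmond's mcast_if handling ends up doing something equivalent through apr:

```python
import socket
import struct


def make_mcast_sender(interface_ip, ttl=1):
    """Create a UDP socket whose multicast packets leave via interface_ip."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # TTL is passed as a single unsigned byte, which is the portable form.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL,
                    struct.pack("B", ttl))
    # Select the outgoing interface by one of its local addresses.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                    socket.inet_aton(interface_ip))
    return sock
```

With the addresses from this thread that would be make_mcast_sender("91.198.174.204") followed by sendto() to ("239.2.11.71", 8649), after which snoop should show the packets leaving with that source address.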

Carlo

[attachment: draft patch, bzip2-compressed; binary content not reproduced]


Re: [Ganglia-general] multicast source address on Solaris

2009-11-25 Thread Carlo Marcelo Arenas Belon
On Wed, Nov 25, 2009 at 07:45:46AM +0000, River Tarnell wrote:
 
 Carlo Marcelo Arenas Belon:
  FYI gmond in Solaris 10 zones has problems as tracked by BUG100 :
 
 there are no zones configured on the system.  these are IP Multipathing
 interfaces.

OK, and the interfaces you are trying to use are physical interfaces?
what do you get with :

  # dladm show-link

are you expecting to send the multicast traffic through a specific VLAN?

   by default, gmond (or Solaris?) seems to choose the first available
   interface for the source address, which is 172.16.0.129.
  
  AFAIK Solaris does by adding a route to 224.0/4 through that interface
  as shown by `netstat -rn`
  
 i see no route for 224/4 on this system:
 
 Routing Table: IPv4
 Destination      Gateway          Flags   Ref        Use Interface
 ---------------- ---------------- ----- ----- ---------- -----------
 default          91.198.174.193   UG        1      92511
 10.24.1.0        10.24.1.24       U         1       6471 bge301000
 10.24.1.0        10.24.1.25       U         1       2294 bge301001
 10.24.1.0        10.24.1.24       U         1          0 bge301000:1
 10.24.1.0        10.24.1.25       U         1          0 bge301000:2
 10.24.1.0        10.24.1.25       U         1          0 bge301001:1
 10.24.1.0        10.24.1.25       U         1          0 bge301001:2
 10.24.1.0        10.24.1.25       U         1          0 bge301001:3
 91.198.174.192   91.198.174.204   U         1       7843 bge102000
 91.198.174.192   91.198.174.208   U         1        893 bge102001
 91.198.174.192   91.198.174.204   U         1          0 bge102000:1
 91.198.174.192   91.198.174.208   U         1          0 bge102000:2
 91.198.174.192   91.198.174.208   U         1          0 bge102001:1
 91.198.174.192   91.198.174.204   U         1          0 bge102001:2
 172.16.0.128     172.16.0.129     U         1       1003 nge0
 172.16.1.0       172.16.1.1       U         1       1020 nge1
 172.16.4.0       172.16.4.1       U         1        742 clprivnet0
 239.2.11.71      91.198.174.204   UGH       1          0
 127.0.0.1        127.0.0.1        UH        1   91055435 lo0
 
 curiously, the route is present on our other (single-interface) Solaris
 systems.
 
# route delete -interface 224.0/4 -gateway 172.16.0.129
# route add -interface 224.0/4 -gateway 91.198.174.204
  
 damiana# route delete -interface 224.0/4 -gateway 172.16.0.129
 delete net 224.0/4: gateway 172.16.0.129: not in table
 damiana# route add -interface 224.0/4 -gateway 91.198.174.204
 add net 224.0/4: gateway 91.198.174.204
 
 after restarting gmond, it's still sending packets from 172.16.0.129.

does it then show that it is subscribed on that interface, and does snoop see
the packets coming out of that interface as well?

 # netstat -g

have you added mcast_if to both the udp_send_channel and udp_recv_channel
configurations?, if it still doesn't work, could you get the output from the
first seconds of :

 # gmond -d9

and a truss of the same?

Carlo 



Re: [Ganglia-general] Ganglia 3.1.5 beta ready for final testing

2009-11-24 Thread Carlo Marcelo Arenas Belon
On Tue, Nov 24, 2009 at 09:05:12PM -0800, Bernard Li wrote:
 
 I should clarify, what I meant to say was:
 
 The resulting tarball should build fine on Solaris without Python support
 
 I don't believe this is a regression from previous release(s)

not really; 3.1.4 builds fine with python support in Solaris 10, as shown in
the thread you linked to :

  http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg05098.html

the new bootstrapping might also affect Solaris 8 or older releases, which is
why I don't think that expediting this release makes much sense, especially
considering that there are almost no benefits IMHO.

Carlo



Re: [Ganglia-general] multicast source address on Solaris

2009-11-24 Thread Carlo Marcelo Arenas Belon
On Wed, Nov 25, 2009 at 06:25:37AM +0000, River Tarnell wrote:
 
 i'm running gmond 3.1.5 on a Solaris 10 system with several interfaces:

FYI gmond in Solaris 10 zones has problems as tracked by BUG100 :

  http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=100

 by default, gmond (or Solaris?) seems to choose the first available
 interface for the source address, which is 172.16.0.129.

AFAIK Solaris does, by adding a route to 224.0/4 through that interface,
as shown by `netstat -rn`

 however, i
 would prefer it to use 91.198.174.204 on bge102000.  i first tried
 adding a route for the multicast destination:
 
 damiana# route -p add -host 239.2.11.71 91.198.174.204
 
 but that made no difference, so i tried this instead:
 
 damiana# route -p add -host 239.2.11.71 91.198.174.204 -interface

what happens if you do instead :

  # route delete -interface 224.0/4 -gateway 172.16.0.129
  # route add -interface 224.0/4 -gateway 91.198.174.204

 which still didn't work.  i tried adding 'mcast_if = bge102000' to
 udp_send_channel.  this made no difference either.

this is most likely a bug, but it might be expected if bge102000 is a zone
interface instead of real hardware, as explained in BUG100.

Carlo



Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.4 beta ready for testing

2009-11-03 Thread Carlo Marcelo Arenas Belon
On Mon, Nov 02, 2009 at 03:05:32PM -0800, Bernard Li wrote:
 
 Can you please test this tarball bootstrapped on Fedora 9.

It works, but would invalidate all testing that was done for 3.1.3
and the original 3.1.4.

 If it works I will replace the original tarball with this:
 
 http://ganglia.info/testing/bootstrapped_on_fedora9/ganglia-3.1.4.tar.gz

-1

Changing the release package in the middle of a release is a bad idea;
indeed, changing it without bumping the release version goes against our
release procedures, as it could result in different binary packages. That
was the reason why the unofficial package I provided was published far
from the ganglia servers: to hopefully avoid confusion and frustration
if someone later finds a bug that happens to be reproducible only in
the other version.

There is also the risk of introducing a bug (like the one in 3.1.2 from
bootstrapping in SuSE with automake 1.9.6, which prevented users that had
the 32bit libraries for apr installed on 64bit systems from getting a working
build), so as much as I am excited about finally moving to some more
modern versions of the autotools, this only makes sense as part of 3.1.5,
which will hopefully also allow enough time to remove all the needed hacks
and finally clean up the bootstrapping code.

Carlo



Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.4 beta ready for testing

2009-11-03 Thread Carlo Marcelo Arenas Belon
On Mon, Nov 02, 2009 at 11:09:40PM +0000, Daniel Pocock wrote:
 
 I note Paul is using gcc, whereas I'm building and testing with Sun 
 Studio on the OpenCSW build farm - Sun's compiler is now a free 
 download, and it is used to build all the CSW libraries (including those 
 used by Ganglia), so this is now the easiest solution to support - that, 
 and Solaris 8 support, led me to tweak the configure.in stuff for 
 Solaris - maybe it needs more tweaking to support gcc - would anyone 
 like to comment on the preferred gcc build environment to be supported?

IMHO any gcc should work, and indeed gcc was the originally supported
compiler for ganglia in Solaris (Sun Studio was added later in 3.1.1 when
it was made freely available with OpenSolaris).

while working on libmetrics (as can be seen in the corresponding metrics.c
file) the following versions of gcc were used (most of them using SUNWtoo
and other SUNW provided tools as part of the toolchain when possible) :

  Solaris 7 x86 (32-bit) with gcc-2.8.1 (this one used GNU binutils AFAIK)
  Solaris 8 (64-bit) with gcc-3.3.1
  Solaris 9 (64-bit) with gcc-3.4.4
  Solaris 10 SPARC (64-bit) and x86 (32-bit and 64-bit) with SUNWgcc

 On the issue of the gcc environment, we basically need a second version 
 of scripts/build-solaris.sh for gcc - this raises questions like should 
 the libraries (apr, confuse) be built with gcc too?  Which ld, ar, etc?

This is IMHO a packager call. After all, we don't provide binaries (well, we
do, but almost no one uses them) because, as you pointed out, the decision on
which toolchain to use needs to be made at the distribution or system
engineering level, and so we are left to support them all the best we can.

In cases where there is some overlap (like the CSW packages, where the
package maintainers are also upstream contributors) or when it helps to
simplify maintenance on a specific platform (like the CentOS 4 RPMs or the
Makefile.WiX recipes for Cygwin), it makes sense to have some additional
code to help with it, and also some more testing or confidence that the
resulting binaries work as expected, but that should never be considered
the only supported solution IMHO.

Carlo



Re: [Ganglia-general] Ganglia 3.1.4 beta ready for testing

2009-10-29 Thread Carlo Marcelo Arenas Belon
On Tue, Oct 27, 2009 at 09:52:52AM +0000, Paul Sobey wrote:

 /usr/include/sys/feature_tests.h:336:2: error: #error Compiler or options 
 invalid; UNIX 03 and POSIX.1-2001 applications   require the use of 
 c99
 make[2]: *** [getopt1.o] Error 1
 
 Googling leads me to try compiling with CFLAGS=-std=gnu99 per:
 
 http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=215

this is a bug in the autoconf from CentOS 4, which is used to build the
release packages; therefore you can also work around the issue by
rebootstrapping the package or making your own with a better version
of the autotools.  for simplicity I've uploaded an unofficial release
package for 3.1.4 bootstrapped on fedora rawhide to :

  http://sajino.sajinet.com.pe/ganglia/ganglia-3.1.4.tar.gz

 If do that, compilation fails building against Python 2.6.2 (built with 
 same toolchain):

once you use -std=gnu99 it is no longer the same toolchain, and therefore
building python with the same standard support should solve your problem.

Carlo



Re: [Ganglia-general] Ganglia 3.1.4 beta ready for testing

2009-10-29 Thread Carlo Marcelo Arenas Belon
On Thu, Oct 29, 2009 at 04:44:59PM +0000, Paul Sobey wrote:
 I note from the Makefile Daniel posted:
 
 # Depends: some issues exist getting the Python support working on 
 Solaris,
 # Ganglia's configure.in needs to be further enhanced for this to work

I think this is a CSW specific problem, as I had no problem getting
python support compiled in Solaris 10u7 x86 using SUNWPython-devel, SUNWgcc,
SUNWlexpt and compiled versions of confuse and apr.

  $ PATH=$PATH:/usr/sfw/bin:/usr/ccs/bin
  $ ./configure CC="gcc -std=gnu99" --prefix=/usr/local \
      --with-libapr=/usr/local/apr/bin/apr-1-config --with-libconfuse=/usr/local
  $ make

Daniel, could you elaborate?

Carlo



Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.1.4 beta ready for testing

2009-10-29 Thread Carlo Marcelo Arenas Belon
On Thu, Oct 29, 2009 at 08:42:05PM +0000, Daniel Pocock wrote:
 
 Carlo Marcelo Arenas Belon wrote:
  On Thu, Oct 29, 2009 at 04:44:59PM +0000, Paul Sobey wrote:

  I note from the Makefile Daniel posted:
 
  # Depends: some issues exist getting the Python support working on 
  Solaris,
  # Ganglia's configure.in needs to be further enhanced for this to work
 
  Daniel, could you elaborate?
 
 Although I have described the Python module in the CSW Makefile, it is 
 not something I have properly tested.

OK, and I haven't done any testing either, other than making sure it builds
and that a mod_example like module can be loaded, but my question was more
about the need to change configure.in to support python modules, which you
were referring to in the Makefile as Paul noted.

 I am still working through some 
 core agent problems (e.g. see the discussion on csw-maintainers about 
 building a 64 bit version of everything: I've noticed that when running 
 a 32 bit binary on some 64 bit machines with lots of RAM, some kstat 
 calls lead to a seg fault)

care to provide a link to the thread or any bug reports?  earlier releases
of 3.0 required 64bit binaries as they were reading kernel memory directly
to gather the statistics, but after those metrics were migrated to kstat
that shouldn't be an issue anymore, and I am running some 32-bit 3.0 agents
on solaris sparc with a significant amount of memory as well, so there
might be a regression to track here.

Carlo



Re: [Ganglia-general] Ganglia 3.1.4 beta ready for testing

2009-10-29 Thread Carlo Marcelo Arenas Belon
On Thu, Oct 29, 2009 at 01:10:14PM -0700, Bernard Li wrote:
 On Thu, Oct 29, 2009 at 12:01 PM, Carlo Marcelo Arenas Belon
 care...@sajinet.com.pe wrote:
 
  this is a bug on the autoconf from CentOS 4 which is used to build the
  release packages, therefore you can also workaround the issue by
  rebootstrapping the package or making your own with a better version
  of the autotools.  for simplicity I'd uploaded an unofficial release
  package for 3.1.4 bootstrapped on fedora rawhide in :
 
   http://sajino.sajinet.com.pe/ganglia/ganglia-3.1.4.tar.gz
 
 Do you have a link for the bug, and are you aware whether there are
 updates for CentOS 4 to fix the issue?

I am not aware of a CentOS or RHEL bug report, but considering that EL4
is in maintenance mode there won't be a fix anyway (2.59 was released
in 2003 and the last update to the package was in 2004)

 I guess I could start building on CentOS 5, provided that the autoconf
 does not have this bug.

CentOS 5 also uses autoconf 2.59, so it wouldn't help with this problem, but
it might hopefully allow us to remove all the kludges that were added to work
around the libtool 1.5.6 bugs which were preventing DragonFlyBSD support.

Ideally, which platform is used to bootstrap shouldn't be relevant though,
and IMHO we should instead be aiming for the latest versions of the autotools
(either installed by hand or provided as part of the distribution, if it is
more development focused), which on Linux usually means Fedora, Gentoo
or Debian.

Carlo



Re: [Ganglia-general] Ganglia 3.1.4 beta ready for testing

2009-10-27 Thread Carlo Marcelo Arenas Belon
On Mon, Oct 26, 2009 at 04:51:33PM -0700, Bernard Li wrote:
 
 Ganglia 3.1.4 is ready for testing at:
 
 http://ganglia.info/testing/

DragonFlyBSD fails to build (tested with 2.4.0 32bit).

not a regression (a system header problem which also affects
3.1.2) and there are some trivial unrelated changes in trunk
which could help with that.

Carlo 



Re: [Ganglia-general] in which file do gmetad writes information regarding total cpu load?can it be fetched or read?

2009-09-07 Thread Carlo Marcelo Arenas Belon
On Mon, Sep 07, 2009 at 12:19:26AM +0530, pankaj dorlikar wrote:

each metric (cpu_load included) is stored by gmetad in its own rrd file,
which can be read or manipulated further using rrdtool.

inside the directory that rrd_rootdir in gmetad.conf points to, there is
1 directory per monitored node in the directory structure, plus 1 additional
directory per summarization domain.
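
To make that layout concrete, here is a small Python sketch that walks such a tree. It assumes the usual arrangement of rrd_rootdir/<cluster>/<host>/<metric>.rrd, with the summary directories named __SummaryInfo__ (that name and the helper list_metric_files are assumptions for illustration, so check them against your own install):

```python
import os


def list_metric_files(rrd_rootdir):
    """Map (cluster, host) to the sorted metric names found under rrd_rootdir.

    A per-cluster summary directory shows up as just another "host" entry;
    plain files at the cluster level are skipped.
    """
    found = {}
    for cluster in sorted(os.listdir(rrd_rootdir)):
        cluster_dir = os.path.join(rrd_rootdir, cluster)
        if not os.path.isdir(cluster_dir):
            continue
        for host in sorted(os.listdir(cluster_dir)):
            host_dir = os.path.join(cluster_dir, host)
            if not os.path.isdir(host_dir):
                continue
            # Strip the ".rrd" suffix to get the metric name.
            metrics = [name[:-4] for name in os.listdir(host_dir)
                       if name.endswith(".rrd")]
            found[(cluster, host)] = sorted(metrics)
    return found
```

Each file found this way can then be read with rrdtool, for example with `rrdtool fetch <file> AVERAGE`.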

Carlo



Re: [Ganglia-general] ganglia shows cluster down

2009-07-28 Thread Carlo Marcelo Arenas Belon
On Tue, Jul 28, 2009 at 10:11:33AM +0200, Igor Rosenberg wrote:

 I'd been witnessing very long updates when adding/restarting nodes, and
 wondered why sometimes it took like 15 minutes for the infrastructure to
 reflect the changes.

it should be closer to 20 min, as mentioned in the release notes
(Important Notes) :

  http://ganglia.wiki.sourceforge.net/ganglia_release_notes

 I'm using the ganglia-3.1.2 release (on Debian 4.0 machines). Did the
 (possible) fix you mention reach that version?

there is no fix AFAIK; if you are using unicast, setting send_metadata_interval
is the only available workaround.
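
For reference, the setting goes in the globals section of gmond.conf. The sketch below shows the fragment in a comment and adds a small check that reports the value a given file sets. metadata_interval is a hypothetical helper, the 30 second value and the /etc/ganglia/gmond.conf path are only examples, and treating a missing setting as 0 (metadata sent only at startup or on request) is my reading of the gmond.conf documentation.

```shell
#!/bin/sh
# Expected fragment in gmond.conf (value in seconds, 30 is just an example):
#
#   globals {
#     send_metadata_interval = 30
#   }

# Hypothetical helper: print the send_metadata_interval found in a gmond.conf,
# or 0 when the file does not set it. Commented-out lines are ignored.
metadata_interval() {
    value=$(sed -n 's/^[[:space:]]*send_metadata_interval[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | tail -n 1)
    echo "${value:-0}"
}

# Example use (the path is an assumption):
#   [ "$(metadata_interval /etc/ganglia/gmond.conf)" -gt 0 ] \
#       || echo "unicast nodes will not recover on their own after a restart"
```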

in either case, and unless the protocol is changed in a probably incompatible way
(hence not to be seen in 3.1.x), restarting gmond in the right order (and all
of them per cluster) is the only available workaround for quickly updating
infrequently polled metrics in the 3.1 branch.

Carlo



Re: [Ganglia-general] embedded metric vs. libganglia?

2009-05-21 Thread Carlo Marcelo Arenas Belon
On Wed, May 20, 2009 at 02:41:49PM -0700, Christopher Smith wrote:
 The Java one is actually pretty nice now, but I just didn't see why the C
 one would even be implemented in the first place given that the main project
 *already has* what is needed.

My guess is that it was implemented as a way to isolate the users from
API/ABI changes in libganglia since there is currently no public stable
interface for it.

practically speaking though the interface is pretty much not changing
all that much anyway (3.0.7 to 3.1.0 was the only one I can recall).

 So to clarify: I shouldn't have any problems using libganglia to add metrics
 to my apps right?

as long as you are OK with linking against libganglia and with adapting your
code if the API changes, IMHO; that caveat might as well apply to the
alternatives.

Carlo



Re: [Ganglia-general] Ganglia 3.1.2 -- Modules fail to load intermittently, Hosts disappear from the web interface, XML errors that can't be found

2009-05-15 Thread Carlo Marcelo Arenas Belon
On Thu, May 14, 2009 at 11:47:41AM -0500, Adam Tygart wrote:

 I have been having a heck of a time diagnosing this problem.

I suspect there are several problems here, which OS and architecture?

 I recently updated to ganglia-3.1.2 from 3.0.7.

3.1 and 3.0 are not compatible and can't be on the same cluster, so for
this upgrade to be successful you should have done :

  1) upgrade your gmetad/web to 3.1.2
  2) upgrade all gmond to 3.1.2, cluster by cluster in batches

more details to be found in :

  http://ganglia.wiki.sourceforge.net/ganglia_release_notes

 Since then I have been
 plagued with (what looked like) data errors, mis-reporting swap usage
 was the easiest to see.

could you elaborate here? is the value that gmond is collecting on each
node incorrect? is the aggregate in gmetad incorrect? which one of the
swap metrics is incorrect?

# uname -a
Linux dell 2.6.28-gentoo-r5 #1 SMP Thu Apr 23 21:35:08 PDT 2009 x86_64 Intel(R) Core(TM)2 CPU 6320 @ 1.86GHz GenuineIntel GNU/Linux
# gmond --version
gmond 3.1.2
# telnet 127.0.0.1 8649 | grep swap
<METRIC NAME="swap_total" VAL="4008176" TYPE="float" UNITS="KB" TN="60" TMAX="1200" DMAX="0" SLOPE="zero">
<EXTRA_ELEMENT NAME="DESC" VAL="Total amount of swap space displayed in KBs"/>
Connection closed by foreign host.
<METRIC NAME="swap_free" VAL="4008176" TYPE="float" UNITS="KB" TN="60" TMAX="180" DMAX="0" SLOPE="both">
<EXTRA_ELEMENT NAME="DESC" VAL="Amount of available swap memory"/>
# free | grep Swap
Swap:      4008176          0    4008176

 This seems to be caused by some reporting
 modules failing to load. They fail silently, I don't see logs about it
 anywhere, and when I turn debugging on I still don't see anything.

AFAIK if a module fails to load because of an error it will just prevent
gmond from starting at all (sometimes silently) as detailed in the Known Issues.

if the module is not loaded but the configuration still refers to it for
collection, gmond will also be very noisy about it :

# /etc/init.d/gmond start
 * Starting GANGLIA gmond:  ...
Cannot locate internal module structure 'mem_module' in file (null): /usr/sbin/gmond: undefined symbol: mem_module
Possibly an incorrect module language designation [(null)].
  [ ok ]
# tail /var/log/syslog | grep gmond
May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'mem_total'. Possible that the module has not been loaded.
May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'swap_total'. Possible that the module has not been loaded.
May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'mem_free'. Possible that the module has not been loaded.
May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'mem_shared'. Possible that the module has not been loaded.
May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'mem_buffers'. Possible that the module has not been loaded.
May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'mem_cached'. Possible that the module has not been loaded.
May 15 01:53:23 dell /usr/sbin/gmond[13374]: Unable to find the metric information for 'swap_free'. Possible that the module has not been loaded.

what makes you think the module is not being loaded, and that gmond is
being silent about it? does the module show in the output of :

  # lsof -p `pidof gmond` | grep ganglia

 Usually it is one of the modules, but I have had two occasionally
 happen at the same time. modmem.so and modnet.so are the two to most
 commonly fail.

what is observed when they fail to load?

 I have restarted with a new gmond configuration, changing only the
 configuration of multicast to unicast, and this problem persists.

this might have introduced another problem, for unicast to work somewhat
reliably you need to add a value for send_metadata_interval.

 I have wiped my old rrd data. I have tried everything I know that could
 even remotely be to blame for this problem.
 
 The question I have is this: is this a known bug?

some are, like the unicast send_metadata_interval or the cpu_count
inconsistency as shown by the Important Notes, some others might not be

 Is there something else I should try?

rollback to 3.0, especially if you don't need the modules but want a more
stable setup.

 Can I force a module to be loaded?

no, but a module should never fail to load silently AFAIK

 When the modules do load, hosts report to gmond, and gmeta grabs that
 data and logs it. My webserver then serves up the data through the
 ganglia interface. The problem I am having here is that I get
 intermittent xml errors, mostly saying that there is a missing  on
 line $SomeLineNumber (always changes). Happens every 15 minutes or so.
 I cannot reproduce any problems with the xml, however. I ran xmllint
 on the xml 1 per second for an hour with no errors, during which time
 the web interface 

Re: [Ganglia-general] Ganglia 3.1.2 -- Modules fail to load intermittently, Hosts disappear from the web interface, XML errors that can't be found

2009-05-15 Thread Carlo Marcelo Arenas Belon
On Fri, May 15, 2009 at 08:42:33AM -0500, Adam Tygart wrote:
 On Fri, May 15, 2009 at 04:32, Carlo Marcelo Arenas Belon
 care...@sajinet.com.pe wrote:
  On Thu, May 14, 2009 at 11:47:41AM -0500, Adam Tygart wrote:
 
  Since then I have been
  plagued with (what looked like) data errors, mis-reporting swap usage
  was the easiest to see.
 
  could you elaborate here?, is the value that gmond is collecting on each
  node incorrect?, is the aggregated in gmetad incorrect?, which one of the
  swap metrics is incorrect?
 
 Aggregate swap data being incorrect is the easiest to see.
 Here is the graph from a mis-reporting host (it doesn't always even
 send this information): http://imgur.com/io8gu.png
 
 Here is the resulting aggregate graph: http://imgur.com/trato.png
 The beginning of this graph is showing the correct data, I simply
 restarted gmond (on all non-webserver hosts), and the resulting swap
 usage was from one of them failing to send the correct data.

OK, the metric value is not incorrect, but is not being reported at all
which is why you have dips on your graph that fix themselves after several
minutes.

This is sadly a known issue, caused by the way that gmond registers metrics
dynamically and by the fact that some of those metrics aren't refreshed that
frequently, as described in the Release Notes (which mention the very visible
CPU count issue as an example). for more details see the discussion at :

  http://www.mail-archive.com/ganglia-general@lists.sourceforge.net/msg04275.html

And even though I agree it is a bug, it doesn't have a solution yet, and it is
not seen unless a gmond (any of them) is restarted.

a workaround is available: if you have to restart a gmond, restart first
its collector (the one that gmetad is looking at, and that the rest are
pointing to when using unicast), and restart ALL other gmond in the
cluster after that.
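
A dry-run sketch of that ordering, assuming ssh access to the nodes and the /etc/init.d/gmond init script. The host names are placeholders and restart_plan is a hypothetical helper that only prints the commands, so the order can be reviewed before anything is actually restarted.

```shell
#!/bin/sh
# Hypothetical helper: print restart commands for one cluster, collector first.
# $1 is the collector gmond (the one gmetad polls); the rest are the other nodes.
restart_plan() {
    collector="$1"
    shift
    echo "ssh $collector /etc/init.d/gmond restart"
    for node in "$@"; do
        # Skip the collector if it also appears in the node list.
        if [ "$node" != "$collector" ]; then
            echo "ssh $node /etc/init.d/gmond restart"
        fi
    done
}

# Placeholder host names; pipe the output to sh once the order looks right.
restart_plan collector01 node01 node02 collector01
```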

  The question I have is this: is this a known bug?
 
  some are, like the unicast send_metadata_interval or the cpu_count
  inconsistency as shown by the Important Notes, some others might not be
 
 I haven't been able to find the Important Notes document, is there a
 link to this somewhere?

sadly it is buried at the bottom of the Release Notes now :

  http://ganglia.wiki.sourceforge.net/ganglia_release_notes

and yes, I agree it should be moved to a better place as well.

 Is the cpu_count inconsistency the piece I mentioned about hosts
 disappearing from the web interface?

most likely the host disappearing from the web interface is because of
the send_metadata_interval and you trying to restart the gmond to fix it.

if it is not then we have a new bug ;)

  Is there something else I should try?
 
  rollback to 3.0, specially if you don't need the modules but want a more
  stable setup.
 
 This being Gentoo, I have no easy way of rolling back, as the 3.0.x
 builds have been removed from their tree.

OK, IMHO having ganglia 3.0 in their tree as well, in a different slot,
might be a good idea, but sadly I haven't filed that as a bug yet, nor can I
provide a working ebuild in a public overlay as a solution yet,
but of course you can still do your own binaries/packages if needed.

3.0 is still under development with 3.0.8 going to be released sometime soon
and future releases focusing mainly on stability and compatibility with 3.1,
as well as supporting all other architectures that are not yet working in
3.1.

 The whole reason I upgraded was because I wanted to make use of the
 python module support. I was previously using gmetric for monitoring
 things like PBS job count and temperature on my nodes. After a week or
 two of those scripts running, the load average on the systems started
 to climb. After a month, the load average increase caused by gmetrics
 was around 2-4 per host. A full 10% of my cluster's CPU utilization was
 caused by gmetrics alone (all system cpu).

most likely that was the spawn/fork cost, combined with the scripts being run
too frequently. 3.1 modules might be a good solution for that, but if the
metric collection itself is expensive (and I would assume it is, as I have
never seen that much consumption from my own gmetric calls, which are executed
every second) then you are not going to solve the problem by just moving
that expensive operation into gmond.

Carlo



Re: [Ganglia-general] Unable to write root epilog / missing metric information

2009-05-11 Thread Carlo Marcelo Arenas Belon
On Fri, May 08, 2009 at 12:15:19PM -0700, Jeff Orr wrote:
 Carlo Marcelo Arenas Belon wrote:
  On Thu, May 07, 2009 at 03:42:44PM -0700, Jeff Orr wrote:

  Upon the advice of Carlo, I updated the machine to Ganglia 3.0.7-1 with
  RRDtool rrdtool-1.2.30-1.rh9.rf (the furthest I could take FC5 without
  horrible dependency chasing). No dice. The CGI page renders correctly
  for about 1 minute after gmetad stop+start. Then blank page plus Root
  epilog error again.
  
 
  which CGI page?, any interesting messages in the error log for your
  webserver?

 This is the page provided by ganglia-web... /ganglia/index.php version
 3.0.7-1

and all cluster specific pages work fine then?, it is only the grid one
that is broken?

is there any error in your web (apache) logs?

 outputs
 server_thread() received request /?filter=summary from 127.0.0.1 -

you said it works fine for a while and then breaks, right?
when it breaks the same query could be used to reproduce the issue?

  $ echo /?filter=summary | netcat localhost 8652

  could you check the output of the gmetad request that is failing to see
  why/where it is getting aborted?, could it be there is some invalid
  character in the XML?, 

 Unlikely, but unknown at this point. Firefox renders the XML all right,
 but that doesn't rule out illegal UTF-8 characters.

to validate xml, sending the output of the previous netcat command to a file
and running

  $ xmllint output.xml

should return (echo $?) 0 if it is valid, and if not will pinpoint the source
of the problem, and hopefully give us enough information to get a workaround
and a fix.
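
The two steps above can be wrapped into one check. This is only a sketch: check_dump is a hypothetical helper, the query and port are the ones already used in this thread, and --noout is passed so that xmllint prints parse errors without echoing the whole document.

```shell
#!/bin/sh
# Hypothetical helper: report whether a saved gmetad dump is well-formed XML.
# xmllint exits with 0 when the document parses and non-zero otherwise.
check_dump() {
    if xmllint --noout "$1"; then
        echo "valid"
    else
        echo "invalid"
    fi
}

# Typical use, once the web frontend shows the error again:
#   echo "/?filter=summary" | netcat localhost 8652 > output.xml
#   check_dump output.xml
```

Running it from cron every minute and keeping the dumps that come back invalid is one way to catch a failure that only shows up intermittently.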

Carlo



Re: [Ganglia-general] Help: libconfuse problem

2009-05-10 Thread Carlo Marcelo Arenas Belon
On Fri, May 08, 2009 at 09:08:10PM +0800, Lee Amy wrote:
 
 Anyone have good idea on such problems?

use the binaries from :

  http://sajino.sajinet.com.pe/ganglia/epel/EL-5/i386/

which are just ready to use rebuilt packages from fedora for Red Hat/CentOS
5 in x86.

Carlo



Re: [Ganglia-general] 3.1.2 compile fails (3.1.1 compile o.k.)

2009-05-10 Thread Carlo Marcelo Arenas Belon
On Fri, May 08, 2009 at 10:30:31AM -0400, Justin R. Davis wrote:
 
 OK, I finally figured out how to get 3.1.2 to compile.  If I use the 
 aclocal.m4 file from 3.1.1 in my 3.1.2 unmodified source, I am able to
 compile and run ganglia successfully.

I was afraid this might be the problem based on your original report,
since as I mentioned originally there were no changes to the build code
between 3.1.1 and 3.1.2 that could explain this.

3.1.2 was, however, bootstrapped with newer/different versions of the
autotools than the ones used for all other releases (the ones provided by
CentOS 4), and that is where this aclocal.m4 comes from.

another workaround (and the one I was testing in CentOS 5 x86_64) was
to remove the 32bit version of libraries which conflicted :

  # rpm -ev expat-devel.i386
  # rpm -ev apr-devel.i386

 This made me wonder...Isn't this file supposed to be generated?  So I
 deleted the aclocal.m4 file that is currently distributed with 3.1.2 and
 tried building again...and voila!  it works fine.  When completed, I
 diff'ed the aclocal.m4 file which was created and it is now identical to
 the 3.1.1 version.

for this to work you need to have automake installed, as it will basically
force a rebootstrapping of the package (hence it also requires libtool
and the other packages needed for bootstrapping, like autoconf).

instead of removing aclocal.m4 you can explicitly rebootstrap, if that
is the case, by running :

  # autoreconf

 So if there are any developers out there...you might want to look into 
 this...

note taken by adding it to the release notes :

  http://ganglia.wiki.sourceforge.net/ganglia_release_notes

won't be a problem for 3.1.3; promise.

Carlo



Re: [Ganglia-general] 3.1.2 compile fails (3.1.1 compile o.k.)

2009-05-07 Thread Carlo Marcelo Arenas Belon
On Thu, May 07, 2009 at 10:23:22AM -0400, Justin R. Davis wrote:

 When I try and compile 3.1.2 I run into a problem linking with the expat  
 library.

do you have both a 32bit expat (/usr/lib) and 64bit expat (/usr/lib64)?
which distribution/version?
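
One quick way to answer that is to list every copy of the library under both directories. A minimal sketch, where find_lib is a hypothetical helper and /usr/lib and /usr/lib64 are simply the two locations named above:

```shell
#!/bin/sh
# Hypothetical helper: list shared and static copies of lib<name> found in
# the given directories, e.g. find_lib expat /usr/lib /usr/lib64
find_lib() {
    name="$1"
    shift
    for dir in "$@"; do
        for candidate in "$dir/lib$name".so* "$dir/lib$name.a"; do
            # An unmatched glob stays literal, so test that the file exists.
            if [ -e "$candidate" ]; then
                echo "$candidate"
            fi
        done
    done
}

find_lib expat /usr/lib /usr/lib64
```

If both directories show a copy, the linker may be picking up the 32bit one first, which matches the symptom described in this thread.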

 Ganglia's libtool want to use the version in /usr/lib although
 I believe it should be using the version in /usr/lib64

that is the problem, ganglia's configure is sadly not that smart, as it
is only since 3.1.0 that it uses external library dependencies,
and for whatever reason your setup is getting it confused.

 If I try and compile 3.1.1 on the same computer, using the same configure
 arguments, it builds fine:

this is therefore a regression, at least for your configuration. to look
into it the output of your config.log might be needed, so feel free to send
it directly (as it might be too big for the list), or attach it to a bug
report with all other relevant information.

since there shouldn't have been any changes between 3.1.1 and 3.1.2 that
could trigger this AFAIK, it would also be interesting to see what the working
3.1.1 log shows.

if you could extract the relevant information for this thread, even better,
as that would help other people who later have the same problem.

 Any idea how this issue can be fixed in the 3.1.2 version?

as a workaround you might be able to use :

  ./configure --libdir=/usr/lib64

Carlo



Re: [Ganglia-general] Help: libconfuse problem

2009-05-07 Thread Carlo Marcelo Arenas Belon
On Thu, May 07, 2009 at 10:52:35PM +0800, Lee Amy wrote:

 /usr/bin/ld: /usr/local/lib/libconfuse.a(confuse.o): relocation
 R_X86_64_32 against `a local symbol' can not be used when making a
 shared object; recompile with -fPIC
 /usr/local/lib/libconfuse.a: could not read symbols: Bad value

you are trying to build a non-static version of ganglia linked against
a statically compiled libconfuse, you could either :

1) rebuild your libconfuse as a dynamic library (off by default)

  confuse $ ./configure --enable-shared

  EPEL packages are dynamic and could be used instead of a custom build library
  as well in your setup AFAIK :

http://download.fedora.redhat.com/pub/epel/5/i386/repoview/libconfuse.html

2) build ganglia statically (would also build all modules statically) :

  ganglia $ ./configure --enable-static-build

Carlo



Re: [Ganglia-general] Unable to write root epilog / missing metric information

2009-05-07 Thread Carlo Marcelo Arenas Belon
On Thu, May 07, 2009 at 10:43:46AM -0700, Jeff Orr wrote:
 Carlo Marcelo Arenas Belon wrote:
 
  I'd suggest you upgrade to 3.0.7 compiled against a newer version of
  rrdtool to see if the problem goes away as well for you and if not we
  might need to reopen that bug and do more investigation on the source of it
 
  in any case you want to upgrade your gmetad, since it has a vulnerability
  as explained in :
 
http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=223
 
  so don't forget to also apply the following patch on top of it :
 
http://bugzilla.ganglia.info/cgi-bin/bugzilla/attachment.cgi?id=189

 So, should I patch 3.0.7 and upgrade, or is the upgrade to 3.0.7 sufficient?

upgrade to 3.0.7 and apply the patch, this will be included in 3.0.8 when
it is released and is also included (sorta) in 3.1.2.

 
  3.0 and 3.1 are not compatible at the network/configuration layer and so
  you can't upgrade gmond to 3.1 if you are still using 3.0 in the same cluster
  and if you update all the gmond in a cluster to 3.1 you have to use a new
  configuration as described in the release notes.
 
  since gmond wasn't your problem here, I'd suggest you better rollback this
  update for now, and focus in the gmetad problem.
 
 It was my understanding that 3.1.x and 3.0.x could communicate, i.e. a
 3.0.x cluster could talk to a master running 3.1.x. I distinctly
 remember it on the main page for configuring 3.1. Oh well...

yes, a 3.1 gmetad can talk to a 3.0 gmond or a 3.1 gmond, as long as they are
not part of the same cluster. my suggestion was just to simplify your setup
and the number of changes so that you could get a better grip on the problem.

 I'll try 3.0.7 with new RRDtool and see how it goes.

3.0 links RRDtool statically, so you don't really need an RRDtool package
installed but only the static library (built from source and installed) in
the machine where you are building your new gmetad.

Carlo



Re: [Ganglia-general] Unable to write root epilog / missing metric information

2009-05-07 Thread Carlo Marcelo Arenas Belon
On Thu, May 07, 2009 at 03:42:44PM -0700, Jeff Orr wrote:
 Upon the advice of Carlo, I updated the machine to Ganglia 3.0.7-1 with
 RRDtool rrdtool-1.2.30-1.rh9.rf (the furthest I could take FC5 without
 horrible dependency chasing). No dice. The CGI page renders correctly
 for about 1 minute after gmetad stop+start. Then blank page plus Root
 epilog error again.

which CGI page?, any interesting messages in the error log for your
webserver?

could you check the output of the gmetad request that is failing to see
why/where it is getting aborted? could it be there is some invalid
character in the XML? do you have anything else other than the web
frontend talking to this gmetad? if the frontend is stopped, do any of
the following commands reproduce the problem?

  $ telnet localhost 8651
  $ echo / | netcat localhost 8652

Carlo



Re: [Ganglia-general] Unable to write root epilog / missing metric information

2009-05-06 Thread Carlo Marcelo Arenas Belon
On Wed, May 06, 2009 at 03:38:34PM -0700, Jeff Orr wrote:
 
 We were using 3.0.4.fc5, so I tried updating to 3.1.2 to see if it fixes
 the problem.

this was reported before as a bug in 3.0 which magically went away
somewhere in the last versions of 3.0 (it might have been because of using
newer versions of rrdtool as well).

  http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=42

I'd suggest you upgrade to 3.0.7 compiled against a newer version of
rrdtool to see if the problem goes away as well for you and if not we
might need to reopen that bug and do more investigation on the source of it

in any case you want to upgrade your gmetad, since it has a vulnerability
as explained in :

  http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=223

so don't forget to also apply the following patch on top of it :

  http://bugzilla.ganglia.info/cgi-bin/bugzilla/attachment.cgi?id=189

 Now in addition, we are getting these messages on gmond startup

3.0 and 3.1 are not compatible at the network/configuration layer and so
you can't upgrade gmond to 3.1 if you are still using 3.0 in the same cluster
and if you update all the gmond in a cluster to 3.1 you have to use a new
configuration as described in the release notes.

since gmond wasn't your problem here, I'd suggest you roll back this
update for now, and focus on the gmetad problem.

Carlo



Re: [Ganglia-general] Is there any documentantion about libganglia ??

2009-05-05 Thread Carlo Marcelo Arenas Belon
On Tue, May 05, 2009 at 11:45:15AM -0300, ricardo figueiredo wrote:
 
 I would like know if there is some documentation about libganglia.

no, and it doesn't have a stable ABI yet either.

Carlo



Re: [Ganglia-general] gmond won't start at boot

2009-05-03 Thread Carlo Marcelo Arenas Belon
On Sun, May 03, 2009 at 11:43:33PM -0300, David Chinellato wrote:

 should, of course. I looked at /var/log/daemon.log and found the following
 error message related to gmond:
 
 /usr/sbin/gmond[1956]: Error creating multicast server
 mcast_join=239.2.11.71 port=8649 mcast_if=NULL family='inet4'. Exiting.

the problem here is that gmond is failing to start because it can't bind
to the network, since it was started before the network was up.

adding to the INIT INFO section of your init script :

# Required-Start:   $network

should add the needed dependency that upstart requires to know when to
execute it.
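
For context, a complete header block might look like the sketch below. Only the Required-Start line comes from the advice above; every other field (the Provides name, the Required-Stop line, the runlevels and the description) is an assumption to be adapted to the actual init script shipped with your package.

```shell
### BEGIN INIT INFO
# Provides:          gmond
# Required-Start:    $network
# Required-Stop:     $network
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Ganglia monitoring daemon (illustrative values)
### END INIT INFO
```

The block is made of plain comments, so adding it near the top of the script does not change what the script does when it runs.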

Carlo

PS. untested as I have no jaunty available, but that is the syntax that
IMHO is needed in hardy, for more information check with ubuntu, as this
is an ubuntu specific problem



Re: [Ganglia-general] gmond won't start at boot

2009-05-03 Thread Carlo Marcelo Arenas Belon
On Mon, May 04, 2009 at 03:50:12AM +, Carlo Marcelo Arenas Belon wrote:
 
 PS. untested as I have no jaunty available, but that is the syntax that
 IMHO is needed in hardy, for more information check with Ubuntu, as this
 is an Ubuntu specific problem

spun up a jaunty amd64 VM and rebuilt the debian experimental package for
3.1.2 to see if the problem you reported was also happening with it.

seems to be working just as expected, at least for gmond (using dhcp in
the network though).

if interested binary packages are available from :

  http://sajino.sajinet.com.pe/ganglia/jaunty/amd64/

and the package for debian experimental (linked from our release page AFAIK)

  http://packages.debian.org/source/experimental/ganglia

Carlo



Re: [Ganglia-general] gmetad : Permission denied to rrd_rootdir

2009-04-14 Thread Carlo Marcelo Arenas Belon
On Mon, Apr 13, 2009 at 04:00:05PM -0400, Jorge Medina wrote:
  
 rrd_rootdir  /home/developer/opt/ganglia/var/rrds
  
 and I am getting the following error:
  
 Going to run as user nobody
 Please make sure that /home/developer/opt/ganglia/var/rrds exists:
 Permission denied
  
 Who should own the directory and/or what permissions need to be
 available in that directory?

it has to be owned by nobody, and that user has to be able to access it (+x)
and write to it (+w) so that it can create a directory hierarchy for
your clusters/nodes/metrics.

remember that to be able to access a directory you also have to be able
to access all of its parent directories starting from /, and that will
basically require the home directory for developer to be publicly
accessible in most cases.
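
a quick way to find the directory that blocks the daemon is to list the
permissions of every component of the path, starting at /. the following is
only a sketch in POSIX shell; walk_path is a made up helper name (it is not
part of ganglia) and it just prints the permissions, it does not evaluate
them for a given user:

```shell
# walk_path: print "ls -ld" for / and then for every directory component
# of an absolute path (hypothetical helper, not shipped with ganglia).
# The body runs in a subshell so the IFS and globbing changes stay local.
walk_path() (
    current=
    ls -ld /
    IFS=/        # split the argument on slashes
    set -f       # do not expand wildcards while splitting
    for part in $1; do
        [ -n "$part" ] || continue
        current="$current/$part"
        ls -ld "$current" || exit 1
    done
)

# usage: walk_path /home/developer/opt/ganglia/var/rrds
```

any line in the output that lacks x for "other" (and is not owned by nobody)
is a directory that user cannot traverse.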

 Currently it is owned by my user developer, I tried giving rwx
 pemission to everyone, but I still get the same message.

if you want to run gmetad as user developer then you have to start
gmetad as that user and disable setuid in /etc/ganglia/gmetad.conf :

  setuid off

Carlo



Re: [Ganglia-general] leopard launchd plist

2009-04-14 Thread Carlo Marcelo Arenas Belon
On Mon, Apr 06, 2009 at 01:43:40PM -0400, Evans, Ryan E wrote:
 If anyone is interested I have working plist files for starting both
 gmond and gmetad using launchd under OSX leopard (10.5.6).

do you have a working build for Leopard? which version of ganglia are
you using? AFAIK 10.5 was broken because there was no longer
a public header for kvm, as shown in :

  http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=168

 New to the list and not sure if this has been covered already.

not a MacOSX person myself, but I presume they are just text files, and if
that is the case they could be added to the contrib directory with some
instructions.

Carlo



Re: [Ganglia-general] Building Ganglia on Solaris 10 (amd64)

2009-04-07 Thread Carlo Marcelo Arenas Belon
On Mon, Apr 06, 2009 at 06:02:45PM -0400, Jorge Medina wrote:
 
 At this point, I get the following errors:
 
  gcc -DHAVE_CONFIG_H -I. -I. -I.. -m64 -O3 -std=c99
 -I/home/sdkadmin/opt/ganglia/apr/include/apr-1 -I.. -I../../lib
 -I../../include -m64 -O3 -std=c99
 -I/home/sdkadmin/opt/ganglia/apr/include/apr-1 -Wall -DHAVE_STRERROR -MT
 metrics.lo -MD -MP -MF .deps/metrics.Tpo -c metrics.c  -fPIC -DPIC -o
 .libs/metrics.o
 In file included from /usr/include/procfs.h:26,
  from metrics.c:62:
 /usr/include/sys/procfs.h:111: error: field `pr_action' has incomplete
 type
 /usr/include/sys/procfs.h:112: error: syntax error before stack_t
 /usr/include/sys/procfs.h:130: error: syntax error before '}' token
 /usr/include/sys/procfs.h:164: error: syntax error before lwpstatus_t

this is a bug in the Solaris headers that is triggered by -std=c99 while
building libmetrics. as a workaround you can build metrics.c without
that flag and then use the resulting object to generate the static library
that will be linked with gmond later.
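
one way to apply the workaround is to rerun the failing compile of metrics.c
with the same flags minus -std=c99. a small sketch, assuming the flags are
kept in a shell variable; strip_flag is a hypothetical helper and not part of
the ganglia build system:

```shell
# strip_flag FLAG ARGS...: print ARGS with every occurrence of FLAG removed
# (hypothetical helper for illustration only).
strip_flag() {
    unwanted=$1
    shift
    result=
    for flag in "$@"; do
        [ "$flag" = "$unwanted" ] && continue
        result="${result:+$result }$flag"
    done
    printf '%s\n' "$result"
}

# usage: CFLAGS=$(strip_flag -std=c99 $CFLAGS)
```

the filtered flags would then be used only for metrics.c; the rest of the
tree can keep building with -std=c99.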

Carlo



Re: [Ganglia-general] multiple gmond daemons on one host

2009-01-10 Thread Carlo Marcelo Arenas Belon
On Fri, Jan 09, 2009 at 08:00:13PM -0500, Jesse Alvarez wrote:
 Good question. My understanding is that you need one gmond as a
 collector per cluster.

one, or more if you need redundancy, but that gmond is in no way special
(other than the fact that it listens on TCP 8649 by default), so there is no
need to have a dedicated collector server running that gmond when any other
gmond already running on your nodes could do it.

Carlo



Re: [Ganglia-general] Setting up Ganglia

2008-11-28 Thread Carlo Marcelo Arenas Belon
On Fri, Nov 28, 2008 at 10:59:41AM +0200, Johann Spies wrote:
 
 /usr/local/sbin/gmond -t  /etc/ganglia/gmond.conf

I presume you restarted gmond after changing the configuration

 sudo gstat -a
 CLUSTER INFORMATION
        Name: unspecified
       Hosts: 1
 Gexec Hosts: 0
  Dead Hosts: 0
   Localtime: Fri Nov 28 10:55:13 2008
 
 CLUSTER HOSTS
 Hostname                     LOAD                       CPU              Gexec
  CPUs (Procs/Total) [     1,     5, 15min] [  User,  Nice, System, Idle, Wio]
 
 head001.sun.ac.za
     4 (1/  293) [  0.94,  0.56,  0.76] [  24.9,   0.0,   0.7,  73.2,   1.3] OFF
 
 With no information from comp001-comp021 although gmond is running on
 each one of them.

if iptables is off in all of them and they are all (including head001) in the
same network segment, then you have a network problem.

# tcpdump host 239.2.11.71

should show you are getting multicast messages from all the nodes, but
probably only shows packets from head001 instead.

 The reason for my previous mail was that the person described a
 solution in his/her situation using monocasting and not multicasting.
 I suspect multicasting is not working on my system.

to change to unicast what you have to do is change the configuration
from :

/* Feel free to specify as many udp_send_channels as you like.  Gmond 
   used to only support having a single channel */ 
udp_send_channel { 
  mcast_join = 239.2.11.71 
  port = 8649 
  ttl = 1 
} 

/* You can specify as many udp_recv_channels as you like as well. */ 
udp_recv_channel { 
  mcast_join = 239.2.11.71 
  port = 8649 
  bind = 239.2.11.71 
} 

into (assuming all your nodes can resolve that name, otherwise use the ip)

udp_send_channel {
  host = head001.sun.ac.za
  port = 8649
}

udp_recv_channel {
  port = 8649
}

then restart all your gmond with the new configuration (I presume iptables
is disabled in all the other nodes as well)

Carlo



Re: [Ganglia-general] gmetric fails when disk is unwriteable?

2008-11-28 Thread Carlo Marcelo Arenas Belon
On Wed, Nov 26, 2008 at 08:40:55AM -0700, Brad Nicholes wrote:
  On 11/26/2008 at 3:45 AM, in message
 [EMAIL PROTECTED], Martin Knoblauch
 [EMAIL PROTECTED] wrote:
  
  From: Brad Nicholes [EMAIL PROTECTED]
  
   On 11/25/2008 at 10:14 AM, in message 
  [EMAIL PROTECTED],
  Ofer Inbar wrote:
   Brad Nicholes wrote:
   It needs a temp directory to get around some issues with libconfuse.
   Libconfuse doesn't actually support wildcard paths or files.  A
   libconfuse include statement must have a full path to the file that
   it is going include.  So gmond makes up for this problem by creating
   a temp file, resolving all of the file paths and names and then
   writing them as separate includes in a temp file.  Then it tells
   libconfuse to include the temp file directly.  Without the ability
   to resolve the wildcard paths and write them to a temp file, the
   wildcarding feature of gmond wouldn't work.  To solve the problem
   that you are describing, we would have to actually add wildcard
   capability to libconfuse.
   
   Might this be cleaner workaround that would work for gmond as well?
   
- override libconfuse's include function as you're already doing
- resolve file paths and names as you're already doing
- instead of writing that to a temp file and telling libconfuse to
  include that file, just tell libconfuse to include each individual
  file (the same filenames you're now writing to the temp file)
   
  
  No, libconfuse doesn't work that way.  The include handler can only
  manipulate the file path that it is handed.  So the result of the handler
  has to be a single absolute file path.  There isn't any way to take a
  single file path as input into the handler and return multiple file paths
  back to libconfuse.  The only way to do it was to write all of the
  individual file paths to a file and then hand libconfuse back a single
  file path to the new include file.
  
  
   the question is: can't the handler be rewritten to the conversions in 
  memory, without needing to write a temp file? This would make the process 
  more robust. You never know when a disk is full, or goes RO.
 
 No, I tried doing that already but was unsuccessful.  Libconfuse
 is limited in what you can do in this area.

the API libconfuse exports is limited to handling single file includes
(as documented) so it shouldn't be a surprise that it wouldn't handle a
wildcard include with it.

 The problem is that when libconfuse wants to read in the include file,
 it is in the middle of the lexer and needs to continue.  A handler can't
 just read the file and hand it back to libconfuse through some other
 cfg_* call.

an alternative would be to preprocess the configuration file and feed it
into a buffer in memory, resolving all includes, and then call
libconfuse to parse and process the buffer instead.
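
to illustrate the preprocessing step only, here is a rough sketch in shell
that expands wildcard includes into a single stream on stdout. it is not the
gmond implementation; expand_includes is a made up name, it assumes include
lines of the form include ('/path/*.conf') on a line of their own, and it does
not handle nested includes or paths with spaces. in C the resulting text
would be handed to a buffer parser (libconfuse has cfg_parse_buf), but
whether that fits our handler is untested here:

```shell
# expand_includes FILE: print FILE with every wildcard include line
# replaced by the contents of the files it matches (hypothetical sketch).
expand_includes() {
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            include*)
                # pull the quoted pattern out of: include ('pattern')
                pattern=$(printf '%s\n' "$line" | sed -n \
                    "s/^include[[:space:]]*([[:space:]]*['\"]\(.*\)['\"][[:space:]]*).*$/\1/p")
                if [ -n "$pattern" ]; then
                    # unquoted on purpose so the shell expands the wildcard
                    for f in $pattern; do
                        if [ -f "$f" ]; then cat "$f"; fi
                    done
                else
                    printf '%s\n' "$line"
                fi
                ;;
            *)
                printf '%s\n' "$line"
                ;;
        esac
    done < "$1"
}

# usage: expand_includes /etc/ganglia/gmond.conf
```

nothing is written to disk in this approach, so a full or read-only
filesystem would no longer matter for the include handling.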

this would also have the nice side effect of preventing gmond/gmetric from
segfaulting if there is no gmond.conf (hence using the embedded
configuration) and there are files in the include path (as documented in
the release notes since 3.1 for requiring gmond.conf if using modpython).

 This may be a design flaw in libconfuse but it is the way it works now
 and we have to live with it.

since AFAIK no libconfuse developer was ever notified of this flaw, it
might as well be that our implementation is abusing their API.

will check with them and report back with any suggestions.

Carlo



Re: [Ganglia-general] gmetric fails when disk is unwriteable?

2008-11-26 Thread Carlo Marcelo Arenas Belon
On Tue, Nov 25, 2008 at 04:33:05PM -0700, Brad Nicholes wrote:
 
 The result was that if the wildcard produced more than 10 included files
 (which it easily does even in our default configuration), libconfuse
 choked because it thought it had hit the maximum nesting level

our RPMs for ganglia only install 3 files in /etc/ganglia/conf.d; gentoo
has 2 and fedora 10 (just released) has 4.

even if I agree that 10 is somewhat low and you would expect it to become
problematic soon as more modules are deployed, it would seem that at
least in this case, one problem was traded for another.

Carlo



Re: [Ganglia-general] gmetric fails when disk is unwriteable?

2008-11-25 Thread Carlo Marcelo Arenas Belon
On Mon, Nov 24, 2008 at 04:55:42PM -0700, Brad Nicholes wrote:
  On 11/24/2008 at 3:47 PM, in message [EMAIL PROTECTED],
 Ofer Inbar [EMAIL PROTECTED] wrote:
   I tried feeding one of my custom metrics by hand:
   [root ~]$ gmetric --name net_smtp_fin_wait2_out --value 0 --type uint8 \
       --units 'connections'
   /etc/ganglia/gmond.conf:94: failed to determine the temp dir
   Parse error for '/etc/ganglia/gmond.conf'
 
 It needs a temp directory to get around some issues with libconfuse.

gmond does; gmetric doesn't need anything more than to know which
channel to use (hence nothing in the includes) and it is getting
blocked by this restriction because of its use of libganglia to
read gmond's configuration through libgmond.

 To solve the problem that you are describing, we would have to actually
 add wildcard capability to libconfuse.

libconfuse is instructed to use our implementation for includes and that
uses a temporary file, so this is fixable in our code.

a fix to the problem reported by Ofer only needs our handler modified
so that failures to create temporary files to handle includes are not
treated as fatal (committed as revision 1922).

Carlo



Re: [Ganglia-general] gmetric fails when disk is unwriteable?

2008-11-25 Thread Carlo Marcelo Arenas Belon
On Fri, Nov 21, 2008 at 11:33:05PM -0500, Ofer Inbar wrote:
 
 What's the dependency that causes gmetric to require that the
 filesystem the CWD is on be writeable?

as explained by Brad it is not the CWD that needs to be writeable but a
TMPDIR (which for root can also be the current directory) and that is
detected by APR.

Recent Linux (since around kernel 2.4.16) requires a ramdrive mounted in
/dev/shm, so one way to work around this problem is to define :

  TMPDIR=/dev/shm

3.0 gmetric is not affected and so could be also used as an alternative.
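
if the same script has to run on nodes where /tmp may or may not be
writeable, the TMPDIR choice can be made at run time. a minimal sketch,
assuming the candidate list below suits your systems; first_writable_dir is
a hypothetical helper name:

```shell
# first_writable_dir DIR...: print the first argument that is an existing,
# writeable directory; fail if none is (hypothetical helper).
first_writable_dir() {
    for dir in "$@"; do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# usage (commented out because gmetric may not be installed here):
# TMPDIR=$(first_writable_dir /tmp /var/tmp /dev/shm) \
#     gmetric --name net_smtp_fin_wait2_out --value 0 --type uint8 --units connections
```

the -w test should also fail for a directory on a read-only filesystem,
which is the case reported in this thread.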

Carlo

PS. SysVinit workaround for gmond committed as revision 1923



Re: [Ganglia-general] Gmond Stops Sending Metric after Ntpd Adjusted System Time

2008-11-15 Thread Carlo Marcelo Arenas Belon
On Fri, Nov 14, 2008 at 05:03:19PM +0800, ada wrote:
 
After ntpd adjusts the system time from 8pm to
12am, gmond stops sending any metric out. With the help of tcpdump, I
don't see gmond sending any udp message out.

you mean your node clock was off and when ntpd started it was
adjusted forward 4 hours, backward 8 hours or backward 20 hours?

I guess there is some mechanism in gmond that assure gmond sending
information certain time after last sending.

gmond's main job is to schedule metric collections, and for that it needs a
monotonically increasing time source.

gmond will find that last sending time is in the future. In this
way, it will wait several hours to send out information again.

this is most likely true, and a bug if that is the case (I haven't seen
a gmond recover from this though, so it might be even worse)
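
to make the suspected failure mode concrete, here is the arithmetic as a
small shell sketch. this is not gmond's scheduler code; the function names
and the 20 second interval are assumptions for illustration, and times are
plain seconds:

```shell
# seconds_to_wait LAST_SENT INTERVAL NOW
# naive scheduling: wait until LAST_SENT + INTERVAL, trusting the wall clock
seconds_to_wait() {
    delay=$(( $1 + $2 - $3 ))
    if [ "$delay" -lt 0 ]; then delay=0; fi
    printf '%s\n' "$delay"
}

# seconds_to_wait_guarded LAST_SENT INTERVAL NOW
# same, but a LAST_SENT in the future (clock stepped backwards) never
# delays the next send by more than one interval
seconds_to_wait_guarded() {
    delay=$(seconds_to_wait "$1" "$2" "$3")
    if [ "$delay" -gt "$2" ]; then delay=$2; fi
    printf '%s\n' "$delay"
}
```

with a last send at second 72000 (8pm) and the clock stepped back 20 hours
to second 0, the naive version waits 72020 seconds while the guarded one
waits 20.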

in any case, keeping the time in sync across your whole cluster is a
prerequisite to having a good ganglia setup, as the local time from the nodes
is trusted in several places.

Carlo



Re: [Ganglia-general] High system load when gmond is running

2008-11-15 Thread Carlo Marcelo Arenas Belon
On Thu, Nov 13, 2008 at 03:08:46PM -0800, [EMAIL PROTECTED] wrote:
I looked into it further and it looks like my problem isn't gmond its
gmetad.

gmetad is an IO intensive application and therefore will raise your load
because of blocked processes if your disk can't keep up with it.

the directory where your RRDs are being stored (usually /var/lib/ganglia)
should be able to handle high IOPS for all the updates to the RRD files
for all the nodes you are collecting data from.

in the worst case (and since the RRD files are usually small and systems
usually have more memory than they need) you could probably store them on a
ramdrive.

Carlo



Re: [Ganglia-general] Help with HP-UX please

2008-11-14 Thread Carlo Marcelo Arenas Belon
On Fri, Nov 14, 2008 at 07:59:48AM +, Lockwood, David wrote:
I am looking for a how-to that is specific for compiling gmond for HP-UX
11.11 on PA-RISC please

the latest release of ganglia (3.1.1) won't work on HP-UX, but the
older release from the maintenance branch (3.0.7) should compile cleanly, at
least using gcc, by doing :

  $ ./configure
  $ make
  $ make install

sadly AFAIK none of the developers has access to an HP-UX system, so
providing a binary will most likely be impossible. if you find any issues
and are willing to help troubleshoot them, it might be possible to get them
resolved though.

Carlo



Re: [Ganglia-general] Help with HP-UX please

2008-11-14 Thread Carlo Marcelo Arenas Belon
On Fri, Nov 14, 2008 at 09:01:18AM +, Lockwood, David wrote:
 Will a 3.0.7 gmond process be able to provide data for a 3.1.0 gmetad
 process?

yes, as long as you keep it talking only with other 3.0 gmond (if using
multicast, then you will need to use a different multicast IP for your 3.0
gmond, and use a different datasource in your gmetad to poll them).

a 3.1 gmetad is able to read the XML generated by a 3.0 gmond, but the XDR
protocol that gmond instances use to talk to each other is incompatible
between 3.0 and 3.1

 I had issues with this on the Windows servers that I monitor and had to
 make sure that the versions were the same.

the same applies to windows, but considering that 3.1 has far better support
for windows metrics than 3.0 it is probably a good idea to upgrade them to 3.1
anyway.

Carlo



Re: [Ganglia-general] BOF at LISA?

2008-11-07 Thread Carlo Marcelo Arenas Belon
On Fri, Nov 07, 2008 at 03:46:41PM -0800, Bernard Li wrote:
 
 On Fri, Nov 7, 2008 at 3:38 PM, Ofer Inbar [EMAIL PROTECTED] wrote:
 
  I don't know that most Ganglia users who are attending LISA are also
  on this email list - I would expect the number who aren't on this list
  is greater than the number who are.
 
 If you have a good idea on how to reach out to them in the next few
 days, I'm all ears.

from what I recall, BOFs are pretty easy to set up; you can just follow
the instructions from :

  http://www.usenix.org/events/lisa08/bofs.html

most of the people interested in monitoring might be already aligned for the
BOFs in the Hampton room for Wednesday anyway and the BOF schedule is
published to all participants at registration as well.

Carlo



Re: [Ganglia-general] Gmond losing data?

2008-10-30 Thread Carlo Marcelo Arenas Belon
On Thu, Oct 30, 2008 at 12:18:07PM -0400, Lars Kellogg-Stedman wrote:
 Hello all,
 
 I'm experiencing some odd problems with Ganglia (3.1.1, under CentOS
 5).  Sometimes, gmond stops collecting data from remote hosts.

If you are using unicast and you restarted your collector (the gmond that
gmetad is pulling from) then you are running into a known issue with the way
that metadata is getting updated mentioned in the release notes :

  http://ganglia.wiki.sourceforge.net/ganglia_release_notes

The suggested configuration is to set send_metadata_interval to a non-zero
value (in seconds) in the globals section of gmond.conf on the sending nodes.
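
to audit a set of nodes for this setting, something like the following
sketch could be used. metadata_interval is a hypothetical helper, and
treating a missing setting as 0 (never resend) is an assumption based on the
behaviour described in the release notes:

```shell
# metadata_interval FILE: print the send_metadata_interval configured in a
# gmond.conf, or 0 if the setting is absent or commented out
# (hypothetical helper, not shipped with ganglia).
metadata_interval() {
    value=$(sed -n \
        's/^[[:space:]]*send_metadata_interval[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' \
        "$1" | tail -n 1)
    printf '%s\n' "${value:-0}"
}

# usage: [ "$(metadata_interval /etc/ganglia/gmond.conf)" -gt 0 ] || echo "metadata is never resent"
```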

 Gmond merrily ignores the data.  It seems that restarting the *local*
 gmond does not correct the problem, but restarting the *remote*
 (sending) gmond does make things start working again.

because the metadata is resent.

Carlo



Re: [Ganglia-general] Monitoring NFS share disk usage

2008-10-28 Thread Carlo Marcelo Arenas Belon
On Tue, Oct 28, 2008 at 05:30:25PM +1100, Adam Mitchell wrote:
 
#!/bin/bash
VALUE=$(df /home/ | grep /home |awk '{print $3 }')
gmetric --name disk_nfs_used --value $VALUE --type uint32 --units Bytes

not relevant for your problem but units here should be KB
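
a slightly more defensive version of the script, as a sketch: df is asked
explicitly for 1024-byte units and for the POSIX one-line format, so a long
NFS device name cannot wrap the line and shift the columns. the function
names are made up; the gmetric options are the same ones used above:

```shell
# used_kb PATH: print the used space, in KB, of the filesystem holding PATH
used_kb() {
    # -P: one line per filesystem, -k: 1024-byte units; line 2 is the data
    df -P -k "$1" | awk 'NR == 2 { print $3 }'
}

# report_nfs_usage: publish the value with units that match what df reports
# (defined only; call it from cron on the head node)
report_nfs_usage() {
    gmetric --name disk_nfs_used --value "$(used_kb /home)" \
        --type uint32 --units KB
}
```

note that a uint32 counting KB tops out just under 4 TiB, so a larger share
would need a different type or unit.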

gmond is running on the head node. However, there doesn't seem to be any
rrd's being produced.

the rrds are generated by gmetad, which in turn reads the metrics from your
gmond. for that to work on your setup you need to configure (/etc/gmond.conf)
the gmond in your head node with the same cluster name (shiva) and collector
as your work nodes, and then confirm that your gmetad is configured
(/etc/gmetad.conf) to pull the status for your cluster (shiva) including your
new metric.

telnet to port 8649 in your collector (any gmond if using multicast, or the
one that your gmetad is pointing to if using unicast) should dump an XML
description of your cluster and include the new metric you just created with
gmetric inside a host definition from your head node.
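
the same check can be scripted instead of reading the dump by eye. a sketch,
assuming the dump has one element per line with <HOST NAME="..."> and
<METRIC NAME="..."> tags; has_metric is a hypothetical helper and nc is
used only as a stand-in for telnet:

```shell
# has_metric HOST METRIC: read a gmond XML dump on stdin and succeed only if
# METRIC appears inside the HOST element (hypothetical helper).
has_metric() {
    awk -v host="$1" -v metric="$2" '
        /<HOST /   { current = (index($0, "NAME=\"" host "\"") > 0) }
        /<\/HOST>/ { current = 0 }
        current && /<METRIC / && index($0, "NAME=\"" metric "\"") > 0 { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# usage (the collector name is whatever your gmetad points to):
# nc shiva.edag.cluster 8649 | has_metric shiva.edag.cluster disk_nfs_used
```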

I have added the following lines to the cluster_view.tpl
<IMG HEIGHT="147" WIDTH="395" ALT="{cluster} DISK"
 SRC="./graph.php?c=shiva&amp;h=shiva.edag.cluster&amp;v=233.904&amp;m=disk_nfs_used&amp;r=hour&amp;z=medium&amp;jr=&amp;js=&amp;st=1225161877&amp;vl=GB">
</TD>

v, st, z, r are better pulled from the environment as you will be
otherwise hardcoding some of the values for your graph.

since you are trying to import a metric graph in a cluster view, that might
not work correctly anyway and so changes to graph.php might be needed too.

Carlo



Re: [Ganglia-general] Ganglia support in AIX

2008-10-20 Thread Carlo Marcelo Arenas Belon
On Mon, Oct 20, 2008 at 09:17:08AM -0400, Sai p Seshasayee wrote:
 
We have been using Ganglia to monitor our cluster on Linux and now we are
planning to use Ganglia to monitor our AIX clusters. I have visited the
Ganglia website and found no rpms for AIX (version 3.1.1).

3.1.1 runs on AIX but has no way yet to use DSO metrics (it could probably
run python metrics though) and because of that 3.1.1 is almost equivalent to
3.0.7 (except that some bugs were fixed and others introduced, and they
have a different unit for the disk metrics)

I am just
interested to know if Ganglia is supported in AIX and whether you are
planning to build rpms for AIX?

As pointed out before, Michael has RPMs for the 3.0 version and he is probably
also working on getting some for 3.1, but you could probably use the README.AIX
and other documentation to generate your own if needed.

Carlo



Re: [Ganglia-general] Deploying a gmetric collection

2008-10-19 Thread Carlo Marcelo Arenas Belon
On Sun, Oct 19, 2008 at 02:46:55PM -0700, Ed Greenberg wrote:

 I looked at the source and see that the 
 GAUGE/COUNTER choice is a Ganglia 3.1 feature. It's not present in the 
 3.0 code.
 
 So I'm going to give 3.1.1 a try.

beware that 3.0 and 3.1 are NOT compatible at the XDR level and therefore
you can't mix 3.0 and 3.1 gmond in the same cluster (as defined by the
multicast or unicast address used for a collector).

for more details on how to upgrade check the release notes in :

  http://ganglia.wiki.sourceforge.net/ganglia_release_notes

Carlo



Re: [Ganglia-general] adding custom graphs to ganglia-web view

2008-09-16 Thread Carlo Marcelo Arenas Belon
On Mon, Sep 15, 2008 at 02:29:37PM -0400, Ofer Inbar wrote:
 
 I can add the names of my custom graphs to get_context.php, and then
 they show up in the menu at the top of the cluster view that lets me
 select what graph to show for each node at the bottom.

why not add them to $optional_graphs in web/conf.php instead?

I agree with you that at the end it might make more sense to treat
all the reporting graphs the same regardless of if they are custom
or part of the core just like metrics are treated now in 3.1

 However, it'd also be really useful to:
 
 1. See the custom graphs relevant to a cluster show up in the cluster
view at the top, summarizing or totaling the whole cluster

agree, but you are going to have to be able to configure which reports
you want to have per cluster (as you will most likely need different
reports in each cluster). otherwise you can just change the template
(web/templates/default/cluster_view.tpl) to show the list of reports
you are interested in for all cluster pages (in that case it is probably
a good idea to create your own template name and refer to it in conf.php)

 2. See those graphs in the grid view, in each cluster's summary

if it is the same list for all clusters it can be done in the template
(web/templates/default/meta_view.tpl). if you want a different list per
cluster then you could probably use the same configuration defined above,
but it could get difficult to handle and to understand, as space is limited
in the grid view per cluster and it is IMHO useful to have all clusters
reporting the same metric vertically so you can see global trends easily.

 3. For each node, have it default to showing the graphs we deem most
relevant to its cluster, in the first section of the node view

OK, you will also need to know how many reports to show to keep the layout
useful, as again the layout will need to adjust dynamically.

 Are there easy ways to do this with the existing ganglia-web package?

no, because there is no per cluster configuration where this information
could be stored yet, but as I mentioned before, tweaking the templates could
help.

 Is anyone working on making this possible?

in trunk there have been changes to the grid view to have the four reports
from cluster view and in the same order (plus adding zoom) as you can see from
BUG184:

  http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=184

as part of that implementation there has been some discussion about doing
something similar to what you proposed, but no code has been produced AFAIK
(CC all probably interested parties in case code was produced but not yet made
public)

 Note: I'll probably send some of my custom metric packages  graphs
 for inclusion in contrib, but I'll be much more motivated to polish
 them up for public consumption if I know people could use them easily.

ganglia development is always open, so forward your patches and suggestions
through bugzilla or the ganglia-developers list, which will be where this
could be accomplished.

Carlo

PS. moving thread to ganglia-developers to followup with development
suggestions and code.  replying might require subscribing to that list first



Re: [Ganglia-general] gmond stops collecting client metrics

2008-09-11 Thread Carlo Marcelo Arenas Belon
On Thu, Sep 11, 2008 at 01:02:10PM -0500, Martin Hicks wrote:
 
 Hi,
 
 ganglia-3.1.1 on x86_64 linux.  I just upgraded my RPMS to 3.1.1 this
 morning to see if there were any fixes for this problem.
 
 I've put up configs, XML and a PNG here:
 
 http://bork.org/~mort/ganglia/
 
 The node where this picture is taken has two blades reporting to it.
 When gmond restarts on this node then the blades no longer seem to
 successfully report metrics.

this is a known problem in 3.1 when using unicast as you have in your
setup (the last bullet point in Important Notes)

  http://ganglia.wiki.sourceforge.net/ganglia_release_notes

to work around it you have to set a non-zero value for
send_metadata_interval on your nodes.

 I went back to gmond-3.0.7 and the problem disappears.

3.0 never had this problem as there is no metadata handling there

Carlo



Re: [Ganglia-general] Help: nobody user problem

2008-09-09 Thread Carlo Marcelo Arenas Belon
On Mon, Sep 08, 2008 at 04:17:20PM -0700, Bernard Li wrote:
 
 On Mon, Sep 8, 2008 at 4:12 PM, Lee Amy [EMAIL PROTECTED] wrote:
 
  Thank you all. My cluster uses Cent OS 4. And your description is really
  clear, thank you very much!
 
 In the future you probably want to hit 'reply-all' when replying so it
 goes to the mailing-list as well.  The email you sent only got to me.

also, for some more interactive help, you might want to use the IRC channel
#ganglia on freenode, as described in :

  http://ganglia.info/?page_id=68

Bernard, one of the core developers, hangs out there frequently, as do
other users/developers of ganglia who might be able to help as well.

Carlo



Re: [Ganglia-general] gmond 3.1 not reporting host data

2008-09-09 Thread Carlo Marcelo Arenas Belon
On Mon, Sep 08, 2008 at 01:42:16PM -0500, Ryan Robertson wrote:
I too am having trouble getting the gmond collector report data of
itself.

I presume, based on the subject, that you are referring to some other report
of ganglia 3.1 not being able to get its own data, but the behaviour observed
is from 3.0 based on the body. could you provide a reference to the original
report, as it might be an unrelated problem?

running `gmond -d10` should generate a log of what is going on that
could help trace the problem, but from the configuration shown below it might
be just an unintended misconfiguration.


I've tried mulitple variations on the gmond.conf, but can't seem
to find a combination that works.  This is on power5 AIX 6.1 running
ganglia-gmond-3.0.7-1. 
 
/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  host = 127.0.0.1
  port = 8649
  ttl = 1
}
 
/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
//  bind = 239.2.11.71
  bind = 10.50.54.31
}
 
/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8649
}

you have to match the mode used (multicast or unicast) for udp_send_channel
and udp_recv_channel. gmond accepts mismatched sets, which results in problems
like the one you are observing (that is a bug), mainly because the flexibility
of the configuration allows for some strange settings that we would otherwise
not be able to predict (like having additional unicast messages sent somewhere
other than a gmond for reporting)

the following should work in your case :

* plain multicast configuration as used by default (you need multicast support
  working for your system and enabled/routed correctly)

  udp_send_channel {
    mcast_join = 239.2.11.71
    port = 8649
    ttl = 1
  }

  udp_recv_channel {
    mcast_join = 239.2.11.71
    port = 8649
  }

* plain unicast configuration through localhost (not what you want in the long
  run as it will use localhost as the node name)

  udp_send_channel {
    host = 127.0.0.1
    port = 8649
  }

  udp_recv_channel {
    port = 8649
  }

* plain unicast configuration through working interface (assuming that
  10.50.54.31 is configured in one of your interfaces)

  udp_send_channel {
    host = 10.50.54.31
    port = 8649
  }

  udp_recv_channel {
    port = 8649
  }
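
The mode-matching rule above can be checked mechanically. The following Python sketch is only an illustration, not part of ganglia: it treats a channel with mcast_join as multicast and any other channel as unicast, then flags a config whose send and receive channels share no mode. The function names are hypothetical.

```python
import re

# Matches the two UDP channel block types and captures their bodies.
BLOCK = re.compile(r"\b(udp_send_channel|udp_recv_channel)\s*\{(.*?)\}", re.S)

def channel_modes(conf_text):
    """Map each UDP channel type to the set of modes its blocks use."""
    # Remove /* ... */ and // comments so commented-out settings are ignored.
    text = re.sub(r"/\*.*?\*/", "", conf_text, flags=re.S)
    text = re.sub(r"//.*", "", text)
    modes = {}
    for name, body in BLOCK.findall(text):
        # Assumption: mcast_join marks multicast, anything else is unicast.
        mode = "multicast" if re.search(r"\bmcast_join\s*=", body) else "unicast"
        modes.setdefault(name, set()).add(mode)
    return modes

def is_mismatched(conf_text):
    """True when send and receive channels exist but share no common mode."""
    modes = channel_modes(conf_text)
    send = modes.get("udp_send_channel", set())
    recv = modes.get("udp_recv_channel", set())
    return bool(send) and bool(recv) and send.isdisjoint(recv)

# The configuration quoted earlier in this message: unicast send, multicast receive.
reported_conf = """
udp_send_channel {
  host = 127.0.0.1
  port = 8649
  ttl = 1
}
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
//  bind = 239.2.11.71
  bind = 10.50.54.31
}
"""

if __name__ == "__main__":
    print(is_mismatched(reported_conf))
```

Because gmond deliberately allows several channels of each type, the sketch only reports a mismatch when no mode is common to both sides.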

Carlo



Re: [Ganglia-general] [ANNOUNCEMENT] Official release of Ganglia 3.1.1

2008-09-09 Thread Carlo Marcelo Arenas Belon
On Tue, Sep 09, 2008 at 01:53:43PM -0400, Ofer Inbar wrote:
 Some questions regarding upgrading:

was going to realign the release_page to make that a little more obvious but
it was a little late to do any major changes to it, which is why it is not
explicitly there.

 1. Can gmond 3.1.1 nodes coexist compatibly in the same cluster with
gmond 3.1.0 nodes?
 
 2. Can a gmetad 3.1.1 use gmond 3.1.0 nodes as data sources?
Can a gmetad 3.1.0 use gmond 3.1.1 nodes as data sources?

yes, 3.1.0 and 3.1.1 are completely compatible and if you are using 3.1.0 you
are encouraged to move to 3.1.1 ASAP (the 3.1.0 package from both Gentoo and
Fedora include some of the critical fixes in 3.1.1 already, for example)

Carlo



Re: [Ganglia-general] gmond 3.1 not reporting host data

2008-09-09 Thread Carlo Marcelo Arenas Belon
On Tue, Sep 09, 2008 at 01:07:10PM -0500, Ryan Robertson wrote:
My goal was to have multiple nodes reporting to a central location
(10.50.54.31) also running gmond and reporting info on itself as well.

then you need all gmond configured with the same cluster name and set up to
use unicast with 10.50.54.31 as the collector.

the collector is the only one that should need a tcp_accept_channel, so that
it can be queried on TCP/8649 (which is what gmetad will do).
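
To see what the collector hands out on that port, a small Python sketch like the following can read the XML report and list the hosts per cluster. It assumes the report uses CLUSTER and HOST elements with NAME attributes. The function names and the second sample address are hypothetical.

```python
import socket
import xml.etree.ElementTree as ET

def fetch_xml(host, port=8649, timeout=10.0):
    """Read the whole XML report from a tcp_accept_channel until it closes."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)

def hosts_by_cluster(xml_bytes):
    """Return {cluster name: [host names]} from a report."""
    root = ET.fromstring(xml_bytes)
    return {
        cluster.get("NAME"): [host.get("NAME") for host in cluster.iter("HOST")]
        for cluster in root.iter("CLUSTER")
    }

# Hypothetical, heavily trimmed report used only to exercise the parser.
sample = (
    b'<GANGLIA_XML VERSION="3.1.1" SOURCE="gmond">'
    b'<CLUSTER NAME="ganglia">'
    b'<HOST NAME="collector" IP="10.50.54.31"/>'
    b'<HOST NAME="node2" IP="10.50.54.32"/>'
    b'</CLUSTER></GANGLIA_XML>'
)

if __name__ == "__main__":
    print(hosts_by_cluster(sample))
```

Against a live collector the call would be something like hosts_by_cluster(fetch_xml("10.50.54.31")). If every sender shows up under the same cluster name, the unicast setup is working.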

To accomplish this, wouldn't I configure the clients that will be sending
data something to this effect:
--
/* Feel free to specify as many udp_send_channels as you like.  Gmond
   used to only support having a single channel */
udp_send_channel {
  host = 10.50.54.31
  port = 8649
  ttl = 1
  mcast_if = en2
}
--

no, you are mixing unicast and multicast in that configuration (which probably
shouldn't be allowed to work but does anyway, as explained before).  you need
instead :

  udp_send_channel {
    host = 10.50.54.31
    port = 8649
  }

The 10.50.54.31 was the original gmond.conf posted.

because this one has to collect all data sent to it from the other nodes, it
also needs :

  udp_recv_channel {
    port = 8649
  }

and because it is the only one that has all the nodes' information and will be
queried by gmetad, it should also have :

  tcp_accept_channel {
    port = 8649
    timeout = -1
  }

I'm assuming the
gmetad server (10.50.54.48) reaches out to the nodes defined in
gmetad.conf for information.

in a unicast configuration it should only point to the collector gmond,
because that is the only one collecting all metrics from the cluster with
something like :

  data_source ganglia 10.50.54.31

where ganglia hopefully describes the use of your cluster and matches what
you configured in your gmond.conf as the cluster name with something like :

  cluster {
    name = ganglia
  }

Is there a method in which they only are
aware of their own information?

not sure what you mean here, but if you use each node as its own collector
then each will only have its own information. by definition they will also
each be their own independent cluster (even if the cluster name is the
same on all of them) and you will have no way to collect them in a single
view in the frontend, because gmetad doesn't aggregate data from multiple
sources.

Lastly, I have a few nodes where HACMP is in place using IP Aliasing on a
single interface.  In these cases, i need to bind gmond.conf to a
particular IP.

if you are using unicast you won't need to do that, if you are using multicast
you will need to stay with 3.0.7 or wait for 3.1.2 to be able to use mcast_if,
or change your routing table to ensure that multicast is routed through the
virtual interface.

also remember that you can't mix 3.0 and 3.1 gmond in the same cluster so
you have to use the same version in all nodes (including the collector)

look at the previous reply with examples and gmond.conf(5) for more details
on configuring ganglia.

Carlo



Re: [Ganglia-general] Help: How to add hosts?

2008-09-07 Thread Carlo Marcelo Arenas Belon
On Sun, Sep 07, 2008 at 01:22:05PM +0800, Lee Amy wrote:
 
How to add hosts to Ganglia system? Foe example, I have 5 nodes, one
master and four slaves, what if I want to see their system performance,
how could I add hosts? In which configuration file?

if you install gmond in all 5 nodes and also gmetad and the web application
in the master node and start them all you should be able to see their system
performance.

it is important that you familiarize yourself with how ganglia works and how
to configure each piece, and for that the following wiki could help (even if
it is a little more oriented at running ganglia on IBM hardware)

  http://www.ibm.com/developerworks/wikis/display/WikiPtype/ganglia

also be sure to read the README and man pages that came with the software
so that you can tune it to your environment

Carlo



Re: [Ganglia-general] Help: APR problem

2008-09-06 Thread Carlo Marcelo Arenas Belon
On Sun, Sep 07, 2008 at 12:02:59PM +0800, Lee Amy wrote:
Hi,
 
I'm a newbie in Ganglia, I've installed apr correctly, the location is
/usr/local/apr by default. But when I run make to compile the Ganglia, it
shows following error messages and terminated.
 
In file included from scoreboard.c:7:
gm_scoreboard.h:4:23: apr_pools.h: No such file or directory
scoreboard.c:8:22: apr_hash.h: No such file or directory
scoreboard.c:9:25: apr_strings.h: No such file or directory
make[2]: *** [scoreboard.lo] Error 1
make[2]: Leaving directory `/root/tmp/ganglia-3.1.0/lib'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/tmp/ganglia-3.1.0'
make: *** [all] Error 2
 
But I check the /usr/local/apr/include directory, the directory contains
the necessary header files.
 
Could you tell me how to fix this problem?

before running `make` you have to configure your source so it will be able to
find all the dependencies that are needed, and when you do that you have to
tell it where you installed APR by using something like :

  $ ./configure --with-libapr=/usr/local/apr/bin/apr-1-config

you might also need the location for other libraries like libconfuse or expat
if they were not already installed.

Carlo
 
Thank you very much~
 
Best Regards,
 
Amy Lee



Re: [Ganglia-general] gmetad bug when gmond host hangs

2008-09-02 Thread Carlo Marcelo Arenas Belon
On Tue, Sep 02, 2008 at 12:54:07PM -0400, Ofer Inbar wrote:
 Brad Nicholes [EMAIL PROTECTED] wrote:
  Thanks Carlo, this is some good feedback.  I know that both Bernard
  and Cos have reported having issues with this bug.  Could either (or
  both) of you independently confirm that this patch fixes the problem?
 
 To reproduce this bug, I'd need a host in a state where it accepts TCP
 connections but then leaves them hung, which is not something I want
 to do on any of my production hosts,

it shouldn't be a problem at all if your failover sources are setup correctly
anyway.

you don't need to crash the machine, but just stop the gmond process by
running something like :

  # kill -STOP `pidof gmond`

to fix it after you are done you can do :

  # kill -CONT `pidof gmond`
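
The reason a stopped gmond is enough to reproduce the bug is that the kernel still completes the TCP handshake while the process never sends any data. The following Python sketch (not gmetad code, and the function name is hypothetical) tells the cases apart by putting a timeout on the first read:

```python
import socket

def probe(host, port, timeout=0.5):
    """Classify a source as 'ok', 'closed', 'hung' or 'unreachable'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            try:
                data = sock.recv(1)
            except socket.timeout:
                # Connected, but no data arrived in time: looks like a stopped gmond.
                return "hung"
            return "ok" if data else "closed"
    except OSError:
        # Refused, unroutable, or the connect itself timed out.
        return "unreachable"

if __name__ == "__main__":
    # Simulate a stopped process: a socket that listens but never accepts or writes.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    print(probe("127.0.0.1", listener.getsockname()[1]))
    listener.close()
```

A check that only looks at whether the connection succeeds would treat the "hung" case as a healthy source.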

you will need a patched gmetad though, but it doesn't need to be the same one
you have in production, even if I'd expect you to roll it out there quickly if
this problem is really a showstopper for your 3.1 production deployment, as
Brad seemed to think.

 If anyone out there on the list has a way to set up a Ganglia
 testing cluster and then deliberately put one of the data sources in
 his state, wanna test out this patch?

that is what I did, but I have to admit that my test environment was tiny as I
only used 1 linux box (my gentoo linux workstation) and 1 windows box (a
windows vista box where I build my windows ganglia binaries) configured
together in one single cluster running 3.1 (the failover source wasn't setup
correctly though as I don't have a way to synchronize the clocks between them
both, and they are in different VLANs and my little linksys switch can't do
multicast routing)

Brad is probably looking for someone else to come up with a more realistic
production-like test, but if no one can do that, I might be able to configure
it by moving around some cables and trying to set up a more realistic failover
scenario (running linux in the windows box) even if that probably defeats the
independent confirmation part of the testing request.

Carlo



Re: [Ganglia-general] gmetad bug when gmond host hangs

2008-09-01 Thread Carlo Marcelo Arenas Belon
On Sat, Aug 30, 2008 at 10:09:02AM -0600, Brad Nicholes wrote:
  On 8/30/2008 at 12:25 AM, in message [EMAIL PROTECTED], Carlo
 Marcelo Arenas Belon [EMAIL PROTECTED] wrote:
  On Fri, Aug 29, 2008 at 02:40:00PM -0400, Ofer Inbar wrote:
  Should this have made it into 3.1, or 3.1.1?  It
  doesn't look like it.
  
  There is a fix in trunk now with r1738 and unless something goes wrong with
  it, will be most likely released with 3.1.2 and 3.0.8.

The proposed backport patch for 3.1.x has been updated in the BUG and
officially requested for inclusion in 3.1. Beware that it includes 1 extra
unrelated change that is needed to prevent future conflicts when backporting
(but that is otherwise mostly irrelevant, especially if making your own
package), as well as additional changes that simplify the logic and avoid a
possible logic failure which could result in gmetad crashing, so using this
newer version is encouraged :

  http://bugzilla.ganglia.info/cgi-bin/bugzilla/attachment.cgi?id=165&action=view

  3.1.1 is already in testing and since this bug is not a showstopper for that
  specific release, I'd be surprised if the release manager decides it should
  be backported to it, but that shouldn't prevent you patching your own 
  package with the proposed patch if you don't want to wait.
 
 If we are confident enough that the patch for this bug solves the problem

I am fairly confident that the patch resolves the problem reported in BUG92
and that matches the description of the problem from Cos, and that is easy
to replicate by just sending a SIGSTOP to gmond; otherwise I wouldn't have
proposed it here, in bugzilla and the STATUS file.

 then I have no problem scraping 3.1.1, tagging and rolling 3.1.2 and
 restarting the testing period.  The whole point of the testing period is
 to flush out problems like this and then determine if the fix is important
 enough to retag and retest.

agree, but doing so will delay releasing the next version of 3.1 and also
indirectly (as I won't start that until 3.1 is out to avoid confusion and
overstressing our limited testing resources) the next release for 3.0.

3.1.1 includes a fix for a showstopper bug (gmetad crash in hierarchical
configuration) and a fix for a high bug (instability with tcpconn.py) which
had been already backported in fedora and gentoo distribution packages, but not
debian, our own packages or anyone using 3.1 from sources (all other
architectures where there are yet no packages available) and that are
therefore waiting for a 3.1.1 release.

 We need some feedback on how serious this problem is

for 3.0 it will require that you also have another problem, like :

* gmond will hang (either because of another gmond bug or because of a
  misconfiguration of gmond that has too many or too slow consumers for
  the number of tcp_accept_channel defined and their configured timeout
  values and that should most likely affect all the sources as well)
* system will hang (either a kernel level deadlock, some security/network guy
  playing games on the cluster or bad hardware)
* a gmond becoming unresponsive because it is collecting data from too many
  nodes or getting swapped out (this should be moved to another system or
  segmented further)

3.1 has the same sources of problems, except that a misconfigured gmond is
more likely to fail because of the increase in the XML payload, but only if
ganglia is misconfigured: your gmond should be queried over the local LAN by a
local gmetad when possible, with hierarchical gmetads used to extend the setup
over WANs. if it absolutely needs to be queried over a WAN, a timeout should
be configured for the tcp_accept_channel (-1, which means no timeout, is
sometimes recommended for extreme cases).

 how confident we are in the fix.

the fix was designed as the minimally intrusive change that will accomplish the
desired objective without reverting to the pre-BUG27 scenario, and I expect it
to evolve further in the near future as there are still open issues that will
need to be addressed around that code (this can get a little technical and
would be a better fit for the ganglia-developers list, but is added here to
complete the explanation under this context; future discussion should be
moved to the ganglia-developers list):

* it is not implemented for gmetad-python yet
* it is not consistently implemented as not all failures will skip a source
  but instead will default to scanning all sources which was the problem that
  BUG27 was meant to fix as that could generate dips.
* it overloads the last_good_index identifier even though that doesn't
  really match the original intention for it, as we haven't yet confirmed that
  the source is working.
* I suspect the dead identifier is misused (which explains the
  never ending failure messages that gmetad logs) and therefore some
  refactoring around that code might be needed (which could include adding
  some other features)
* I suspect

Re: [Ganglia-general] gmetad bug when gmond host hangs

2008-08-30 Thread Carlo Marcelo Arenas Belon
On Sat, Aug 30, 2008 at 11:00:40AM -0400, Ofer Inbar wrote:
 Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote:
   | --- Additional Comment #2 From Timothy Witham 2008-06-05 10:38
   | My patch still loses if we are talking to a gmond affected by Bug#38.
   | In that case, we receive incomplete data, but since it is some data,
   | we keep talking to that host every time.  Maybe we should just talk to
   | a random host every time.  Better to fix Bug#38 though...
  
   That would have the effect of gmetad
   getting incomplete, old data from that gmond, but that seems to be a
   different problem.
  
  yes, and that is why it has a different bug.
 
 What bug is that?

the part with the bug number was snipped out of this email, but will be
reintroduced again below.

 I'm not talking about fixing gmond, I'm talking about having gmetad
 sense when gmond from one meber of the cluster is giving old 
 incomplete data, and trying another cluster member to see if it can
 get newer data.  I didn't see a bug for that, I just saw a note in the
 timeout poll bug mentioned that a solution for it wouldn't handle
 that situation, and saying that's okay, that *should* be separate.

fair, and you are right, I should probably have put a comment there with the
reference to the enhancement bug as well, but since I put it in the email, I
thought it wasn't needed anyway since most interested parties were hopefully
reading this thread or CC'd on the new bug.  It has been updated now anyway.

 Is there indeed a bug, or a plan by someone, to get gmetad to do this?

basically yes: the enhancement request to do some sort of intelligent load
balancing between sources is now being tracked in BUG208

  http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=208

   To solve it in gmetad, we'd want gmetad to have
   some way of judging whether the data it's getting from a gmond is
   fresh and current, which is not the same as judging whether it
   actually *got* the data from the gmond.
 
  the problematic code was introduced as a fix for BUG27 and was
  indeed trying to detect if gmond was able to use the source or not
  by looking at the obvious lack of TCP connectivity.  BUG92 showed
  that the heuristic was incomplete because didn't include gmond/system
  that are hung but still responsive to a TCP three way handshake
 
 That was the *original* discussion on this thread, and that is indeed
 what the timeout poll bug is trying to address.  What I'm saying is
 in reference to Timothy Witham's last note, which is a rather different
 matter.

I was trying to explain how the code evolved and how this problem was
introduced as well as the proposed solutions to hopefully come forward in the
next 2 years of development cycle ;)

Previous to BUG27 (around 2.5.7 or the first releases of 3.0) gmetad did
failover to all sources sequentially until getting valid data. the problem, of
course (as explained in that bug), is that when the first source for your
datasource was down there was an increased delay that resulted in
gaps, as the time needed to get data from that datasource increased
proportionally to the number of sources that were down at the beginning
of the datasource (which can get even worse when some security guy decides
to install a firewall that drops packets when misconfigured)
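
The cost of sequential failover can be illustrated with a short Python sketch. This is not gmetad's code. The names poll, fetch and start are hypothetical, and fetch stands for whatever reads one source and returns None on failure:

```python
def poll(sources, fetch, start=0):
    """Try each source once, beginning at `start` and wrapping around.

    Returns (index, data, attempts); index and data are None if all fail.
    """
    count = len(sources)
    for attempts in range(1, count + 1):
        index = (start + attempts - 1) % count
        data = fetch(sources[index])
        if data is not None:
            return index, data, attempts
    return None, None, count

if __name__ == "__main__":
    down = {"node1", "node2"}                       # the first two sources are dead
    fetch = lambda name: None if name in down else "<xml/>"
    sources = ["node1", "node2", "node3"]

    # Always starting at the top pays for every dead source on every poll.
    print(poll(sources, fetch))                     # 3 attempts
    # Starting from the index that worked last time avoids that cost.
    print(poll(sources, fetch, start=2))            # 1 attempt
```

If every failed attempt costs a full connection timeout, the attempts count translates directly into the delay, and therefore the gaps, described above.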

Timothy's solution to BUG92 somehow reintroduces that failover mechanism
by just choosing the source randomly but still doesn't address the healthcheck
issue which results in sometimes selecting the bad source and therefore
results in more gaps.

BUG208 will hopefully address all the current issues by implementing a real
load balancing solution for sources which also does health checking of some
sort (still to be designed) to ensure that bad sources aren't used and
marked down or in some cases correcting the result (as when accounting for
time drifts between sources)

 So I'm confused by your response - not that anything you say
 is confusing, I'm just confused by how it relates to what you
 responded to?

don't worry, I am used to having people confused by my responses, and
I sometimes get myself into long email threads repeating the same concept
again and again in different ways just because of that.

but as Cassandra said recently the trick is in reading and re-reading the
message until it finally clicks or, as you did, asking for some clarification,
at least until I can figure out how to use punctuation correctly, because
learning English from BASIC left me with some strange aversion to semicolons
and to using dashes inside words.

Carlo


Re: [Ganglia-general] gmetrics not showing up in cluster-context metric list

2008-08-29 Thread Carlo Marcelo Arenas Belon
On Fri, Aug 29, 2008 at 07:43:08PM -0700, Brad Fino wrote:
Gmetric data is being reported.   If I go to the specific nodes context,
any info piped through gmetric is collected and reported correctly.

good, your data is still there and you could get it reported by manually
changing the m parameter in the url to whatever metric you are interested in
looking at.

However, in the cluster_metrics drop down list I can only get Ganglia's
default metrics displayed.

this list is generated dynamically and uses the first host of the cluster
assuming that if it is there it is in all other nodes and that all nodes have
the same set of metrics.

this usually works fine, even if it is definitely buggy and will be a source
of surprises now that 3.1 encourages custom metrics which could be missing
from the list or be in the list and selectable even if the node doesn't have
it (generating a broken link)
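
To find out which metrics the list would miss, the XML report can be compared host by host. The Python sketch below is an illustration only (the frontend itself is not written this way). It assumes HOST elements contain METRIC elements with a NAME attribute, and the function and sample names are hypothetical:

```python
import xml.etree.ElementTree as ET

def metric_names(host):
    """Set of metric names reported for one HOST element."""
    return {metric.get("NAME") for metric in host.iter("METRIC")}

def first_host_vs_rest(xml_text, cluster_name):
    """Return (metrics on the first host, metrics only other hosts have)."""
    root = ET.fromstring(xml_text)
    for cluster in root.iter("CLUSTER"):
        if cluster.get("NAME") == cluster_name:
            hosts = list(cluster.iter("HOST"))
            first = metric_names(hosts[0]) if hosts else set()
            everything = set().union(*(metric_names(h) for h in hosts))
            return first, everything - first
    return set(), set()

# Hypothetical, trimmed report: only the second host reports a custom metric.
sample = """
<GANGLIA_XML>
 <GRID NAME="grid">
  <CLUSTER NAME="web">
   <HOST NAME="a"><METRIC NAME="load_one"/></HOST>
   <HOST NAME="b"><METRIC NAME="load_one"/><METRIC NAME="my_custom"/></HOST>
  </CLUSTER>
 </GRID>
</GANGLIA_XML>
"""

if __name__ == "__main__":
    print(first_host_vs_rest(sample, "web"))
```

Any name in the second set is a metric that exists in the cluster but would not appear in a list built from the first host alone.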

Using 3.0.7 ... this was working for months until I had to reboot the
server and then when gmond and gmetad came back up, my custom gmetrics in
the drop down disappeared.

check why the first host in your cluster output (telnet 127.0.0.1 8651 on your
gmetad) doesn't have all the metrics defined (maybe gmetric is not being
called there anymore after the reboot) or restart your gmond until the first
host has them all.

Any ideas ?

don't reboot your clusters ;)

Carlo



Re: [Ganglia-general] ganglia on Solaris

2008-08-28 Thread Carlo Marcelo Arenas Belon
On Thu, Aug 28, 2008 at 11:57:34AM -0700, W S wrote:
Hi,
   
Does anyone knows where Ganglia reading it's data  from (expecially Disk   
I/O stats) on Solaris? Is it somewhere under /proc or just iostat output?  

it is reading it from the Solaris kernel through a public reporting interface
known as kstat

  http://developers.sun.com/solaris/articles/kstatc.html

reading kernel statistics through /proc is only done in Linux (as that is
the public reporting interface for it) and there is no implementation for IO
metrics as part of the core metrics, even if there are several addons using
gmetric or 3.1 ganglia modules (in C or python) that have been developed to do
so.

Carlo



Re: [Ganglia-general] show hostnames only?

2008-08-27 Thread Carlo Marcelo Arenas Belon
On Tue, Aug 26, 2008 at 10:33:48PM -0400, Jesse Becker wrote:
 On Mon, Aug 25, 2008 at 08:09, Carlo Marcelo Arenas Belon
 [EMAIL PROTECTED] wrote:
  It has been proposed for backporting to 3.1.x--just need one more
  person to sign off on it.
 
  it was proposed by me for backporting Aug 10th and has always my vote
  (only 2 are needed) so if you were OK with your original implementation
  it should had been ready for backport.
 
 Actually, I made several updates to the 3.1.x STATUS file in r1716,
 which you then removed in r1721.

Jesse,

Let me clarify my statement then with some clear timeline to avoid
misunderstandings :

  Jun 23rd: Raised the problem as part of testing a backport proposal that
            reverted the title/subtitle of the graphs to the style used in
            3.0.x to work around very long subtitles that were clobbered
  Jun 24th: You develop r1460 and commit it to trunk
  Jun 24th: I propose some changes and provide a patch with enhancements
            including a backport for 3.1.0
  Jun 24th: Bernard says patch isn't ready and you agreed on working on
            a fix based on some private discussion that wasn't summarized (*)
  Jun 27th: I provide a fix for the only known open issue after a couple of
            days of debate that showed the patch's intention was misunderstood
  Jul 15th: 3.1.0 is released for testing without this patch
  Jul 30th: 3.1.0 officially released
  Aug 10th: I add a backport proposal so it will be hopefully discussed and
            released with 3.1.1, because even if there are any issues (which
            are not yet known or communicated, except for the minor
            formatting issue with a proposed fix since Jun 27th) it is better
            than nothing
  Aug 15th: Brad proposes a freeze and all proposals (including this one) are
            punted (with your explicit approval) to 3.1.2
  Aug 22nd: You vote this proposal for backport (too late for 3.1.1 though)
  Aug 25th: I commit enhancements to it that hopefully cover all issues (it
            does cover all known issues at least, because it avoids the only
            known issue that the Jun 27th proposal was meant to fix and syncs
            the modular graph code with what we have in 3.1 to ease
            backporting) and restarted voting for the enhanced solution
            instead so it will be hopefully part of 3.1.2
  Aug 25th: 3.1.1 is released for testing

that means that as soon as 3.1.1 is released which should be in a couple of
weeks and 3.1.2 is open for development you should be able to commit this
backport if you agree with the currently proposed solution as there is no need
to wait for another vote (as 2 is all that is needed and mine is already there
and was always there).

of course feel free to improve upon it and restart the backporting process if
needed, or to split the proposal into different pieces (one that has already
two votes and another one that has only mine for now and that builds upon it),
or even to throw it all away and come up with a more elegant solution; after
all it is your code and the frontend is your area of expertise and you clearly
understand what issue needs to be solved with it.

count on me to review whatever is proposed at the end so that this fix can get
to our users in some future release as IMHO a bugfix or useful enhancement
like this one (which is also in our wishlist from before 3.1.0) shouldn't be
kept out from them just because of procedural issues.

and before someone else misunderstands that last sentence as some kind of
political statement or ranting let me clarify I don't mean to argue about
procedure here or the validity of it but just that I don't think it should
be used as an argument or justification for not delivering solutions to
the ganglia community as it seemed to be implied by your original comment.

Carlo
  
(*) http://www.mail-archive.com/[EMAIL PROTECTED]/msg04355.html

-
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
___
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general


[Ganglia-general] using an arbitrary UUID instead of FQDN for storing metrics (was Re: show hostnames only?)

2008-08-27 Thread Carlo Marcelo Arenas Belon
On Tue, Aug 26, 2008 at 09:56:21PM -0400, Ofer Inbar wrote:
 
 I see some discussion of moving away from using hostnames as directory
 names to store RRDs in.

I think that is a misunderstanding of the proposed enhancement of making
ganglia a little more resilient to DNS failures or mismatched multihomed
interfaces, which was the original driver behind the original wishlist item.

 I hope very much that doesn't happen!

Agreed, as the current KISS scheme is scalable (to a certain extent), easy
to manage and intuitive, and a key differentiator from other solutions
that weren't as lucky as us to have Matt doing the original design.

If it ever happens it should most likely be considered only as an option,
while the current scheme will be supported forever.

The possibility to implement a different scheme than the current one that
uses an ondisk tree of clusters/nodes is available (even if probably not yet
complete and missing any needed frontend logic to support it) from our
development version (trunk or 3.2.x) through the python gmetad rewrite and
its modular architecture.

 It is very very useful to be able to ls the rrd directories to see what
 hosts gmetad sees, which ones it thinks are in which cluster; to run finds
 or ls's to see the datestamps on RRD files and immediately connect them to
 hosts; and to be able to rename hosts and rename their corresponding
 directories easily.

Agreed, and from what I've seen it scales well to even thousands of nodes if
enough IO is provided, either through very fast disks, multiple mount points,
RAIDs, SAN or even hacky alternatives like a file based loopback device or
ramfs.
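
As a rough illustration of the kind of filesystem inspection discussed above,
here is a minimal Python sketch that walks an on-disk tree of clusters/hosts
and reports the newest RRD datestamp per host. The function names are
hypothetical, and the layout (<rrd root>/<cluster>/<host>/<metric>.rrd) and the
__SummaryInfo__ directory name are assumptions about a typical setup, so adjust
them to whatever your own gmetad actually writes.

```python
import time
from pathlib import Path

# Assumed name of per-cluster summary directories; they are not real hosts.
SUMMARY_DIR = "__SummaryInfo__"


def newest_rrd_per_host(rrd_root):
    """Map (cluster, host) to the newest mtime among that host's *.rrd files.

    Assumes a tree shaped like <rrd_root>/<cluster>/<host>/<metric>.rrd.
    """
    newest = {}
    for cluster_dir in Path(rrd_root).iterdir():
        if not cluster_dir.is_dir() or cluster_dir.name == SUMMARY_DIR:
            continue
        for host_dir in cluster_dir.iterdir():
            if not host_dir.is_dir() or host_dir.name == SUMMARY_DIR:
                continue
            mtimes = [f.stat().st_mtime for f in host_dir.glob("*.rrd")]
            if mtimes:
                newest[(cluster_dir.name, host_dir.name)] = max(mtimes)
    return newest


def stale_hosts(rrd_root, max_age_seconds, now=None):
    """Return sorted (cluster, host) pairs whose newest RRD is too old."""
    now = time.time() if now is None else now
    return sorted(
        key
        for key, mtime in newest_rrd_per_host(rrd_root).items()
        if now - mtime > max_age_seconds
    )
```

For example, stale_hosts("/path/to/rrds", 600) would list the hosts whose RRDs
have not been touched in the last ten minutes, which is roughly what a find
with a time filter gives you from the shell.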

Some people might find it more useful (even if it will be prohibitively more
expensive) to store the metrics in some sort of DBMS and use an SQL interface
instead of simple filesystem commands, or will have problems in their setup
that prevent them from having a unique FQDN, generating conflicts, and so this
is just an option for them.

Don't worry, the core principles for ganglia of Simplicity, Efficiency
and Correctness are still valid; after all, this all started as a way to
monitor HPC clusters on a global scale and we are still aiming for world
domination last time I checked ;).

Carlo


