Re: [Ganglia-general] A survey of Ganglia users and usage.

2007-04-03 Thread Richard.Grevis
Chris, I fully agree with your clean and simple comment. Part of Ganglia's real strength is what it doesn't have, rather than what it does. Examples: - metric data is not written locally on the monitored host - The metric set is fixed in compiled code. - No ability to customise graphs. - No

[Ganglia-general] Ganglia Windows agent binaries.

2007-04-03 Thread Richard.Grevis
FYI. http://www.aouk83.dsl.pipex.com has a link to a cygwin based windows agent (not as an installer package though), and also a link to a WMI native Ganglia agent coded by APR consulting in Switzerland. Enjoy. Richard Grevis Production Architecture Barclays Capital, Canary Wharf, London, E14

[Ganglia-general] A survey of Ganglia users and usage.

2007-04-02 Thread Richard.Grevis
All, Like many Ganglia users, we have modified the PHP a lot, changed some C code a bit, and added a whole lot of functionality by creating scripts of various flavours. I have also entirely failed to push these mods back to the community, and one reason for this is that I have no idea how

Re: [Ganglia-general] A survey of Ganglia users and usage.

2007-04-02 Thread Richard.Grevis
I've been under the impression for a while ganglia wasn't getting a whole lot of development and was mostly in maintenance mode. It hasn't changed a whole lot in the few years I've been using it (except perhaps the config file format, a change that was much appreciated) You are

Re: [Ganglia-general] Not getting something

2007-03-30 Thread Richard.Grevis
Michael, Use different multicast addresses for each cluster, unless you are sure the multicast can't leak from 1 cluster to another. Remember that when you list hosts after the data_source for gmetad.conf that is for resilience only. You do not have to mention all nodes in the cluster there.
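As a rough illustration of the point about data_source (only a couple of hosts need to be listed, purely for resilience), here is a minimal Python sketch that renders a gmetad.conf data_source line. The helper name, the cluster name and the hostnames are hypothetical; port 8649 is simply the port used elsewhere in this archive.

```python
def data_source_line(cluster, hosts, port=8649, poll_interval=None):
    """Render one gmetad.conf data_source line.

    hosts only needs to hold one or two nodes that can answer for the
    whole cluster; extra entries exist for failover, not completeness.
    """
    parts = ["data_source", f'"{cluster}"']
    if poll_interval is not None:
        # The optional polling interval (seconds) goes before the host list.
        parts.append(str(poll_interval))
    parts.extend(f"{host}:{port}" for host in hosts)
    return " ".join(parts)
```

For example, data_source_line("web farm", ["head1", "head2"]) lists two nominated nodes even if the cluster has hundreds of hosts.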

Re: [Ganglia-general] Gmetad and web frontend on different machines.

2007-03-29 Thread Richard.Grevis
Saundry, It sort of looks like you can, but actually you can't. gmetad writes to rrd databases as local files, and the web and php read rrd databases as local files (actually it invokes rrdtool itself). I imagine you could separate the two using NFS filesystems, but I have not tried this. kind

Re: [Ganglia-general] Ganglia custom Round-Robin archives RRA

2007-03-29 Thread Richard.Grevis
You will have to remove the old rrds to allow your new definition to be applied. The RRA is only used at the initial creation of each rrd file. If you want to keep your old data, you will have to do magic (dump/export/import/perl-script) regards, Richard Grevis Production Architecture

Re: [Ganglia-general] gmond getting stuck

2007-01-18 Thread Richard.Grevis
4BB *DDI : +44 (0) 20 7773 4915 * richard.grevis -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Bernard Li Sent: 15 January 2007 21:04 To: ganglia-general@lists.sourceforge.net Subject: [Ganglia-general] gmond getting stuck

Re: [Ganglia-general] Windows port issues

2007-01-09 Thread Richard.Grevis
7773 4915 * richard.grevis -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Vladimir Sent: 08 January 2007 02:22 To: Carlo Marcelo Arenas Belon Cc: ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Windows port issues

Re: [Ganglia-general] Windows port issues

2007-01-08 Thread Richard.Grevis
. The issue here is that only 1 cygwin dll version can run, so all your cygwin based processes must use the same dll. Richard Grevis Production Architecture, at least this week. Barclays Capital, Canary Wharf, London, E14 4BB *DDI : +44 (0) 20 7773 4915 * richard.grevis -Original Message

Re: [Ganglia-general] RRD update errors and timestamps

2007-01-08 Thread Richard.Grevis
) 20 7773 4915 * richard.grevis -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jason Faulkner Sent: 04 January 2007 04:16 To: [EMAIL PROTECTED]; ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general

Re: [Ganglia-general] Windows port issues

2007-01-08 Thread Richard.Grevis
. Richard Grevis Production Architecture Barclays Capital, Canary Wharf, London, E14 4BB *DDI : +44 (0) 20 7773 4915 * richard.grevis -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Martin Knoblauch Sent: 04 January 2007 10:20 To: Vladimir Vuksan

Re: [Ganglia-general] Windows port issues

2007-01-08 Thread Richard.Grevis
4915 * richard.grevis -Original Message- From: Grevis, Richard: IT (LDN) Sent: 08 January 2007 11:52 To: 'Vladimir Vuksan'; ganglia-general@lists.sourceforge.net Subject: RE: [Ganglia-general] Windows port issues Importance: Low Hi, I never understood what waitIO under

Re: [Ganglia-general] [Ganglia-developers] Correct counting of CPUs, Cores, Siblings (bz #84)

2007-01-05 Thread Richard.Grevis
Can I ask whether you will keep the existing semantics of the existing metrics unchanged? I would not be comfortable with my cpu loads (and cpu count) suddenly doubling or halving. Also remember the cygwin agent build, which also reads from cygwin's /proc. kind regards, Richard

Re: [Ganglia-general] Server showing up as IP instead of DNS name

2006-11-20 Thread Richard.Grevis
Sam, I imagine this has already been well answered for you, but the host names you see are the result of a reverse DNS lookup on the headnode, or whatever node you get the XML from. You will get IP addresses if the reverse lookup failed, although the failure is at the headnode level - not from the
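The fallback described here (a hostname when the reverse lookup works, the bare IP address when it fails) can be sketched in Python as below. This is only an illustration of the behaviour, not gmond's actual code; display_name and its resolver parameter are hypothetical names, and socket.gethostbyaddr is the standard library call for a reverse lookup. Run it on the headnode, since that is where the lookup happens.

```python
import socket


def display_name(ip, resolver=socket.gethostbyaddr):
    """Return the reverse-DNS name for ip, or ip itself if the lookup fails."""
    try:
        hostname, _aliases, _addresses = resolver(ip)
        return hostname
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return ip
```

Calling display_name on each node's IP address from the headnode shows which hosts would appear as raw addresses.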

Re: [Ganglia-general] Display the same host in two different clusters

2006-10-27 Thread Richard.Grevis
Yes, Bernard is right. If you have configuration problems I usually recommend first trying a unicast configuration. And the only way you get a node to appear in 2 clusters is to configure the node agent itself to send data to two different headnodes. The above configuration is clunky to say the

Re: [Ganglia-general] how to get machines from different subnets into same cluster

2006-09-25 Thread Richard.Grevis
John, I assume you have configured for multicast and the multicast address you use does not travel outside the local subnets? That is your current situation? option 1 is to make a multicast address on the routers that scopes to all your subnets. option 2 is to unicast to 1 or 2 nominated

Re: [Ganglia-general] Gmetad RRD update question

2006-08-30 Thread Richard.Grevis
Dave, I tried this, and for me last_update gets changed at the same rate as my poll rate, so I don't see what you see. What is your poll rate of the cluster in gmetad.conf? Perhaps you could mail me the RRA lines in gmetad.conf and the full output of rrdtool info. I'm sure you already know

Re: [Ganglia-general] Obtaining Immediate Interval Data From Ganglia

2006-08-14 Thread Richard.Grevis
Absolutely agreed, subsecond makes no sense and the ganglia design is not appropriate for that anyway. I was originally asked to do 5 seconds, but I have increased that to 10 seconds as there was no meaningful change in the shape of the graphs anyway. But 10 second polling is useful to me for a

Re: [Ganglia-general] Obtaining Immediate Interval Data From Ganglia

2006-08-10 Thread Richard.Grevis
Ian, it is the gmetad process which writes the rrd files, not gmond. Are you using rrdtool fetch to get the numbers? If you don't specify an end time, rrdtool will choose now, so it is almost certain you will have some NaNs at the end. What I do is to do a rrdtool last first, then use that value
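A minimal Python sketch of that two-step approach follows. It assumes the value from rrdtool last is used as the end time of the fetch, which is one reading of the truncated message above; the function names, the 300 second window and the AVERAGE consolidation function are illustrative choices. Ending at the last update should reduce trailing NaNs, though it may not remove every one.

```python
import subprocess


def last_update(rrd_path):
    """Ask rrdtool for the UNIX timestamp of the most recent update."""
    out = subprocess.run(
        ["rrdtool", "last", rrd_path],
        check=True, capture_output=True, text=True,
    ).stdout
    return int(out.strip())


def fetch_args(rrd_path, end, window=300, cf="AVERAGE"):
    """Build an rrdtool fetch command that ends at `end` instead of now."""
    return [
        "rrdtool", "fetch", rrd_path, cf,
        "--start", str(end - window),
        "--end", str(end),
    ]
```

Typical use would be subprocess.run(fetch_args(path, last_update(path)), check=True) against one of the rrd files that gmetad writes.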

Re: [Ganglia-general] Obtaining Immediate Interval Data From Ganglia

2006-08-10 Thread Richard.Grevis
If you do want to do fast polling on the Linux or cygwin gmond, I found some hardwired code in there which effectively limits the polling rate for some metrics no matter what you put in the config files. (Sorry martin, have not raised a bug report yet). Anyway: the code below is in the cygwin and

[Ganglia-general] reported time from gmond XML data leaping backwards.

2006-06-21 Thread Richard.Grevis
All, I am observing 2 problems occasionally occurring which may or may not be related. The first is that very rarely the time reported back to gmetad from the gmond XML will leap backwards by maybe a month and a half. Checking the time on the server running the gmond reveals that the server

Re: [Ganglia-general] Windows servers

2006-06-15 Thread Richard.Grevis
Ron, do the following: - choose one of your w2k servers as a headnode. - configure gmetad.conf to have a SINGLE data_source entry pointing to this headnode. - configure gmond.conf on all hosts (including headnode) to have: udp_send_channel { host = headnode-hostname

Re: [Ganglia-general] Ganglia Alert and Tracking

2006-06-12 Thread Richard.Grevis
Ahh yes, aggregating data in different ways after the fact. We had a need to do that, and also a need to provide more than one cluster hierarchy (e.g. clusters grouped by region, but also clusters grouped by technology owner (say)). I have written some perl code to do this - sucking the data out

Re: [Ganglia-general] Problem with Windows gmond

2006-06-12 Thread Richard.Grevis
Ron, gmonds only send UDP data to other gmonds. From your post it is unclear what is listening on 140.203.7.43 port 2344. It should be a gmond. To test a single host, you should configure a udp_send_channel on the w2k server to send data to its proper address or hostname (not 127.0.0.1 which is

Re: [Ganglia-general] New issue with hosts reporting

2006-06-05 Thread Richard.Grevis
Have you checked whether your reverse DNS entries are correct? The ganglia agents use the source address of the UDP packets that are transmitted to do a reverse DNS lookup to yield the hostname seen in the XML. - richard -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL

Re: [Ganglia-general] Questions about hostname and grid names

2006-06-05 Thread Richard.Grevis
Mark, the hostnames that you see in the web interface are the result of reverse DNS lookup of the IP addresses of the hosts in your clusters. You will find the differences there and this is what you have to change. Bear in mind that the host doing the reverse DNS lookup is the headnode for each

RE: [Ganglia-general] very annoying issue with jittery cluster graphs

2006-05-23 Thread Richard.Grevis
John, this may not particularly help you, but on your ganglia server I would try netcatting localhost and checking out TN numbers for a start. e.g. nc localhost 8651 | grep 'HOST NAME' and check out TN values, or maybe just wc the above to see if the data is always coming in properly. Do

RE: [Ganglia-general] Compiling Ganglia on Windows

2006-05-17 Thread Richard.Grevis
Joshua, to the best of my knowledge, gmetad has never been compiled for windows. So you will not be able to run the server code under windows. Gmond and gmetric have been compiled by myself and others in a cygwin environment. There is no windows build documentation. The windows gmond does

RE: [Ganglia-general] Compiling Ganglia on Windows

2006-05-17 Thread Richard.Grevis
RRDtool - look to the site? rrdtool windows binary distributions - see page http://oss.oetiker.ch/rrdtool/download.en.html scroll down to binary distributions, http://oss.oetiker.ch/rrdtool/pub/?M=D and http://www.cacti.net/downloads/rrdtool/win32 Find documentation as required on his site. I am

RE: [Ganglia-general] Newbie question - gmond not returning any metrics

2006-05-16 Thread Richard.Grevis
Steve, it may seem strange, but that is the way gmond behaves. If in all your gmond instances you specify a single unicast headnode, the only place you will get the XML data payload is the headnode. The other nodes dump the DTD and nothing else. If you want to see the data on each of your

RE: [Ganglia-general] Metric pull-down menu not showing all metrics

2006-05-08 Thread Richard.Grevis
I was hoping that someone would do it properly! If I get time today I will get the patch working against 3.0.3 - richard -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Rick Mohr Sent: 04 May 2006 15:08 To: ganglia-general@lists.sourceforge.net Subject:

RE: [Ganglia-general] Metric pull-down menu not showing all metrics

2006-05-04 Thread Richard.Grevis
Ben, As you probably already know, the code is in header.php - if( $context == "cluster" ) { if (!count($metrics)) { echo "<h4>Cannot find any metrics for selected cluster \"$clustername\", exiting.</h4>\n"; echo "Check ganglia XML tree (telnet $ganglia_ip $ganglia_port)\n"; exit;

RE: [Ganglia-general] 2 clusters same subnet

2006-04-28 Thread Richard.Grevis
Chris, with unicast, the cluster derives its name from the head-node configuration only. By head-node, I mean the nodes that appear in the gmetad configuration as you have detailed below. So in your case, if you have two separate head-nodes for each cluster, there is no need to use different

RE: [Ganglia-general] Ganglia reporting all nodes down...except 1

2006-04-20 Thread Richard.Grevis
I agree with Steve, Chris, I would suggest that (at least until you are more confident) that you change the gmond.conf config to be unicast to your headnode, i.e: udp_send_channel { #mcast_join = 239.255.160.2 host = your-headnode-hostname port = 8649 } And also confirm all seems OK by

RE: [Ganglia-general] Nodes Reported as Dead

2006-04-19 Thread Richard.Grevis
Chris, 3.0 gmond? This version of the agent will have the truncated XML problem, although I have only seen no element found errors on parse as opposed to what chris is seeing - which sounds like perhaps a partially constructed XML tree in gmetad memory which then blows up the subsequent? Chris

RE: [Ganglia-general] Nodes Reported as Dead

2006-04-18 Thread Richard.Grevis
Chris, possibility 1 - look for "Possible bug in hosts up calculation when federating clusters" in the mail archive. But if you are using the 3.0.3 release this should be fixed. The reason that the XML stream version affects hosts up is because the test of liveness changed between before 2.5, and

RE: [Ganglia-general] gmetad not updating RRD's/hosts that are proper in gmond XML

2006-03-30 Thread Richard.Grevis
This is the classic behaviour that comes from a truncated XML stream. There is now a full patch for this in the CVS repository. But if you suffer from this, you should get a /var/log/messages entry like: Mar 30 09:56:37 ldndsr0163 /apps/ganglia/sbin/gmetad[15336]: Process XML (LDN FIP QA

RE: [Ganglia-general] gmond unreliable on one cluster, must be constantly restarted

2006-03-30 Thread Richard.Grevis
There are a few simple and obvious steps. BTW, it is good that TN is greater than TMAX in some sense, because this means that gmetad and all the php stuff is not saying anything that is wrong with respect to the XML stream. So have you done a simple tcpdump of UDP port 8649 on the headnode? Do the UDP

[Ganglia-general] A script that checks clusters for down and duplicated hosts in clusters

2006-03-30 Thread Richard.Grevis
Fresh off the presses - others may find it useful too. This iterates through your clusters and finds dead hosts or duplicated host entries. Note that you can't find duplicated host entries by netcatting gmetad port 8651. You must do it as below: You will need to compile or otherwise have netcat

RE: [Ganglia-general] gmond stops recognizing the rest of the cluster

2006-03-22 Thread Richard.Grevis
Steven, if the problem is routing or actual packet loss, then that should be reflected by the XML output of the master gmond - the down host will have a TN (much) greater than the TMAX. e.g.: <HOST NAME="ldndsm03185.intranet.barcapint.com" IP="10.68.90.10" REPORTED="1143022788" TN="145" TMAX="20"
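The TN versus TMAX check can be automated with a short Python sketch like the one below. It is not part of Ganglia: read_xml and stale_hosts are hypothetical helper names, the default port 8649 is taken from the configurations quoted elsewhere in this archive, and the factor parameter is an assumption that lets you decide how much greater than TMAX counts as "much greater".

```python
import socket
import xml.etree.ElementTree as ET


def read_xml(host="localhost", port=8649, timeout=10):
    """Read the complete XML dump that a gmond serves on its TCP port."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # the daemon closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks)


def stale_hosts(xml_data, factor=1.0):
    """Return (name, tn, tmax) for each HOST whose TN exceeds factor * TMAX."""
    root = ET.fromstring(xml_data)
    stale = []
    for host in root.iter("HOST"):
        tn = int(host.get("TN", 0))
        tmax = int(host.get("TMAX", 0))
        if tmax and tn > factor * tmax:
            stale.append((host.get("NAME"), tn, tmax))
    return stale
```

With the values in the example above (TN=145, TMAX=20), the host is reported for any factor below 7.25; stale_hosts(read_xml("your-headnode-hostname")) would list every such host on that headnode.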

[Ganglia-general] Early termination of XML stream from a windows based agent.

2006-02-10 Thread Richard.Grevis
All, I just know that no-one else is doing this, but I updated the windows gmond with a current cygwin install and fixed the processor count metric. That is all I did. Simple recompile, slightly newer cygwin1.dll However when I used this agent, when gmetad did the tcp poll, instead of the

RE: [Ganglia-general] config file confusion

2006-02-09 Thread Richard.Grevis
Exactly. ~Richard -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jason A. Smith Sent: 09 February 2006 18:04 To: [EMAIL PROTECTED] Cc: Ganglia General Subject: Re: [Ganglia-general] config file confusion Just a guess, you probably have all 500 nodes

RE: [Ganglia-general] gmetrics in cluster/grid view

2006-01-31 Thread Richard.Grevis
The php, being what it is, kind of encourages everyone to do their own thing. The problem is which changes are appropriate for the whole community, and which are appropriate to only a few. The second problem is an engineering one. Hacks are easy - it is usually what you do first. Generating a

RE: [Ganglia-general] Pointers on architecting a large scale ganglia setup??

2006-01-27 Thread Richard.Grevis
My experience so far: RRD files on ramdisk is a good idea. RRD is very basic with its I/O, it writes as soon as it gets a data point (and reads as well). In my case, a simple blade engineering server with simple local disk was really being hammered with 100 nodes, except at a 5 second poll, with

RE: [Ganglia-general] intermittent blanks in graphs

2006-01-25 Thread Richard.Grevis
There is another way this failure can occur, although it is unlikely (it happened to me though). gmond appears to do a reverse IP lookup of the udp packets' source address to generate the hostname in the XML. We had an error in the reverse DNS, and 2 separate hosts in the cluster ended up having

RE: [Ganglia-general] intermittent blanks in graphs

2006-01-25 Thread Richard.Grevis
Call me old fashioned, but: who | wc -l | awk '{print $1}' strikes me as safer regards, richard -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Martin Knoblauch Sent: 25 January 2006 09:25 To: Ben Hartshorne; ganglia-general@lists.sourceforge.net

[Ganglia-general] Solaris 8 gmond and gmetric.

2005-12-05 Thread Richard.Grevis
Does anyone have compiled solaris 8 binaries they can send me? My efforts to compile ganglia on solaris 8 with gcc did not get very far. Not too sure why but if someone can send me the binaries, that would make me a happy puppy. kind regards, Richard ps, if interested: Making all in examples

RE: [Ganglia-general] windows gmond client

2005-11-10 Thread Richard.Grevis
Exactly. I should have been clearer. The default windows/cygwin client is neither correct enough (cygwin's fault) nor provides all the metrics we want (in fact, because some of our farms are not just HPC farms, we want some other metrics as well). I remain grateful to whoever developed it,

[Ganglia-general] windows gmond client

2005-11-07 Thread Richard.Grevis
All, against all probability, but for reasonable historical reasons, we run windows based HPC applications. We also have large networks of similar function windows farms (e.g. web farms). We want to improve the visibility of the state of our estate, we like ganglia (rollups and all that), and