Re: [Ganglia-developers] Re: [Ganglia-general] Ganglia issues I've been experiencing
--- Matt Massie [EMAIL PROTECTED] wrote:

actually. i just updated gmetad to allow custom RRAs to be defined. i just dropped the code into CVS so if you use the CVS code (which will be released as 3.0.1 very soon)... you can specify

RRAs RRA:AVERAGE:0.5:1:240 \
     RRA:AVERAGE:0.5:24:240 \
     RRA:AVERAGE:0.5:168:240 \
     RRA:AVERAGE:0.5:672:240 \
     RRA:AVERAGE:0.5:5760:370

in gmetad.conf to alter the round-robin archive format. this was a simple feature to add and i know it's in big demand ... no sense waiting until later to add it. forget everything that i wrote below... just use CVS for now or wait for 3.0.1. :) -matt

Matt, I assume above settings are what we are using today?

Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
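For orientation, the time span each of the five archives in Matt's mail covers follows from the last two RRA fields (steps per row, number of rows). The sketch below is only an illustration: the 15-second base step is an assumption not stated in the thread (check your own files with `rrdtool info`), and the helper name `rra_retention` is made up.

```python
from datetime import timedelta

# Assumption: the RRDs are created with a 15 second base step.
STEP_SECONDS = 15

# The five archive definitions quoted in the mail above.
RRAS = [
    "RRA:AVERAGE:0.5:1:240",
    "RRA:AVERAGE:0.5:24:240",
    "RRA:AVERAGE:0.5:168:240",
    "RRA:AVERAGE:0.5:672:240",
    "RRA:AVERAGE:0.5:5760:370",
]


def rra_retention(rra, step_seconds=STEP_SECONDS):
    """Return (seconds covered by one row, total time span) for one RRA."""
    _tag, _cf, _xff, steps, rows = rra.split(":")
    per_row = int(steps) * step_seconds
    return per_row, timedelta(seconds=per_row * int(rows))


for rra in RRAS:
    per_row, span = rra_retention(rra)
    print(f"{rra:26s} one row per {per_row:6d} s, covers {span}")
```

Under that assumption the archives cover roughly an hour, a day, a week, four weeks and a bit over a year; with a different base step all spans scale accordingly.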
[Ganglia-general] New ganglia-3.0.2 snapshot
Hi,

the hopefully last snapshot of ganglia-3.0.2 has been uploaded to http://www.knobisoft.de/ganglia/ganglia-3.0.2.200511021403.tar.gz

Please test, especially on non-Linux-ia32 platforms. If no serious regressions show up, this could be 3.0.2.

Changes compared to 3.0.1 are:

Changes since 20051018:
- More compile fixes for MacOS/Tiger
- Fix references to www.rrdtools.org
- Fix umask screwup for new --pid-file option

Changes since 20051004:
- Bugzilla 72: --pid-file for gmetad/gmond
- Bugzilla 70: Fix Debian /dev2/ weirdness
- Bugzilla 68: gmond now honors location via commandline and config file.
- Bugzilla 49: Cleanup php for web-frontend
- Bugzilla 27: Let gmetad reconnect to the last good source
- New AIX metrics code

More changes:
- Fix 64-bit core-dumps for disk metrics on Linux
- Fix lots of compile time warnings
- ...

Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Gmetric Repository offline?
Hi Ken, I am afraid that we are victim of the MySQL changes at SF - which we apparently ignored :-( Matt, could you contact them and ask about moving our stuff? Cheers Martin --- Kenneth Young [EMAIL PROTECTED] wrote: Hi all, Opening browser to http://ganglia.info/gmetric/ returns the following error. Is the page temporarily offline? *Warning*: mysql_connect(): Can't connect to MySQL server on 'mysql' (113) in */home/groups/g/ga/ganglia/htdocs/gmetric/header.php* on line *6* Could not connect to database Ken Young --- SF.Net email is sponsored by: Tame your development challenges with Apache's Geronimo App Server. Download it for free - -and be entered to win a 42 plasma tv or your very own Sony(tm)PSP. Click here to play: http://sourceforge.net/geronimo.php ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] TCP/IP Bad Data
Hi,

hmm. interesting. Was Bytes-Out the only metric showing problems at that time? What about Packets-Out? Loss of metric data is not unheard of, but only one metric being affected is strange. What platform and version (gmond, gmetad, web-frontend) are you running?

Cheers Martin

--- G. Francisco Perin [EMAIL PROTECTED] wrote:

I am having a strange issue with ganglia reporting (not reporting) network traffic on a high volume web site. The graphs are reporting points of zero (0) data when SAR data is not showing the same information. It's disconcerting because if Ganglia is reporting good information then I have a problem. But all other indications are that things are fine. Here's an example of the SAR data for the interface:

Time      Rx      Tx
05:14:00  845.78   945.85
05:15:00  722.51   816.59
05:16:00  752.33   840.65
05:17:00  796.62   886.42
05:18:00  888.71   990.34
05:19:00  802.17   891.59
05:20:00  760.13   851.22
05:21:00  797.95   908.55
05:22:00  909.58  1009.56

And attached is a graph from ganglia during the same time period. Notice there is a big drop in RX between 5:15-5:20? Any ideas what might be causing the chart to do this? Do you think I am looking at a real problem or something with ganglia reporting?

-- cp

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] problem with SummaryInfo
Hi Branimir,

those servers look great. What are they? :-) Anyway, could you please post the two different gmond.conf files and the gmetad.conf file? I have the impression that the machines in the two groups do not see each other. At least one machine in each group should see the metrics of its partner machines. In gmetad.conf you would use that machine as data source. Basically you should only have two data sources in your gmetad.conf.

Simple test: log into one of the servers and do a telnet localhost gmond-port. It should show you the data of all hosts in that group (grep for HOST NAME). If it only shows its own data you have found the problem.

Cheers Martin

--- Branimir Ackovic [EMAIL PROTECTED] wrote:

Hi, I configured Ganglia 3.0.1 to monitor a Grid site with 4 servers and 8 nodes. I put them in two groups: AEGIS01-PHY-SCL Core Services and AEGIS01-PHY-SCL. There is a problem with the summary report. I see only one node in each of these sources. I also have a problem with the grid summary because it uses the source summary. You can see it on: http://se.phy.bg.ac.yu/site/ganglia How can I configure Ganglia to see everything properly?

All servers have in /etc/gmond.conf:

cluster { name = AEGIS01-PHY-SCL Core Services }

and all nodes have in /etc/gmond.conf:

cluster { name = AEGIS01-PHY-SCL }

There is a gmetad and web frontend on one of the servers (se.phy.bg.ac.yu/site/ganglia). In /etc/gmetad.conf I put:

data_source AEGIS01-PHY-SCL Core Services1 ce.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL Core Services2 se.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL Core Services3 grid.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL Core Services4 rb.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL1 wn01.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL2 wn02.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL3 wn03.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL4 wn04.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL5 wn05.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL6 wn06.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL7 wn07.phy.bg.ac.yu
data_source AEGIS01-PHY-SCL8 wn08.phy.bg.ac.yu

and

gridname AEGIS01 PHY SCL

- Branimir Ackovic E-mail: [EMAIL PROTECTED] Web: http://scl.phy.bg.ac.yu/ Phone: +381 11 3160260, Ext. 152 Fax: +381 11 3162190 Scientific Computing Laboratory Institute of Physics, Belgrade Serbia and Montenegro

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
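The "telnet localhost gmond-port, grep for HOST NAME" test from Martin's reply can also be scripted. The following is a rough sketch, not part of Ganglia: the function names are made up, port 8649 is the one used elsewhere in these threads, and the sample XML is a hand-written stand-in that only carries the attributes the sketch needs.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host="localhost", port=8649, timeout=5.0):
    """Read the XML dump gmond writes to its TCP port (what telnet shows)."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # the peer closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def host_names(xml_bytes):
    """Return {cluster name: [host names]} found in a gmond XML dump."""
    root = ET.fromstring(xml_bytes)
    return {
        cluster.get("NAME"): [host.get("NAME") for host in cluster.iter("HOST")]
        for cluster in root.iter("CLUSTER")
    }


# Hand-written sample for illustration only; real output has many more fields.
SAMPLE = b"""<GANGLIA_XML VERSION="3.0.1" SOURCE="gmond">
<CLUSTER NAME="AEGIS01-PHY-SCL">
<HOST NAME="wn01.phy.bg.ac.yu"/>
<HOST NAME="wn02.phy.bg.ac.yu"/>
</CLUSTER>
</GANGLIA_XML>"""

for cluster, hosts in host_names(SAMPLE).items():
    print(f"{cluster}: {len(hosts)} host(s): {', '.join(hosts)}")
```

Against a live node one would call `host_names(fetch_gmond_xml("localhost", 8649))`; if the result lists only the node itself, the gmonds in that group are not seeing each other.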
Re: [Ganglia-general] problem with SummaryInfo
Hi Branimir,

apparently Rick pushed you in the right direction already :-) Just a few comments.

Martin

--- Branimir Ackovic [EMAIL PROTECTED] wrote:

Thank you Rick and Martin for the quick response! I already tried the configuration that Rick suggested, but it doesn't work. In that configuration I see only one node per data_source (the last one). One week ago, Michael Chang helped me to solve the problem with this configuration:

data_source AEGIS01-PHY-SCL1 147.91.83.201
data_source AEGIS01-PHY-SCL2 147.91.83.202
data_source AEGIS01-PHY-SCL3 147.91.83.203
...

If I understand correctly, Martin suggests that I need two machines with gmetad (one for each data_source). Now I have gmetad only on the server with the web frontend (se.phy.bg.ac.yu).

That is totally fine. You only need one gmetad running. Your problem was that the nodes *within* your two *clusters* did not communicate correctly. Michael Chang's setup allowed you to query each node individually, but you lost the cluster concept that way.

It is true that the machines in the two groups do not see each other. Even in the same group. I tried:

[EMAIL PROTECTED] root]# telnet localhost 8649 | grep grid
Connection closed by foreign host.
[EMAIL PROTECTED] root]#

Both machines ce and grid are in the same data_source with the same gmond.conf files. As you said, Martin, I found the problem, but I have not found a solution for it. :(

That was the most important step :-). Your gmond.conf files look like a multicast setup, but apparently something went wrong. Possible causes:
- no route for the multicast IP
- your switch does not like IGMP
- also, both of your clusters were talking on the same port. This can be a problem with multicast.

So, going unicast is the right way to go in my opinion. Advantages are:
- your networking infrastructure will not screw you up
- less network traffic. In a working multicast network you will have N*N messages going around. In a large cluster that can be a lot of traffic just for Ganglia.
Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
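To put rough numbers on the N*N remark: in a deliberately simplified model where every node announces once per round, multicast delivers each announcement to all N group members, while unicast delivers it only to the R designated receivers. The functions below are illustrative names, and real traffic also depends on metric thresholds and send intervals that this sketch ignores.

```python
def multicast_deliveries(n_nodes):
    """Each of N senders is heard by all N group members: N * N."""
    return n_nodes * n_nodes


def unicast_deliveries(n_nodes, n_receivers):
    """Each of N senders addresses only the R receiving gmonds: N * R."""
    return n_nodes * n_receivers


print("nodes  multicast  unicast(2 receivers)")
for n in (8, 64, 512):
    print(f"{n:5d}  {multicast_deliveries(n):9d}  {unicast_deliveries(n, 2):8d}")
```

For the 8-node group in this thread the difference is small; it only becomes significant for clusters with hundreds of nodes.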
Re: [Ganglia-general] windows gmond client
Hi Richard,

--- [EMAIL PROTECTED] wrote:

All, against all probability, but for reasonable historical reasons, we run windows based HPC applications.

What kind of HPC stuff is a financial institution running? Just curious :-)

If we were to cheat, and create a windows agent that only produced the XML via the tcp interface, and not the udp niceness, can anyone give me an idea of how this will scale? This obviously moves more work to gmetad. Will gmetad poop with 5 data sources, 100?

I do not know the Cygwin implementation at all, but what is wrong with using the unicast TCP setup? Just select one or two nodes per *cluster* to run gmond in TCP receive mode and let all other nodes send data to them. Use the selected node(s) as data source for gmetad. That gives much better network usage compared to the multicast mode, where traffic goes up with N*N. And you don't have to worry about switches blocking IGMP traffic. 5 data sources should be no problem for gmetad. I have no idea about 100 or more.

Can someone suggest something clever to get windows nodes producing ganglia data in a lightweight way?

This likely needs a native client.

Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] gmond configuration question
Hi Prakash,

basically what you describe is the expected behaviour. Without the extra routing information, the multicast packet will be sent through the default gateway interface, which is eth0 for all three groups. As a result group 2 and 3 end up disconnected from group 1. You should use a different extra route though. Do a:

% route add -host 239.2.11.71 dev eth1

That will keep the default routes for group 2 and 3, but all packets to the gmond multicast group will go through eth1. This is, btw., in the FAQ.

Another solution is to drop multicast and move to unicast communication. Select one or two of your nodes as gmond receivers and have the following directives in gmond.conf (you need 3.0.1 for that):

udp_send_channel {
  host = 192.168.2.X
  port = 8649
}
udp_send_channel {
  host = 192.168.2.Y
  port = 8649
}
udp_recv_channel {
  port = 8649
}

The nodes X and Y will then have all information from the other nodes and can be used as redundant data sources for gmetad.

Hope this helps
Martin

--- Prakash Velayutham [EMAIL PROTECTED] wrote:

Hi, Could someone explain how my configuration directives should be with the following setup?

Total of 18 compute nodes:
- 5 compute nodes with eth0 connected to a switch (192.168.2.* network)
- 6 compute nodes with eth1 connected to this switch (192.168.2.* network) and eth0 connected to a different switch (10.1.21.* network)
- 7 compute nodes with eth1 connected to this switch (192.168.2.* network) and eth0 connected to a different switch (10.1.74.* network)

The routing table on each group of nodes looks like this:

Group 1:
Destination   Gateway        Genmask        Flags Metric Ref Use Iface
192.168.2.0   0.0.0.0        255.255.255.0  U     0      0   0   eth0
169.254.0.0   0.0.0.0        255.255.0.0    U     0      0   0   eth0
127.0.0.0     0.0.0.0        255.0.0.0      U     0      0   0   lo
0.0.0.0       192.168.2.254  0.0.0.0        UG    0      0   0   eth0

Group 2:
Destination   Gateway        Genmask        Flags Metric Ref Use Iface
192.168.2.0   0.0.0.0        255.255.255.0  U     0      0   0   eth1
10.1.21.0     0.0.0.0        255.255.255.0  U     0      0   0   eth0
169.254.0.0   0.0.0.0        255.255.0.0    U     0      0   0   eth0
127.0.0.0     0.0.0.0        255.0.0.0      U     0      0   0   lo
0.0.0.0       10.1.21.1      0.0.0.0        UG    0      0   0   eth0

Group 3:
Destination   Gateway        Genmask        Flags Metric Ref Use Iface
10.1.74.0     0.0.0.0        255.255.255.0  U     0      0   0   eth0
192.168.2.0   0.0.0.0        255.255.255.0  U     0      0   0   eth1
169.254.0.0   0.0.0.0        255.255.0.0    U     0      0   0   eth0
127.0.0.0     0.0.0.0        255.0.0.0      U     0      0   0   lo
0.0.0.0       10.1.74.1      0.0.0.0        UG    0      0   0   eth0

When I set the default configuration for all the nodes (without a mcast_if directive), each of these groups of nodes only shows up within its subnet, so the collection agent only sees one group of nodes (depending on which node is first in the data_source line for that cluster). Later I set the configuration for the first group as default and changed the configuration for the rest of the nodes by adding an mcast_if eth1 to the udp_send_channel and udp_recv_channel groups, but still the result is the same. I get the desired result of all nodes multicasting to all the other nodes only when I add the following route to the tables of the nodes in group 2 and group 3. Is there a reason why, and is there a way around it? If I make this change to the routing table, I lose the ability to log in directly to a node.

0.0.0.0       192.168.2.254  0.0.0.0        UG    0      0   0   eth1

Hoping to get an answer to this rather intriguing issue.

Thanks, Prakash

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
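Why the multicast packets leave through eth0 on the group 2 nodes, and why the host route from Martin's reply fixes it, can be illustrated with a small longest-prefix-match sketch over the Group 2 table. This is a simplification (the kernel also considers metrics and other details), and `pick_interface` is a made-up helper, not a system call.

```python
import ipaddress


def pick_interface(routes, destination):
    """Longest-prefix match: the most specific route containing the address wins."""
    dest = ipaddress.ip_address(destination)
    matches = [(net, dev) for net, dev in routes if dest in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Group 2 routing table from the mail above, as (network, interface) pairs.
GROUP2 = [
    (ipaddress.ip_network("192.168.2.0/24"), "eth1"),
    (ipaddress.ip_network("10.1.21.0/24"), "eth0"),
    (ipaddress.ip_network("169.254.0.0/16"), "eth0"),
    (ipaddress.ip_network("127.0.0.0/8"), "lo"),
    (ipaddress.ip_network("0.0.0.0/0"), "eth0"),  # default route
]

MCAST = "239.2.11.71"  # gmond multicast address used in this thread

# Without an extra route only the default route matches.
print("before:", pick_interface(GROUP2, MCAST))

# Equivalent of: route add -host 239.2.11.71 dev eth1
with_host_route = GROUP2 + [(ipaddress.ip_network("239.2.11.71/32"), "eth1")]
print("after: ", pick_interface(with_host_route, MCAST))

# Other traffic keeps using its original route, so logins still work.
print("login: ", pick_interface(with_host_route, "10.1.21.5"))
```

The host route only affects the one multicast address, which is why it avoids the side effect Prakash saw when replacing the default route.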
Re: [Ganglia-general] gmond configuration question
Hi Prakash,

please send gmond.conf (you are using the same file for all three groups, I suppose) and gmetad.conf. Likely something simple. Unicast is usually pretty simple to set up. Are you using version 3.0.1? 3.0.0 had some problems.

Cheers Martin

--- Prakash Velayutham [EMAIL PROTECTED] wrote:

For some reason, only the route solution works for me. The unicast packets do not seem to reach the collection agent in the first group of nodes. The route solution works though, giving some relief.

Thanks, Prakash

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
[Ganglia-general] 3.0.2 is released
Hi,

this is to notify you of the release of Ganglia 3.0.2. Below is the description from SF. The homepage still needs to be updated, but you can download the tarball. Hopefully RPMs for some platforms will follow soon. If you find bugs with 3.0.2, please continue to use the bugzilla service at: http://bugzilla.ganglia.info/

Cheers Martin

-

The Ganglia Development Team is pleased to announce the release of Ganglia 3.0.2 (Wilbur) which is available for immediate download from http://ganglia.info/downloads.php. This release mainly fixes bugs. For a detailed description of the changes see the Changelog included in the tarball. Some of the highlights are:

- New AIX metrics code
- NetBSD support
- --pid-file option for gmond and gmetad
- Old gmond location statements are now handled correctly
- gmond --location now works correctly
- Compile fixes for MacOS Tiger
- Gmond no longer core-dumps on 64-bit Linux platforms
- cpu_wio is now reported correctly
- PHP fixes in the web-frontend
- ...

The following Bugzilla entries are addressed: 27, 49, 54, 62, 63, 68, 70, 72.

This release has been tested on the following platforms:

- Fedora FC4 / ia32
- SuSe 9.0 / x86_64
- RHEL3 / ia64
- Mac OS Tiger
- Solaris 2.8 / Sparc-64
- AIX 5.2, 5.3

Enjoy
The release team

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
RE: [Ganglia-general] windows gmond client
--- [EMAIL PROTECTED] wrote:

Exactly. I should have been clearer. The default windows/cygwin client is neither correct enough (cygwin's fault) nor provides all the metrics we want (in fact, because some of our farms are not just HPC farms, we want some other metrics as well). I remain grateful to whoever developed it, nonetheless.

As I said, I do not know the Cygwin client very well.

I am not a windows man, but we are looking at the possibility of developing a fully native (no cygwin) client ourselves. The reason for the TCP question is that my feeling was that it would be much easier to produce a native first pass windows gmond client delivering TCP only, rather than all that clever UDP stuff as well.

Not really liking Windows myself, I believe the contribution of a native client would be very welcome.

But of course with the TCP route, I have fears of scaling. But there is a GEM in Martin's reply (and a Doh moment for me), in that I assumed that every node would have to be polled by a gmetad to get the cluster info. But you remind me this is not so, I can do the structural equivalent of the udp unicast to a head node using TCP to a head node, that gmetad then interrogates. Have I got this right guys?

Unfortunately I think the answer is no. I made the mistake of somehow associating gmond unicast with TCP, which is wrong. Communication between the gmonds in a host group is always UDP. One or more clients listen, while all push their data out (either multicast or unicast). But you are right that gmetad only needs to communicate with the heads of the host groups. This communication is TCP.

And the other thing for the community is asking whether anyone else out there is considering developing a native windows gmond.

not me :-)

Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Fwd: Solaris-first page works-selecting drop down buttons fails
--- michael chang [EMAIL PROTECTED] wrote: Because expat has no check target ? :-) Question is how to fix that. expat is one of the external packages and I do not want to mess with it if possible. Maybe ask if upstream will accept a blank check target, have a proper set of checks, or put in a ganglia-specific patch that returns true on a check call, I suppose... it is even simpler :-) The expat project has a check target since 1.95.2. Our version is just very old. I have checked in a dummy target to make the process happy. Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
RE: [Ganglia-general] Gmetad and rrd
--- [EMAIL PROTECTED] wrote:

So then you're implying that gmetad has the intelligence to create its own rrd databases?

Not sure about the intelligence of gmetad (or any other computer program :-), but it will create the rrds on the fly. There are only two requirements:
- the root of the rrds tree (defined in gmetad.conf) has to exist
- gmetad needs write permission to it

Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
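The two requirements above are easy to check before starting gmetad. A minimal sketch follows; the path is an assumption (use whatever root your gmetad.conf defines), the function name is invented, and the check applies to the user running the script, so it should be run as the account gmetad runs under.

```python
import os


def rrd_root_problems(path):
    """Return a list of reasons why RRDs could not be created under path."""
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{path} does not exist or is not a directory")
    elif not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path} is not writable by the current user")
    return problems


# Assumed location; replace with the root configured in your gmetad.conf.
RRD_ROOT = "/var/lib/ganglia/rrds"

issues = rrd_root_problems(RRD_ROOT)
print("ok" if not issues else "\n".join(issues))
```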
Re: [Ganglia-general] Grid with over 4TB mem
Hi Alex, do the gmond's themselves report the correct size for each host? Where do you get the problem? In gmetad, or in the webfrontend? Those are separate pieces of software. Cheers Martin --- Alex Balk [EMAIL PROTECTED] wrote: Hi all, I'm running Ganglia on a grid with thousands of hosts, each with at least 4GB RAM installed. Since aggregating the total amount of RAM exceeds 4294967296 (the max value of uint32), I'm getting incorrect data on the total memory in the grid. I've peeked at gmond and gmetad code and adding a uint64 doesn't seem trivial. I've also searched the Net for anyone that's possibly encountered this before and came up only with this: http://sourceforge.net/mailarchive/forum.php?forum_id=9584max_rows=25style=nestedviewmonth=200312 It mentions that Ganglia 3.x should solve the problem. I'm running 3.0.2, but still experiencing it. I've also failed to find any method for changing collection of memory metrics from KB to MB, other than modifying the source code, which I'd rather avoid so as to keep as close to the original tree as possible. My questions are: 1. Is there really a solution in Ganglia 3.x and if so, what is it? 2. If not, are you aware of anyone who's implemented such a solution or documented the work needed, so I may carry on from there? Note that the gmetad collectors are running on a SuSE x86_64 machine and were compiled with 64bit libs. Ganglia is deployed in a hierarchy and reporting is done via unicast. In essence, this isolates the problem to the gmetads only, as the gmonds report on groups with less than 4TB RAM (but it may definitely surface there someday as well). Thanks, Alex --- This SF.net email is sponsored by: Splunk Inc. Do you grep through log files for problems? Stop! Download the new AJAX search engine that makes searching your log files as easy as surfing the web. DOWNLOAD SPLUNK! 
http://ads.osdn.com/?ad_id=7637alloc_id=16865op=click ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
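The symptom Alex describes is what a 32-bit unsigned sum produces once the true total passes 2**32. The sketch below only illustrates that arithmetic; it assumes memory is reported in KB (as Alex says) and that the summary is accumulated in a uint32, which matches his description but is not confirmed from the gmetad source here. The 1100-host grid is hypothetical.

```python
UINT32_MODULUS = 2 ** 32  # a uint32 holds 0 .. 2**32 - 1


def sum_uint32(values):
    """Add values the way a 32-bit unsigned accumulator would (wraps around)."""
    total = 0
    for value in values:
        total = (total + value) % UINT32_MODULUS
    return total


KB_PER_GB = 1024 * 1024

# Hypothetical grid: 1100 hosts with 4 GB each, reported in KB.
hosts_kb = [4 * KB_PER_GB] * 1100

true_total = sum(hosts_kb)
wrapped = sum_uint32(hosts_kb)

print(f"actual total : {true_total // KB_PER_GB} GB")
print(f"uint32 total : {wrapped // KB_PER_GB} GB")
```

With values in KB the wrap happens at exactly 4 TB, which fits the subject line of this thread; reporting in MB would only push the limit out by a factor of 1024 rather than remove it.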
Re: [Ganglia-general] Unicast issue
Markus,

if you want unicast, I would leave out the bind thing. That is for multicast, AFAIK.

telnet w.x.y.z 8649

should give you a correct list of metrics.

Cheers Martin

--- Markus Törnqvist [EMAIL PROTECTED] wrote:

Hi! I'm experiencing the weirdest issue here with unicasting; not even the mail archives helped so I hope someone here can give me a hand. Shouldn't it suffice to have the config file look like this:

udp_send_channel {
  host = w.x.y.z
  port = 8649
}
udp_recv_channel {
  bind = w2.x2.y2.z2
  port = 8649
}

for those parts? Nothing anywhere that points to multicasts? Right now, with that kind of configuration, I get an empty result set;

<GANGLIA_XML VERSION="3.0.2" SOURCE="gmond">
<CLUSTER NAME="unspecified" LOCALTIME="1133291540" OWNER="unspecified" LATLONG="unspecified" URL="unspecified">
</CLUSTER>
</GANGLIA_XML>
Connection closed by foreign host.

It's somewhat annoying because we can't really use multicast, and even if we did it seems some very faux IPs are sent back, which may be another error on my part, but irrelevant if it's due to multicasting.. Any help is highly appreciated, thanks!

-- mjt

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Unicast issue
Markus, that is still a multicast configuration. Remove both binds and the mcast_join. Could you post the complete gmond.conf (IP-censored, if you must)? Thanks Martin --- Markus Törnqvist [EMAIL PROTECTED] wrote: /* You can specify as many udp_recv_channels as you like as well. */ udp_recv_channel { /* mcast_join = 239.2.11.71 bind = 239.2.11.71 bind = p.q.r.s */ port = 8649 } -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Unicast issue
Ramon, Markus,

actually, the one below works fine for me. The same config file is used on all gmond hosts in the cluster (actually pretty beautiful :-).

- host 172.17.17.103 receives the metrics from all participating gmonds.
- all other hosts will report empty metrics if queried. If you want them to report their own metrics, add a udp_send_channel for localhost.
- host 172.17.33.108 is the only one allowed to query the TCP port. This is the host where gmetad would be running (no gmond necessary on this host). If you leave out the acl, all hosts may query the TCP port.

The bind in the udp_recv_channel may be needed if you have more than one network interface and the traffic does not come in on the first one. For the udp_send_channel, no bind should ever be *necessary*. But I am really not sure about this.

udp_send_channel {
  host = 172.17.17.103
  port = 9649
}
udp_recv_channel {
  port = 9649
}
tcp_accept_channel {
  acl {
    default = deny
    access {
      ip = 172.17.33.108
      mask = 32
      action = allow
    }
  }
  port = 9649
}

Cheers Martin

--- Ramon Bastiaans [EMAIL PROTECTED] wrote:

Actually, bind is needed to specify what local ip to bind to and listen on in a unicast setup. mcast_join is used when listening to multicasting. However, why are you using 2 different ip addresses in the recv and send channel? This will never work. You need to set your send channel to the same ip/port as your recv channel. Else you are sending the information to one place and listening for that information in another place.

Kind regards, - Ramon.

--
ing. Ramon Bastiaans
HPC - Systems Programmer
SARA - Computing and Networking Services
Kruislaan 415          PO Box 194613
1098 SJ Amsterdam      1090 GP Amsterdam
Mail: bastiaans ( a t ) sara ( d o t ) nl
Web: http://www.sara.nl/
Phone: +31 (0)20 592 80 19
Fax: +31 (0)20 668 31 67

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
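The effect of the acl block in Martin's tcp_accept_channel (default deny, one /32 allowed) can be reasoned about with a few lines of Python. This is a sketch of the intended semantics only: the function name is invented, and first-match evaluation is an assumption rather than something taken from the gmond source.

```python
import ipaddress


def acl_allows(client_ip, access_rules, default="deny"):
    """Decide whether a client may connect.

    access_rules is a list of (ip, mask, action) tuples mirroring the
    'access { ip = ... mask = ... action = ... }' blocks. The first rule
    whose network contains the client decides (assumed semantics);
    otherwise the default applies.
    """
    client = ipaddress.ip_address(client_ip)
    for ip, mask, action in access_rules:
        network = ipaddress.ip_network(f"{ip}/{mask}", strict=False)
        if client in network:
            return action == "allow"
    return default == "allow"


# The single rule from the config above: only the gmetad host may query.
RULES = [("172.17.33.108", 32, "allow")]

for candidate in ("172.17.33.108", "172.17.17.103", "172.17.33.109"):
    verdict = "allow" if acl_allows(candidate, RULES) else "deny"
    print(f"{candidate:15s} -> {verdict}")
```

A wider mask, for example 24, would admit the whole subnet of the gmetad host instead of a single address.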
Re: [Ganglia-general] Unicast issue
Hi,

some more info:
- udp_send_channel does not have a bind attribute, so just forget my earlier comment about it. Looking at the code sometimes helps.
- udp_recv_channel: if you specify mcast_join and bind with different IP addresses, no unicast processing will take place (from the gmond.conf man page).

And forget the earlier comment about localhost. It is a bit more complicated than that.

Martin

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
RE: [Ganglia-general] Unicast issue
Hi Lawrence,

--- [EMAIL PROTECTED] wrote:

Hi: My gmetad host is a double NIC machine that runs gmond and serves as the head node to one cluster. Gmond runs on the worker nodes. I cannot get the webfrontend to display statistics for the worker nodes.

Do you get any error messages in your webserver's logfiles? Is the webserver the same machine as the gmetad server?

From the gmetad host I can successfully get output from: telnet node1 8649, where node1 is a worker node.

This sounds good. gmetad is basically doing the same. On the host running the webserver, can you try to connect to gmetad? There are two ports: the XML port (default 8651) and the interactive port (default 8652) that the webfrontend uses.

$ telnet gmetad-host 8651
$ telnet gmetad-host 8652
quit

Here is my gmond.conf file:

/* global variables */
globals {
  mute = no
  deaf = no
  debug_level = 0
  setuid = yes
  user=nobody
  gexec = yes
  host_dmax = 3600
}
/* info about cluster */
cluster {
  name = X
  owner =
  latlong = N37.0303 W76.34
  url=http://xxx.xxx.xxx/web
}
/* info about host */
host {
  location =
}
/* channel to send multicast on mcast_channel:mcast_port */
udp_send_channel {
  mcast_join = 239.2.11.71
  port = 8649
  ttl=1
  /* mcast_if = eth1 */
}
/* channel to receive multicast from mcast_channel:mcast_port */
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
  /* mcast_if = eth1 */
}
/* channel to export xml on xml_port */
tcp_accept_channel {
  port = 8649
  /* your trusted_hosts assuming ipv4 mask*/
  acl{
    default=deny
    access {
      ip=10.1.1.0
      mask = 24
      action = allow
    }
    access {
      ip=xxx.xxx.xxx.xxx
      mask = 32
      action = allow
    }
  }
}
.. .. ..

Is it best to use unicast here? I don't understand why this won't work. Thanks

Hard to say what is better. How big is your cluster? Multicast will likely create more traffic than unicast. Also some switches create trouble for multicast.

Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Ganglia fails after building oscar cluster.
Hi Satish, first of all: which version of ganglia are you running? Cheers Martin

--- grid computing [EMAIL PROTECTED] wrote:

> Dear All, We are building an OSCAR cluster using OSCAR 4.0 on Red Hat 9.0. The installation goes through fine, everything gets installed, and the complete cluster works. But when we open the web browser and check ganglia, we only get the status graphs of the head node and not of the compute nodes. Can anyone help us out in this case? Regards, satish

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Rack,rank,plane
Stefan, plane = Ebene (German for "level") in this case. Just consider it the z-coordinate of the location. Martin

--- Stefan Schustereit [EMAIL PROTECTED] wrote:

> Hello all, since version 3.0.2 the location parameter is working again, and now I want to use it in our gmond configurations. I know what a rack is. Yes, I even know what a rank is. Maybe I come across as an idiot, but what is a plane? All I could find in my dictionaries was: Airplane: fixed-wing aircraft. Uhm, we have our hosts in a data center, not at the airport... Is anybody out there who can fill this gap in my knowledge? Thanks, Stefan
> -- Mapsolute GmbH Stefan Schustereit Map24 Systems and Networks http://www.map24.com

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] gmond is not reporting network stats
Alexei, three questions:
- Which version of ganglia/gmond are you running? If possible, please try out 3.0.2.
- Are you using the first or the second of the two NICs?
- How are your NICs named? The code drops everything starting with 'l' or 'o'.
Unfortunately, Solaris/AMD64 might not be very well tested. You can try to put some debug statements into the extract_if_data routine. Cheers Martin

--- Alexei Rodriguez [EMAIL PROTECTED] wrote:

> Greetings. We have been running ganglia on a set of Linux systems and have not had any issues. We are now trying ganglia on Solaris 10 on x86 (AMD) systems. The cpu/memory reporting is accurate, but we are not getting any network interface information. The systems have 2 network interfaces, but we only use 1 of them. Has anyone come across this problem? Any suggestions? thanks. Alexei

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Ganglia Not showing all the Graphs Properly,
Jai, just as an experiment, could you change all the Image* functions in that file to all-lowercase? E.g. ImageCreate -> imagecreate. The man pages show them that way. Cheers Martin

--- Jai Rangi [EMAIL PROTECTED] wrote:

> Hello Martin, Here is the error message I am getting for the pie chart:
>
>     [client client_machine] PHP Fatal error: Call to undefined function ImageCreate() in /var/www/html/ganglia/pie.php on line 117, referer: http://server_name:/ganglia/?c=Linux%20Clusterm=r=hours=descendinghc=4
>
> Thank you so much for your help... Jai

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Ganglia Not showing all the Graphs Properly,
Jai, Ramon is right. You are at least missing php-gd. I could reproduce your problem on my FC4 installation: I did not get the pie charts either. Installing php-gd via yum and restarting Apache solved the problem. You may also need gd-devel, which provides /usr/lib/libgd.so. Check with php -m or the following small PHP web script:

    [EMAIL PROTECTED] html]# cat /var/www/html/phpinfo.php
    <?php
    // Show all information, defaults to INFO_ALL
    phpinfo();
    ?>

Both should show gd support in some way. You can forget my e-mail regarding case-sensitivity. I just found out that (unlike almost everything else) function names are not case-sensitive in PHP. Weird decision. I know why I do not like PHP that much ... Cheers Martin

--- Ramon Bastiaans [EMAIL PROTECTED] wrote:

> That means you are still missing libgd and the php-gd extension, as I mailed before. - Ramon.
>
> Jai Rangi wrote:
>> Hello Martin, Here is the error message I am getting for the pie chart:
>>
>>     [client client_machine] PHP Fatal error: Call to undefined function ImageCreate() in /var/www/html/ganglia/pie.php on line 117, referer: http://server_name:/ganglia/?c=Linux%20Clusterm=r=hours=descendinghc=4
>>
>> Thank you so much for your help... Jai
>>
>> Martin Knoblauch wrote:
>>> Dear Jai, good to hear that 3.0.2 fixed most of the problems for you. I am not sure about the PIE stuff, but there should be some error messages in the log files of your web server. They could give you a hint. And there still may be PHP bugs preventing 5.0 from working correctly. As I said, we want to know. Cheers Martin
>>>
>>> --- Jai Rangi [EMAIL PROTECTED] wrote:
>>>> Thanks Martin, Upgrading to 3.0.2 worked just fine. Still missing PIE though, but I guess I am missing some package for that... Thank you so much, -Jai
>
> -- There are really only three types of people: Those who make things happen, those who watch things happen, and those who say, What happened?
> ing. R. Bastiaans HPC - Systems Programmer SARA - Computing and Networking Services Kruislaan 415 PO Box 194613 1098 SJ Amsterdam 1090 GP Amsterdam

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
RE: [Ganglia-general] PHP front end: has anyone modified the load metric color / computation?
Alexei, Richard seems to be closer to the solution. The problem is the definition of the function load_color in functions.php: everything above a load of 1.0 is considered a problem case. The same goes for the function load_image. It would likely make sense to introduce a scaling variable in conf.php (default 1.0) and work that into the two functions. Can you play around a bit and show us the code that makes you happy? The difficulty is that the threshold for high load is very subjective. On an HPC machine everything above 1 (per CPU or core) is likely bad. For a web/file/database server, this might be totally different. Cheers Martin

--- [EMAIL PROTECTED] wrote:

> Or you could hack the load value itself by dividing by 5 in cluster_view.php. regards, richard
> p.s. this is a bit yuk, but is certainly easy.
>
> -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Alexei Rodriguez Sent: 04 January 2006 07:05 To: ganglia-general@lists.sourceforge.net Subject: [Ganglia-general] PHP front end: has anyone modified the load metric color / computation?
>
> Greetings. First off, I want to say that ganglia rocks. It has been a very valuable tool in the short time we have had it deployed, and we are only using the very basic things. The load on our systems tends to be high (5.0 and above), on Solaris 10 systems (on AMD Opteron servers). The problem is that the graphs being generated are all of the same color (bright, bloody red). Given that all the systems have such high (relative) loads, I wanted to find the best way of changing the PHP front end to reflect my local colors and load scheme. If I change $load_colors in conf.php such that the number ranges are multiplied by 5, would that work, or is there a better way? I just want to make sure that the solution I implement does not make upgrades difficult :) thanks! Alexei

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
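The scaling-variable idea from the message above can be illustrated with a short sketch. It is written in Python purely for illustration and is not the PHP code from functions.php. The threshold table, the colour names and the parameter name load_scale are all assumptions made up for this example.

```python
# Hypothetical thresholds and colours, chosen only for this illustration.
# Each entry means: "at or above this scaled load, use this colour".
LOAD_COLORS = [
    (1.00, "red"),
    (0.75, "orange"),
    (0.50, "yellow"),
    (0.25, "green"),
]


def load_color(load, load_scale=1.0):
    """Map a load value to a colour after dividing it by load_scale.

    With load_scale = 1.0 a load of 1.0 or more is shown as "red".
    A site that considers a load of 5.0 normal could set load_scale = 5.0.
    """
    if load_scale <= 0:
        raise ValueError("load_scale must be positive")
    scaled = load / load_scale
    for threshold, color in LOAD_COLORS:
        if scaled >= threshold:
            return color
    return "blue"  # below the lowest threshold


print(load_color(1.5))                  # default scale
print(load_color(1.5, load_scale=5.0))  # same load, relaxed scale
```

Keeping the scale in one configuration variable, rather than editing the threshold table or dividing the load in the view code, means a site only has one local setting to carry across upgrades.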
Re: [Ganglia-general] PHP front end: has anyone modified the load metric color / computation?
--- Alexei Rodriguez [EMAIL PROTECTED] wrote: These changes accomplish what I was looking for. Thanks! Now I don't have a sea of red that my users ask me about ;) I do think this is a good knob to have. Thank you very much! Alexei Alexei, good. This will be in 3.0.3. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] intermittent blanks in graphs
--- Ben Hartshorne [EMAIL PROTECTED] wrote:

> Hi, I have been running ganglia for most of the last year, quite happily. My hosts are configured to send unicast data to a single gmetad server. Recently, large portions of the cluster's graphs are empty. A sample Any thoughts? What logs should I be looking at?

just a thought - are your cluster nodes time-synched? Are they [still] in-synch?

> [*] for those interested - I added an 8-hour and 3-day view; I find the 8-hour view the most useful by far. I also changed the size of the graphs to fit my 20 screen. Finally, I added a Disk summary graph, in addition to the Load, CPU, Memory, and Network. Is there any interest in patching these into the source?

definitely. Could you post a diff -u patch, preferably against 3.0.2? Cheers Martin

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Ganglia on Irix: gmond only
Luc, unfortunately, you need apr. That is why the code is shipped with ganglia and built along with it. apr is the Apache Portable Runtime library, which is used all over the code. Just grep for apr_ in the gmond directory. You do not mention what problems you have building it. Without that we cannot help you. Regards Martin

PS: In which line did you add the fcntl include for irix/metrics.c?

--- Luc Gauthier [EMAIL PROTECTED] wrote:

> Hi all, I just downloaded the source of Ganglia v3.0.2 and I'm planning to set it up on a couple of machines we have here. One of these is an SGI box running Irix 6.5.27, and I only want to have gmond running on it. All the other stuff will be running on a linux machine. I tried to compile the source out-of-the-box but got a couple of errors. The first one was easily fixed by following a tip given by Rene Salmon on the ganglia-developers mailing list a couple of months ago:
>
>> I just tried compiling ganglia-3.0.1 on Irix 6.5.27. All we really want is gmond; we will run gmetad, www, and other stuff on a better supported box running linux. So we did the configure and make for just gmond on the Irix box. The make failed, so I added this line to ganglia-3.0.1/srclib/libmetrics/irix/metrics.c
>>
>>     #include <fcntl.h>
>
> The make went a little further but broke when trying to compile 'apr'. Now I don't need apr. As I was saying, I only want gmond. Rene Salmon says, in the cited message, "So we did the configure and make for just gmond on the Irix box." I would like to do the same but unfortunately don't know how. Can anyone give me a hint? Thanks in advance for your help and have a good day, Luc Gauthier

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] intermittent blanks in graphs
Hi Ben, see below. In any case, could you please open a case in bugzilla and assign it to me? Martin

--- Ben Hartshorne [EMAIL PROTECTED] wrote:

> Everyone, thanks very much for your suggestions. I've replied to each below.
>
> On Tue, Jan 24, 2006 at 04:16:08AM -0800, Martin Knoblauch wrote:
>> just a thought - are your cluster nodes time-synched? Are they [still] in-synch?
>
> To within a second or so. I also have several gmetrics that are running at a 2-min interval, and they exhibit the same behavior. I would be surprised to see them reporting the same second, 2 minutes apart...

OK. That seems clean.

> [snip]
>
> On Tue, Jan 24, 2006 at 04:46:50PM -0500, Rick Mohr wrote:
>> Also, you could use rrdtool to generate the exact same graph that is shown on the web page for one of these metrics and dump it straight into a file. Then you could compare that with the image seen on the web page (to check for the unlikely event that the generated image is fine, but the web server is messing something up).
>
> hmm... that's a good suggestion. Here's an excerpt from 'rrdtool dump':
>
>     <!-- 2006-01-24 17:36:45 PST / 1138153005 --> <row><v> 9.315467e+00 </v></row>
>     <!-- 2006-01-24 17:37:00 PST / 1138153020 --> <row><v> 8.80e+00 </v></row>
>     <!-- 2006-01-24 17:37:15 PST / 1138153035 --> <row><v> 8.80e+00 </v></row>
>     <!-- 2006-01-24 17:37:30 PST / 1138153050 --> <row><v> 8.80e+00 </v></row>
>     <!-- 2006-01-24 17:37:45 PST / 1138153065 --> <row><v> 8.80e+00 </v></row>
>     <!-- 2006-01-24 17:38:00 PST / 1138153080 --> <row><v> NaN </v></row>
>     <!-- 2006-01-24 17:38:15 PST / 1138153095 --> <row><v> NaN </v></row>
>     <!-- 2006-01-24 17:38:30 PST / 1138153110 --> <row><v> NaN </v></row>
>     <!-- 2006-01-24 17:38:45 PST / 1138153125 --> <row><v> NaN </v></row>
>     <!-- 2006-01-24 17:39:00 PST / 1138153140 --> <row><v> NaN </v></row>
>
> Correspondingly, in the graph seen through ganglia, the data ends at about 17:38. I'm surprised it's registering these things every 15 seconds! I thought the period was slower than that (every min). I checked a few other rrds at different resolutions, and the NaN sections do correspond to the blank parts.
>
> So what does it mean? This tells us that the data is not getting put into the rrds. We know that the values are getting to the collector host, because clicking on the 'gmetric' portion of the website shows current data. But that data is not making it into the RRD somehow... I thought maybe the RRDs had become corrupted somehow, so I tried moving the rrds out of place so ganglia would recreate them all. The symptom was still in evidence.
>
> I don't see that error message, but while looking for it, I did see this error message:
>
>     Jan 24 17:24:18 localhost /usr/sbin/gmetad[30443]: RRD_update (/var/lib/ganglia/rrds/production/raiden-8-db1/users.rrd): conversion of 'min,' to float not complete: tail 'min,'
>
> This seems to relate to a recent change I made that I had forgotten about. :) I added the following line to my crontab:
>
>     */2 * * * * /usr/bin/gmetric --name=users --value=`w | head -1 | awk '{print $6}'` --type=int16
>
> The purpose of this line is to create a graph representing the number of users logged in to the host. It seems right to me - do any of you see a problem with this line?

Not sure. What does the live users metric from gmond look like? Definitely an interesting coincidence. In any case, we need to look into how gmetad operates with rrdtool. Unfortunately, I am more the gmond guy. Most important, we need to find out what triggers the behaviour.

> Thanks for your patience. In the course of this investigation, I have come across another strange happening. Some of the metrics seem to be ... off. I have no idea if these things are related. I was surprised to notice that many of my servers show excessive time in the CPU_report graph, as if all their time were spent in CPU Wait. That didn't seem right and also didn't jibe with the output of vmstat. Looking at the individual metrics that make up the cpu_report, I see:
>
>     * cpu_aidle: 1388
>     * cpu_idle: 66.00
>     * cpu_nice: 0.00
>     * cpu_system: 2.30
>     * cpu_user: 31.70
>     * cpu_wio: 1388
>
> All 6 of these metrics are supposed to be percentages. What's up with 1,388? Both cpu_aidle and cpu_wio are linearly decreasing graphs with the same slope (and same current value). They look to be the same back into the shown history, but it's hard to be exact. This seems to be the case (with different current values) on a number of hosts. Two .pngs of hosts exhibiting this behavior are at http://cryptio.net/~ben/ganglia/host_report.png and http://cryptio.net/~ben/ganglia/host_report2.png Note that these stats have all been created since I moved the old files out of place earlier today, so there is no chance of left-over corruption. Are my hosts dying? Restarting gmond on the host seems to have no effect. Would it be possible to create this kind of error by upgrading the server to gmetad 3.0.2
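Checking many RRD files by eye for NaN stretches, as done with the 'rrdtool dump' excerpt in the message above, gets tedious. The following is a rough Python sketch of how the dump text could be scanned for gaps. It assumes the usual dump layout of one commented timestamp followed by one row per line and only looks at the first value in each row. The function name nan_timestamps is made up for this example.

```python
import math
import re

# One dump row: "<!-- date / epoch --> <row><v> value </v>..."
ROW_RE = re.compile(r"<!--\s*(.*?)\s*/\s*(\d+)\s*-->\s*<row><v>\s*(\S+)\s*</v>")


def nan_timestamps(dump_text):
    """Return the epoch timestamps of all rows whose first value is NaN."""
    gaps = []
    for line in dump_text.splitlines():
        match = ROW_RE.search(line)
        if match and math.isnan(float(match.group(3))):
            gaps.append(int(match.group(2)))
    return gaps


sample = """
<!-- 2006-01-24 17:37:45 PST / 1138153065 --> <row><v> 8.80e+00 </v></row>
<!-- 2006-01-24 17:38:00 PST / 1138153080 --> <row><v> NaN </v></row>
<!-- 2006-01-24 17:38:15 PST / 1138153095 --> <row><v> NaN </v></row>
"""
print(nan_timestamps(sample))
```

Feeding it the output of rrdtool dump for a given .rrd file gives the start of each blank stretch, which can then be compared with timestamps in the gmetad log.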
Re: [Ganglia-general] intermittent blanks in graphs
--- Martin Knoblauch [EMAIL PROTECTED] wrote:

> error message:
>
>     Jan 24 17:24:18 localhost /usr/sbin/gmetad[30443]: RRD_update (/var/lib/ganglia/rrds/production/raiden-8-db1/users.rrd): conversion of 'min,' to float not complete: tail 'min,'
>
> This seems to relate to a recent change I made that I had forgotten about. :) I added the following line to my crontab:
>
>     */2 * * * * /usr/bin/gmetric --name=users --value=`w | head -1 | awk '{print $6}'` --type=int16
>
> The purpose of this line is to create a graph representing the number of users logged in to the host. It seems right to me - do any of you see a problem with this line?

Actually, on my system (FC4) your command results in:

    $ w | head -1 | awk '{print $6}'
    users,
    $

which is not really what you want to put into that metric :-) Apparently yours reports 'min,', which would be $4 on my system. The number of users would be $5. Maybe different versions of procps? Hmm. Weird. I just played around with the setting of LANG, and now the command reports 'load' instead of 'users,'. Really weird. Ha!!! The format of the w output changes with the uptime. The position of the user count definitely moves around. I guess you need to work on the awk: look for 'users' and take the token before that. Cheers Martin

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
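The advice in the message above (look for 'users' and take the token before it) can be sketched as follows. This is a minimal Python illustration, not something from the Ganglia sources. The function name parse_user_count and the two sample header lines are made up for the example. It also assumes an English locale, since the message notes that LANG influences the output.

```python
import re

# Matches e.g. "3 users" or "1 user" wherever it appears in the header line.
USERS_RE = re.compile(r"(\d+)\s+users?\b")


def parse_user_count(header_line):
    """Return the user count from the first line of `w` or `uptime`.

    Returns None when no "<number> user(s)" token is found, so the caller
    can skip publishing a metric instead of sending junk.
    """
    match = USERS_RE.search(header_line)
    return int(match.group(1)) if match else None


# Illustrative header lines: the user count sits in a different
# whitespace-separated column in each, which is what breaks a fixed awk field.
short_uptime = " 17:24:18 up 5 min,  3 users,  load average: 0.10, 0.20, 0.30"
long_uptime = " 17:24:18 up 12 days,  3:04,  1 user,  load average: 0.10, 0.20, 0.30"
print(parse_user_count(short_uptime))
print(parse_user_count(long_uptime))
```

Anchoring on the word rather than on a column position keeps the result stable as the uptime field grows from minutes to days.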
Re: [Ganglia-general] Ganglia on Irix: gmond only
Hi Luc, yes, the native IRIX make does not support --version. Actually, using gmake is the right thing to do. Your toolchain seems older than mine (automake-1.9.5, autoconf-2.59, libtool-1.5.20), but newer than the recommended one (1.6.3, 2.53, 1.4.2). What is the version of libtool? In any case, what worries me are the syntax errors from configure. Maybe you can check what they are about.

    Configuring expat ...
    configure: loading cache /home/master/shared/ganglia_src/ganglia-3.0.2/config.cache
    ./configure[1347]: syntax error at line 157 : `(' unexpected
    Configuring apr ...
    configure: loading cache /home/master/shared/ganglia_src/ganglia-3.0.2/config.cache
    ./configure[1398]: syntax error at line 157 : `(' unexpected
    Configuring libconfuse ...
    configure: loading cache /home/master/shared/ganglia_src/ganglia-3.0.2/config.cache
    ./configure[1428]: syntax error at line 157 : `(' unexpected

Also, could you produce a list of the files with those wrong paths (after configure)? And no, IRIX 6.5.24m vs. 6.5.27m should not make a difference. Thanks Martin

--- Luc Gauthier [EMAIL PROTECTED] wrote:

> Hi Martin, Quite surprisingly, I was unable to determine the version of 'make' that is installed on the machine. There is indeed no option or way to get the version. So I guess we could describe it as the version of make that comes with Irix 6.5.24m.

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] intermittent blanks in graphs
--- Ben Hartshorne [EMAIL PROTECTED] wrote:

>     Jan 24 17:24:18 localhost /usr/sbin/gmetad[30443]: RRD_update (/var/lib/ganglia/rrds/production/raiden-8-db1/users.rrd): conversion of 'min,' to float not complete: tail 'min,'
>
> This seems to relate to a recent change I made that I had forgotten about. :) I added the following line to my crontab:
>
>     */2 * * * * /usr/bin/gmetric --name=users --value=`w | head -1 | awk '{print $6}'` --type=int16

OK, as I discovered before, your command can put funny things like 'min,' into the metrics stream. Unfortunately, gmetric and gmond are naive enough to pass that through unchecked. I can now more or less reproduce your problem by inserting the following into the stream:

    gmetric --name=users --type=int16 --value=min,

This then appears in both the gmond and the gmetad XML. As a result, the report graphs on my cluster view show the gaps. As soon as I insert a number into the stream, the graphs work fine. But I only see the gaps in the cluster overview. The node displays are not affected (both in the cluster overview and on the node pages). It seems we need to make gmetric or gmond more robust against junk. Or we need to see what the problem in the web interface is. Or both :-) Martin

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
RE: [Ganglia-general] intermittent blanks in graphs
As you wish. You are old fashioned :-) Martin

--- [EMAIL PROTECTED] wrote:

> Call me old fashioned, but:
>
>     who | wc -l | awk '{print $1}'
>
> strikes me as safer. regards, richard

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] gmond not multicasting to other nodes
      value_threshold = 1.0
    }
    metric {
      name = cpu_sintr
      value_threshold = 1.0
    }
    */
    }
    collection_group {
      collect_every = 20
      time_threshold = 90
      /* Load Averages */
      metric { name = load_one value_threshold = 1.0 }
      metric { name = load_five value_threshold = 1.0 }
      metric { name = load_fifteen value_threshold = 1.0 }
    }
    /* This group collects the number of running and total processes */
    collection_group {
      collect_every = 80
      time_threshold = 950
      metric { name = proc_run value_threshold = 1.0 }
      metric { name = proc_total value_threshold = 1.0 }
    }
    /* This collection group grabs the volatile memory metrics every 40 secs and sends them at least every 180 secs. This time_threshold can be increased significantly to reduce unneeded network traffic. */
    collection_group {
      collect_every = 40
      time_threshold = 180
      metric { name = mem_free value_threshold = 1024.0 }
      metric { name = mem_shared value_threshold = 1024.0 }
      metric { name = mem_buffers value_threshold = 1024.0 }
      metric { name = mem_cached value_threshold = 1024.0 }
      metric { name = swap_free value_threshold = 1024.0 }
    }
    collection_group {
      collect_every = 40
      time_threshold = 300
      metric { name = bytes_out value_threshold = 4096 }
      metric { name = bytes_in value_threshold = 4096 }
      metric { name = pkts_in value_threshold = 256 }
      metric { name = pkts_out value_threshold = 256 }
    }
    /* Different than 2.5.x default since the old config made no sense */
    collection_group {
      collect_every = 1800
      time_threshold = 3600
      metric { name = disk_total value_threshold = 1.0 }
    }
    collection_group {
      collect_every = 40
      time_threshold = 180
      metric { name = disk_free value_threshold = 1.0 }
      metric { name = part_max_used value_threshold = 1.0 }
    }

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] intermittent blanks in graphs
Hi Ben, just for your info: in 3.0.3 gmetric will have a check to prevent non-numbers from being inserted into the XML stream. In the meantime the patch below may help you. I will discuss it on the developers list. Martin

--- Martin Knoblauch [EMAIL PROTECTED] wrote:

> Ben, are you able to rebuild gmetad with the following quick fix? This seems to solve it for me:
>
>     --- rrd_helpers.c-orig 2006-01-25 16:14:16.0 +0100
>     +++ rrd_helpers.c 2006-01-25 16:10:27.0 +0100
>     @@ -54,7 +54,7 @@
>           {
>              err_msg("RRD_update (%s): %s", rrd, rrd_get_error());
>              pthread_mutex_unlock( &rrd_mutex );
>     -        return 1;
>     +        return 0;
>           }
>        /* debug_msg("Updated rrd %s with value %s", rrd, val); */
>        pthread_mutex_unlock( &rrd_mutex );
>
> --- Martin Knoblauch [EMAIL PROTECTED] wrote:
>
>> --- Ben Hartshorne [EMAIL PROTECTED] wrote:
>>
>>>     Jan 24 17:24:18 localhost /usr/sbin/gmetad[30443]: RRD_update (/var/lib/ganglia/rrds/production/raiden-8-db1/users.rrd): conversion of 'min,' to float not complete: tail 'min,'
>>>
>>> This seems to relate to a recent change I made that I had forgotten about. :) I added the following line to my crontab:
>>>
>>>     */2 * * * * /usr/bin/gmetric --name=users --value=`w | head -1 | awk '{print $6}'` --type=int16
>>
>> OK, as I discovered before, your command can put funny things like 'min,' into the metrics stream. Unfortunately, gmetric and gmond are naive enough to pass that through unchecked. I can now more or less reproduce your problem by inserting the following into the stream:
>>
>>     gmetric --name=users --type=int16 --value=min,
>>
>> This then appears in both the gmond and the gmetad XML. As a result, the report graphs on my cluster view show the gaps. As soon as I insert a number into the stream, the graphs work fine. But I only see the gaps in the cluster overview. The node displays are not affected (both in the cluster overview and on the node pages). It seems we need to make gmetric or gmond more robust against junk. Or we need to see what the problem in the web interface is. Or both :-) Martin

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
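Until a gmetric with such a check is available, a wrapper script could refuse to publish anything that is not a number. The following Python sketch shows one way to validate a value for an int16 metric before handing it to gmetric. It only illustrates the idea and is not the check that went into gmetric. The function name is_valid_int16 is invented for this example.

```python
def is_valid_int16(text):
    """Return True if text parses as an integer that fits a signed 16-bit metric."""
    try:
        value = int(text.strip())
    except ValueError:
        # Covers junk such as 'min,' or 'users,' as well as empty strings.
        return False
    return -32768 <= value <= 32767


for candidate in ["3", "min,", "users,", "40000"]:
    # A wrapper would call gmetric only when the check passes.
    print(repr(candidate), is_valid_int16(candidate))
```

Dropping a sample is usually preferable to sending a bad one, since a single non-numeric value is enough to make the RRD update fail.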
Re: [Ganglia-general] 2 clusters in same subnet
--- regatta [EMAIL PROTECTED] wrote:

> Hi everyone, I have one comment about the ganglia documentation and one question. My comment is that there is no real documentation on how to use ganglia (please don't ask me to read http://ganglia.info/docs/, it's the worst document I ever saw; it assumes that you are an expert in ganglia).

You may have a point here. Why not become an expert and write new docs? :-)

> Now my question :) : I have two clusters in the same subnet (each cluster has 24 nodes). Why they are in the same subnet is a different subject, but they must be :) How can I configure one node in each cluster to run gmetad and the PHP web frontend to display the 2 clusters as 2 clusters or grids? What I did is install gmond on all nodes (in both clusters) and change /etc/gmond.conf in cluster A to:
>
>     cluster { name = "Cluster A" }
>
> and in cluster B to:
>
>     cluster { name = "Cluster B" }
>
> But when I go to gmetad, I find that it sometimes collects them all together, or it puts some nodes of A into B and some of B into A! Any help?

You need to separate the ports on which your clusters multicast. The default is 8649. Select another port (e.g. 8648) for your second cluster. Then you need to define two data sources in gmetad.conf (you only need one gmetad):

    data_source "cluster 1" node_in_cluster_1:8649
    data_source "cluster 2" node_in_cluster_2:8648

That should do the trick. Martin

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Pointers on architecting a largescale ganglia setup??
--- Joel Krauska [EMAIL PROTECTED] wrote: Rick Mohr wrote: The unicast approach does save on gmond memory usage as you mentioned. It's up to each site to determine just how much memory the metrics will take up, and if it is considered a significant amount. (But it can get somewhat big on a large cluster like mine with a bunch of added metrics.) Can you share any code you've written for additional metrics? just in case you did not know: http://ganglia.sourceforge.net/gmetric/ Everyone is invited to contribute to the repository. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Pointers on architecting a largescale ganglia setup??
Joel, gmetric (at least in 3.0.x) takes a -c argument with which you can specify the path to gmond.conf. gmetric will then use any transport defined for gmond. Simple, isn't it? Martin

--- Joel Krauska [EMAIL PROTECTED] wrote:

> Martin Knoblauch wrote:
>> just in case you did not know: http://ganglia.sourceforge.net/gmetric/
>
> Hadn't known about this -- thanks. Question: I just went and converted to using UDP unicasts. The gmetric man page seems to imply that it only supports the multicast comm method. Is there a way for gmetric just to report to the local gmond?
>
>     DESCRIPTION The Ganglia Metric Client (gmetric) announces a metric value to all Ganglia Monitoring Daemons (gmonds) that are listening on the cluster multicast channel.
>
> I'll likely figure this out soon, but I thought I'd bring it up. --joel

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
[Ganglia-general] gmond binary for hp-ux-11.11/hppa
Hi, is there anybody who could provide me with a 3.0.2 gmond executable for the following arch: HP-UX hdsdm3 B.11.11 U 9000/800? Unfortunately the systems I want to look at have no decent development environment. TIA Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Solaris 9 compile of 3.02
Hi Russell, gcc (3.X onward) is the only compiler that we (I) recommend on any platform. And we never promised anything else :-) Cheers Martin

--- Russell Nordquist [EMAIL PROTECTED] wrote:

> I am having problems compiling ganglia 3.02 on a Sparc Solaris 9 machine using Forte 7 (an older version of Sun Studio). This is the error I get:
>
>     [EMAIL PROTECTED]:~/ganglia-3.0.2$make
>     make all-recursive
>     Making all in srclib
>     Making all in libmetrics
>     make all-recursive
>     Making all in solaris
>     source='metrics.c' object='metrics.lo' libtool=yes \
>     DEPDIR=.deps depmode=none /bin/bash ../build/depcomp \
>     /bin/bash ../libtool --tag=CC --mode=compile cc -DHAVE_CONFIG_H -I. -I. -I.. -I.. -I../lib -g -D__STDC__ -D_POSIX_C_SOURCE=199506L -DHAVE_STRERROR -c -o metrics.lo metrics.c
>     cc -DHAVE_CONFIG_H -I. -I. -I.. -I.. -I../lib -g -D__STDC__ -D_POSIX_C_SOURCE=199506L -DHAVE_STRERROR -c metrics.c -KPIC -DPIC -o .libs/metrics.o
>     command line: warning: macro redefined: __STDC__
>     /usr/include/sys/resource.h, line 126: incomplete struct/union/enum timeval: ru_utime
>     ../unpifi.h, line 34: syntax error before or at: u_char
>     ../unpifi.h, line 34: cannot recover from previous errors
>     cc: acomp failed for metrics.c
>     *** Error code 1
>     make: Fatal error: Command failed for target `metrics.lo'
>     Current working directory /home/russelln/ganglia-3.0.2/srclib/libmetrics/solaris
>     *** Error code 1
>     [continues]
>
> Googling has led me to the same error for others (including sunfreeware) without a posted solution. I have not tried with gcc since it is not part of our local builds. Has anyone successfully compiled ganglia for Solaris 9? If so, what compiler did you use? thanks russell

-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Solaris 9 compile of 3.02
Russell, personally I can only report successful builds on Solaris 8 (64-bit Sparc). I know of at least one guy who was able to build successfully on Solaris 10 (AMD-64). I see no reason why Solaris 9 should be a problem. Martin --- Russell Nordquist [EMAIL PROTECTED] wrote: Ok. I can understand that. Any reports of successful Solaris 9 compiles? with gcc? russell On 2/2/06 3:43 PM, Martin Knoblauch wrote: Hi Russell, gcc (3.X onward) is the only compiler that we (I) recommend on any platform. And we never promised anything else :-) Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Ganglia on Irix: gmond only
Hi Luc, good to know you are happy now. As I do not remember exactely what I told you back then, would you be willing to do a final experiment? Just remove all config.cache files from the tree before doing configure (with the original configure script). And/or do a make distclean before. Could be a sed problem. Mine is newer (4.1.4), but we never had a requirement on the sed-version. Cheers Martin --- Luc Gauthier [EMAIL PROTECTED] wrote: Hi Martin, I finally had time to come back to my problem compiling ganglia on Irix... Thanks to your hint, I managed to have everything working ! If you remember, you suggested I should investigate those error messages I got when running the 'configure' script: Configuring expat ... configure: loading cache /home/master/shared/lgauthie/ganglia/ganglia-3.0.2/config.cache ./configure[1347]: syntax error at line 157 : `(' unexpected Configuring apr ... configure: loading cache /home/master/shared/lgauthie/ganglia/ganglia-3.0.2/config.cache ./configure[1398]: syntax error at line 157 : `(' unexpected Configuring libconfuse ... configure: loading cache /home/master/shared/lgauthie/ganglia/ganglia-3.0.2/config.cache ./configure[1428]: syntax error at line 157 : `(' unexpected Those error messages did not stop the 'configure' script from running but I had those strange error messages (hard links not adjusted to our directory architecture) when gmake'ing: gmake[3]: *** No rule to make target `/home/scratch/ganglia-cvs/ganglia- post-2_5_7/monitor-core/srclib/apr/build/apr_rules.mk'. Stop. 
So I decided to take a look at line 157 of the config.cache file that was giving the errors at configuration time and here it is: test ${lt_cv_sys_global_symbol_to_c_name_address+set} = set || lt_cv_sys_global_symbol_to_c_name_address='sed -n -e '\''s/^\'^: \([^ \'^ ]*\) $/ {\\1\, (lt_ptr) 0},/p'\'' -e '\''s/^\'^[BCDEGRST] \([^ \'^ ]*\) \([^\'^ ]*\)$/ {\2, (lt_ptr) \\2},/p'\' Unfortunately, I must admit I am not advanced enough to rapidly see where is that unexpected '('. And since I did not have enough time to investigate it, I decided to refrain 'expat', 'apr' and 'libconfuse' from using the cache file at configuration time. To do so, I simply modified three lines in the 'configure' file (line number between square brackets): [2119] cd srclib/expat ./configure --cache-file= $ganglia_popdir/config.cache became [2119] cd srclib/expat ./configure [2123] cd srclib/apr ./configure --cache-file= $ganglia_popdir/config.cache became [2123] cd srclib/apr ./configure [2127] cd srclib/confuse ./configure --cache-file= $ganglia_popdir/config.cache --disable-nls became [2127] cd srclib/confuse ./configure --disable-nls So those three modules do the whole configuration step from scratch. Of course, it takes more time and this is not the best way to do the job but in the end, the gmake step goes all the way to the end and I get the binaries I've been waiting for ! :) Now that I have them, I can go on and try to set up ganglia on our network. I'll come back to the list if I have problems there. 
Thanks again for your help and, if it can help, here is a summary of the tools I have on the machine I compiled ganglia on:
- OS: IRIX 6.5.24m
- sed: GNU sed version 4.0.7
- test: test (GNU sh-utils) 2.0
- automake: automake (GNU automake) 1.7.5
- autoconf: autoconf (GNU Autoconf) 2.57
- gmake: GNU Make 3.80
- install: install (fileutils) 4.1
Best regards, Luc Gauthier
On Wednesday 25 January 2006 at 06:19 -0800, Martin Knoblauch wrote: Hi Luc, yes, the native IRIX does not support --version. Actually, using gmake is the right thing to do. Your toolchain seems older than mine (automake-1.9.5, autoconf-2.59, libtool-1.5.20), but newer than the recommended (1.6.3, 2.53, 1.4.2). What is the version of libtool? In any case, what worries me are the syntax errors from configure. Maybe you can check what they are about.
Configuring expat ...
configure: loading cache /home/master/shared/ganglia_src/ganglia-3.0.2/config.cache
./configure[1347]: syntax error at line 157 : `(' unexpected
Configuring apr ...
configure: loading cache /home/master/shared/ganglia_src/ganglia-3.0.2/config.cache
./configure[1398]: syntax error at line 157 : `(' unexpected
Configuring libconfuse ...
configure: loading cache /home/master/shared/ganglia_src/ganglia-3.0.2/config.cache
./configure[1428]: syntax error at line 157 : `(' unexpected
Also, could you reproduce a list of files with those
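The cleanup suggested in this thread (remove every config.cache file from the source tree before re-running configure) can be sketched in a few lines. This is only an illustration in Python, which the thread itself does not use; the function names are hypothetical, and running make distclean as Martin also suggests may be the simpler route.

```python
import os


def find_config_caches(root):
    """Return the sorted paths of all files named config.cache below root."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "config.cache" in filenames:
            hits.append(os.path.join(dirpath, "config.cache"))
    return sorted(hits)


def remove_config_caches(root):
    """Delete every config.cache below root and return the removed paths."""
    removed = find_config_caches(root)
    for path in removed:
        os.remove(path)
    return removed
```

Calling remove_config_caches on the top of the unpacked ganglia tree before ./configure would force the bundled expat, apr and libconfuse configure runs to start from scratch, which is the same effect Luc obtained by editing the configure script.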
Re: [Ganglia-general] Using Ganglia to monitor JVM based services and DB servers.
Dear Miguel,
--- José Miguel Pereira Tavares [EMAIL PROTECTED] wrote: Hi all! As far as I could find out, gmond has a set of metrics built in at compile time (a rather convenient set most of the time). If more information regarding the node or the software running on that node is necessary, then gmetric can be used to publish that metric. Assuming the previous paragraph is correct, I would like to ask a few questions, and I will also offer some possible answers/thoughts about them:
1. Doesn't using gmetric (forking a process) consume a bit too much of the system resources?
This really depends on: a) the resources of your system, b) the number of new metrics that you want to insert into the XML stream, and c) the frequency at which you are calling gmetric. I personally would not worry too much. At least not before measuring the impact on a live system :-) In any case, if gmetric is too heavy for you, you could always integrate your metrics into the metrics reported by gmond. This is not trivial, but not impossible. The drawback is that it may make that gmond incompatible with the standard one. Another solution would be to look at the gmetric source and write your own version that collects all interesting metrics and submits them together. That way you would reduce the number of forks. It seems you have already contemplated this.
2. I need to monitor a JVM profile. Has anyone tried something similar? Any thoughts or ideas on the best way to achieve this with Ganglia?
3. I also need to monitor some database services... thoughts and ideas most welcome.
For both 2 and 3 - you need to retrieve the metrics and feed them into the stream. Ok, probably not what you wanted to hear :-) Cheers Martin
-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
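As a rough illustration of feeding externally collected values (JVM or database figures, say) into the stream through gmetric, here is a minimal Python sketch. It assumes gmetric is on the PATH and uses only the -n, -v, -t and -u options; the metric names are made up, and the collector that produces the values is left out entirely. Note that this still forks one gmetric process per metric, so it does not address the fork-count concern; it only gathers the publishing in one place so that the cost can be measured.

```python
import subprocess


def gmetric_argv(name, value, mtype, units=""):
    """Build the argument list for one gmetric call."""
    argv = ["gmetric", "-n", name, "-v", str(value), "-t", mtype]
    if units:
        argv += ["-u", units]
    return argv


def publish(metrics, run=subprocess.call):
    """Publish (name, value, type, units) tuples; return the failure count.

    `run` defaults to subprocess.call and can be replaced for dry runs.
    """
    failures = 0
    for name, value, mtype, units in metrics:
        if run(gmetric_argv(name, value, mtype, units)) != 0:
            failures += 1
    return failures


if __name__ == "__main__":
    # Hypothetical metric names and values, for illustration only.
    sample = [("jvm_heap_used", 512, "uint32", "MB"),
              ("db_connections", 17, "uint16", "")]
    print(publish(sample, run=lambda argv: print(" ".join(argv)) or 0))
```

Passing a different run callable, as in the dry run above, makes it easy to time the real calls on a live system before deciding whether a custom gmetric replacement is worth the effort.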
Re: [Ganglia-general] No RRDs Created On MacOSX
--- Mike Walker [EMAIL PROTECTED] wrote: However, if I run gstat -a I do get the data I would expect. But when I run anything with gmetric I get nothing (no errors, no output). Of course I might be doing gmetric wrong, so here is what I tried. 'gmetric -n mem_free -v mem_free -t uint32'
Mike, this could be a real killer. Up to and including 3.0.2, gmetad has a bug that will stop any host reporting metrics if an integer/floating typed metric has a value that does not represent a number. Unfortunately, gmetric is not very picky about the strings that get passed via -v. The next release (3.0.3, no planned date) will have a fix that makes gmetric check whether the -v string translates into a number. In the meantime, you could try the following fix to gmetad:
--- rrd_helpers.c-orig 2006-01-25 16:14:16.0 +0100
+++ rrd_helpers.c 2006-01-25 16:10:27.0 +0100
@@ -54,7 +54,7 @@
       {
          err_msg("RRD_update (%s): %s", rrd, rrd_get_error());
          pthread_mutex_unlock( &rrd_mutex );
-         return 1;
+         return 0;
       }
    /* debug_msg("Updated rrd %s with value %s", rrd, val); */
    pthread_mutex_unlock( &rrd_mutex );
If it fixes your problems, please report back. Cheers Martin
-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
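Until a gmetric with that check is available, a wrapper script can refuse non-numeric values itself. The following Python sketch shows one way to do it; it is not the check that went into gmetric, the function name is hypothetical, and the range limits per type are an assumption based on the usual widths of the gmetric type names.

```python
import math

# Assumed value ranges for gmetric's integer type names.
INT_RANGES = {
    "int8": (-2**7, 2**7 - 1),
    "uint8": (0, 2**8 - 1),
    "int16": (-2**15, 2**15 - 1),
    "uint16": (0, 2**16 - 1),
    "int32": (-2**31, 2**31 - 1),
    "uint32": (0, 2**32 - 1),
}
FLOAT_TYPES = ("float", "double")


def value_matches_type(value, mtype):
    """Return True if the string `value` is acceptable for metric type `mtype`."""
    if mtype == "string":
        return True
    if mtype in INT_RANGES:
        try:
            number = int(value, 10)
        except ValueError:
            return False
        low, high = INT_RANGES[mtype]
        return low <= number <= high
    if mtype in FLOAT_TYPES:
        try:
            number = float(value)
        except ValueError:
            return False
        # Reject nan/inf, which parse as floats but are not usable samples.
        return math.isfinite(number)
    return False  # unknown type name
```

With this in place, the command from the report above (-v mem_free -t uint32) would be rejected before it ever reaches gmetad, because the string mem_free does not parse as an integer.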
Re: [Ganglia-general] Missing stats on Irix
--- Alex Balk [EMAIL PROTECTED] wrote: Network stats are also missing on HPUX. Not hardly a dying species... yeah, but I would not call it growing like fungus either :-) If only I had the time, the system and a working gcc environment for HP-UX. At least I would be able to compile 3.0.2. I still need a binary for a HPPA machine under 11.11. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Ganglia truncating larger status messages
Hi Ian, you beat me to this advice :-) Chris - please assign the bug to me. Not that I know how to fix this in the short term, as upgrading the whole of apr might be a challenge. I could imagine a way to specify a different apr location with configure. Cheers Martin
--- Ian Cunningham [EMAIL PROTECTED] wrote: Chris, Thanks for the notice. Please file a bug with your workaround here: http://bugzilla.ganglia.info/ Thanks again, Ian
Chris Black wrote: Hello list, We're using Ganglia at the University of Michigan to monitor cluster nodes, and we found an issue with 3.0.2. When sending status messages from gmond to gmetad, messages over ~66600 bytes would be truncated and the trailing /GANGLIA tag (among a few others at the end) would be missing, and the gmetad host would mark that client as missing. We found the problem to be in version 0.9.5 of the Apache Portable Runtime (APR) that shipped with Ganglia 3.0.2. Upgrading to the newest APR (0.9.7) fixed the problem. We used the following procedure to correct the problem on Mac OS X Server 10.4.4 Build 8G32:
1) untar the ganglia sources
2) cd into the ganglia-3.0.2/srclib directory
3) remove the 'apr' directory
4) download the 0.9.7 sources of apr into this directory (ganglia-3.0.2/srclib)
5) untar the apr sources
6) rename the resulting apr-0.9.7 directory to apr (or create a symlink)
7) move up one directory to ganglia-3.0.2
8) build/install as normal
Hopefully this will be of assistance to anyone seeing a similar problem. Chris Black LSA-IT University of Michigan
-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
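One way to see whether a given gmond is affected by the truncation described above is to read its XML dump and test whether it parses. The Python sketch below assumes gmond answers on TCP port 8649 (the port used with telnet elsewhere on this list) and sends its whole report before closing the connection; the function names are hypothetical.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host, port=8649, timeout=10.0):
    """Read everything gmond sends on its XML port until it closes the socket."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def is_well_formed(xml_bytes):
    """Return True if the bytes parse as XML; a truncated report will not."""
    try:
        ET.fromstring(xml_bytes)
    except ET.ParseError:
        return False
    return True
```

A report that lost its closing tags fails to parse, so is_well_formed(fetch_gmond_xml(host)) returning False on a large cluster would be consistent with the APR 0.9.5 problem, although other causes of a broken stream are of course possible.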
[Ganglia-general] First prerelease of ganglia-3.0.3 ready for testing
Hi friends of Ganglia, please find the first test drop of the upcoming ganglia-3.0.3 release at: http://www.knobisoft.de/ganglia/ganglia-3.0.3.200602231926-apr0.9.7.tar.gz This is again planned to be a minor bug-fix release which is supposed to be compatible with earlier 3.0.X releases of ganglia. So far, the differences against 3.0.2 are minimal:
- minor fixes to the documentation
- make gmetric more robust against illegal numeric values, which could cause gmetad to stop recording complete nodes.
- fix the libconfuse.spec file (Copyright -> License, Swedish Locales).
- fix make check. Expat would not know how to do it.
- AIX: fix proc_total, proc_run, swap_free and swap_total. Implement mem_cached
- introduce a scaling factor for the load -> colour-code transformation in the web-frontend. The default of 1.0 is only good for HPC nodes. Fileservers and similar would go red too early.
- replace apr with version 0.9.7. This is supposed to fix some problems with large chunks of XML being truncated. In fact this is the biggest change in this release and needs testing !!!
So, please download and test. Especially on the non-Linux platforms. Thanks Martin
-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] gmond install problems
Dan, you actually need to rebuild gmond on your box. Sorry. Either get the tarball, or get the source RPM and rebuild from that. Martin
--- Dan Roberts [EMAIL PROTECTED] wrote: Hello All, How do I get around this error without going to glibc 2.3.3 as suggested?
sudo rpm -Uvh ganglia-3.0.2-1/ganglia-gmond-3.0.2-1.i386.rpm
Preparing... ### [100%]
1:ganglia-gmond ### [100%]
Starting GANGLIA gmond: /usr/sbin/gmond: relocation error: /usr/sbin/gmond: symbol sys_siglist, version GLIBC_2.3.3 not defined in file libc.so.6 with link time reference [FAILED]
rpm -qa | grep glibc | sort
glibc-2.3.2-27.9.7
glibc-common-2.3.2-27.9.7
glibc-devel-2.3.2-27.9.7
glibc-kernheaders-2.4-8.10
glibc-profile-2.3.2-27.9.7
glibc-utils-2.3.2-27.9.7
I have another system which supports the same version of gmond using a slightly different version of glibc, as shown below. How can I get the above system working correctly without upgrading to glibc 2.3.3?! I noted that my working system below has the glibc-headers rpm installed while my failing system doesn't. Might this be the problem? If yes, could someone point me to the location of the rpm which I could download? I couldn't find it on the net. Thanks for any help! Dan
rpm -qa | grep glibc
glibc-headers-2.3.2-95.20
glibc-kernheaders-2.4-8.34
glibc-2.3.2-95.20
glibc-common-2.3.2-95.20
glibc-utils-2.3.2-95.20
glibc-devel-2.3.2-95.20
compat-glibc-7.x-2.2.4.32.6
glibc-profile-2.3.2-95.20
-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Changing node name
Richard, that is on my todo list. Finding the time is the only issue ... Martin --- Richard Lefebvre [EMAIL PROTECTED] wrote: Is there a way to set the nodename in gmond.conf, instead of using a reverse hostname lookup on the IP? I'm running ganglia 3.0.1 on a Cray XD1 and the IP gmond uses is the external one instead of the internal one. The external IP has no hostname associated with it; it is given at random. Richard -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
[Ganglia-general] Upgrade to apr-0.9.7
Hi, everyone monitoring ganglia-cvs will by now have seen that I have upgraded the apr sources within the ganglia CVS tree to version 0.9.7. This was done to fix some reported problems with the old version. So, if you are using CVS sources to build ganglia, please do a cvs update -Pd or just do a new checkout. The new tree builds fine on my FC4 notebook, including RPM building. I plan to do an April Fools' tarball release very soon. Please check in anything you think is valuable. Cheers Martin PS: Sorry for the many notification mails on ganglia-cvs. -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] gmond unreliable on one cluster, must be constantly restarted
-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML
DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=proc_total VAL=128 TYPE=uint32 UNITS= TN=8 TMAX=950 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=mem_free VAL=1328356 TYPE=uint32 UNITS=KB TN=8 TMAX=180 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=mem_shared VAL=0 TYPE=uint32 UNITS=KB TN=8 TMAX=180 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=mem_buffers VAL=199232 TYPE=uint32 UNITS=KB TN=8 TMAX=180 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=mem_cached VAL=4569200 TYPE=uint32 UNITS=KB TN=8 TMAX=180 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=swap_free VAL=2101964 TYPE=uint32 UNITS=KB TN=8 TMAX=180 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=gexec VAL=ON TYPE=string UNITS= TN=188 TMAX=300 DMAX=0 SLOPE=zero SOURCE=gmond/ METRIC NAME=bytes_out VAL=6066.85 TYPE=float UNITS=bytes/sec TN=8 TMAX=300 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=bytes_in VAL=203006.30 TYPE=float UNITS=bytes/sec TN=8 TMAX=300 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=numthreads VAL=2 TYPE=int8 UNITS= TN=324 TMAX=60 DMAX=0 SLOPE=both SOURCE=gmetric/ METRIC NAME=numjobs VAL=2 TYPE=int8 UNITS= TN=324 TMAX=60 DMAX=0 SLOPE=both SOURCE=gmetric/ /HOST Good host: gmond: Processing a Ganglia_message from goodhost gmetad: Updating host goodhost, metric numjobs server_thread() received request /Opteron_Production-Desktop_Droid_Cluster/goodhost from 127.0.0.1 XML: HOST NAME=goodhost IP=10.73.16.225 REPORTED=1143682838 TN=1 TMAX=20 DMAX=0 LOCATION=unspecified GMOND_STARTED=1143137198 METRIC NAME=cpu_num VAL=2 TYPE=uint16 UNITS=CPUs TN=838 TMAX=1200 DMAX=0 SLOPE=zero SOURCE=gmond/ METRIC NAME=disk_total VAL=71.047 TYPE=double UNITS=GB TN=2039 TMAX=1200 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=disk_free VAL=46.667 TYPE=double UNITS=GB TN=178 TMAX=180 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=cpu_speed VAL=2411 TYPE=uint32 UNITS=MHz TN=838 TMAX=1200 DMAX=0 SLOPE=zero SOURCE=gmond/ METRIC NAME=part_max_used VAL=70.5 TYPE=float UNITS= TN=178 TMAX=180 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=mem_total 
VAL=8147640 TYPE=uint32 UNITS=KB TN=838 TMAX=1200 DMAX=0 SLOPE=zero SOURCE=gmond/ METRIC NAME=swap_total VAL=2104504 TYPE=uint32 UNITS=KB TN=838 TMAX=1200 DMAX=0 SLOPE=zero SOURCE=gmond/ METRIC NAME=boottime VAL=1142553979 TYPE=uint32 UNITS=s TN=838 TMAX=1200 DMAX=0 SLOPE=zero SOURCE=gmond/ METRIC NAME=machine_type VAL=x86_64 TYPE=string UNITS= TN=838 TMAX=1200 DMAX=0 SLOPE=zero SOURCE=gmond/ METRIC NAME=os_name VAL=Linux TYPE=string UNITS= TN=838 TMAX=1200 DMAX=0 SLOPE=zero SOURCE=gmond/ METRIC NAME=os_release VAL=2.6.13.4_K8+NUMA+NV TYPE=string UNITS= TN=838 TMAX=1200 DMAX=0 SLOPE=zero SOURCE=gmond/ METRIC NAME=cpu_user VAL=73.1 TYPE=float UNITS=% TN=8 TMAX=90 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=cpu_system VAL=3.9 TYPE=float UNITS=% TN=8 TMAX=90 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=load_one VAL=1.99 TYPE=float UNITS= TN=9 TMAX=70 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=proc_run VAL=2 TYPE=uint32 UNITS= TN=149 TMAX=950 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=proc_total VAL=156 TYPE=uint32 UNITS= TN=149 TMAX=950 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=mem_free VAL=2359176 TYPE=uint32 UNITS=KB TN=28 TMAX=180 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=mem_shared VAL=0 TYPE=uint32 UNITS=KB TN=28 TMAX=180 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=mem_buffers VAL=36384 TYPE=uint32 UNITS=KB TN=28 TMAX=180 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=mem_cached VAL=4162056 TYPE=uint32 UNITS=KB TN=28 TMAX=180 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=swap_free VAL=1786428 TYPE=uint32 UNITS=KB TN=28 TMAX=180 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=gexec VAL=ON TYPE=string UNITS= TN=229 TMAX=300 DMAX=0 SLOPE=zero SOURCE=gmond/ METRIC NAME=bytes_out VAL=305162.19 TYPE=float UNITS=bytes/sec TN=28 TMAX=300 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=bytes_in VAL=40802.30 TYPE=float UNITS=bytes/sec TN=28 TMAX=300 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=numthreads VAL=1 TYPE=int8 UNITS= TN=844 TMAX=60 DMAX=0 SLOPE=both SOURCE=gmetric/ METRIC 
NAME=numjobs VAL=1 TYPE=int8 UNITS= TN=844 TMAX=60 DMAX=0 SLOPE=both SOURCE=gmetric/ /HOST -- Martin Knoblauch email: k n o b i
RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML
Eli, OK, the messages are coming from RRDTOOL, just telling you that you tried to update the same metric with exactly the same timestamp. Do you see any messages prefixed RRD_create in your logfiles? The problem is that if one of the rrd_updates fails, gmetad stops working on anything. Do you have a chance to rebuild gmetad with the following patch? It is against current CVS, but should apply against 3.0.2. If it helps, all hosts (metrics) except the one causing problems should be OK. It might not be the real solution, but may help us to track it down.
[gmetad]$ diff -udp rrd_helpers.c rrd_helpers.c-new
--- rrd_helpers.c 2005-03-15 19:11:33.0 +0100
+++ rrd_helpers.c-new 2006-03-30 11:28:26.0 +0200
@@ -54,7 +54,7 @@ RRD_update( char *rrd, const char *sum,
       {
          err_msg("RRD_update (%s): %s", rrd, rrd_get_error());
          pthread_mutex_unlock( &rrd_mutex );
-         return 1;
+         return 0;
       }
    /* debug_msg("Updated rrd %s with value %s", rrd, val); */
    pthread_mutex_unlock( &rrd_mutex );
In addition, do you see any messages prefixed RRD_create in your logfiles? You should, as some of the RRD files seem to be missing. Cheers Martin
-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
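Why a one-line change of a return value can rescue the rest of a cluster is easier to see in a toy model. The Python sketch below is not gmetad code; it only assumes, as described in the message above, that the caller gives up on everything after the first failing update, and the function and host names are hypothetical.

```python
def process_cluster(hosts, update, abort_on_error):
    """Return the hosts whose update succeeded.

    update(host) returns 0 on success and non-zero on failure, mirroring
    the C convention in the patch. With abort_on_error the loop stops at
    the first failure; otherwise the failing host is simply skipped.
    """
    updated = []
    for host in hosts:
        if update(host) != 0:
            if abort_on_error:
                break
            continue
        updated.append(host)
    return updated


if __name__ == "__main__":
    hosts = ["node1", "badhost", "node3"]
    failing_update = lambda host: 1 if host == "badhost" else 0
    print(process_cluster(hosts, failing_update, abort_on_error=True))
    print(process_cluster(hosts, failing_update, abort_on_error=False))
```

In this model the patch corresponds to switching abort_on_error off: the failure is still logged by err_msg, but because the function reports success, hosts that come after the problematic one keep being updated.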
RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML
Richard, [adding ganglia-developers for comments] pretty good explanation of what is likely happening, or what can go wrong. I sent Eli a patch I found useful a while ago, but which is not in CVS yet (because I fixed the root-problem of the illegal updates). This should prevent gmetad from ignoring all hosts/metrics if just one of them is corrupt. Somewhere in the code we go nuts on an error return. [gmetad]$ diff -udp rrd_helpers.c rrd_helpers.c-new --- rrd_helpers.c 2005-03-15 19:11:33.0 +0100 +++ rrd_helpers.c-new 2006-03-30 11:28:26.0 +0200 @@ -54,7 +54,7 @@ RRD_update( char *rrd, const char *sum, { err_msg(RRD_update (%s): %s, rrd, rrd_get_error()); pthread_mutex_unlock( rrd_mutex ); - return 1; + return 0; } /* debug_msg(Updated rrd %s with value %s, rrd, val); */ pthread_mutex_unlock( rrd_mutex ); --- [EMAIL PROTECTED] wrote: Eli, Martin is most surely right. If you are running an unpatched 3.0.2, let me share with you the many ways it can all go wrong. gmond generates the hostnames found in the XML stream by reverse DNS lookup only. Its internal structures treat every different IP address it sees as a different host, regardless of what the reverse DNS entry is. So, if you have 1) Incorrect reverse DNS entries such that 2 different hosts reverse map to the same hostname, 2) Or 2 NICs on a host that are not teamed (i.e. 2 different addresses) and the routing allows packets to exit either NIC, hence either source address may be used. 3) Or a DHCP lease renewal that results in a host changing IP addresses. Then what will happen is that the XML stream from the cluster will contain 2 (or more) entries with different IP addrs, but the same name. Even in the DHCP case when only 1 source address is used at a time, gmond will keep the old IP address entry until a timeout, even though it is not being updated. So dups arise again. Now unfortunately, gmetad only uses the HOSTNAME for the RRD files and its own processing. 
So if there is a duplicated hostname in the XML stream, it will update the RRDs after parsing the first entry, and then again after parsing the second. As these 2 updates to the same RRD files will occur in less than one second, this results in an RRD update error. On unpatched 3.0.2, this then causes THE ENTIRE PROCESSING OF THE CLUSTER TO BE ABANDONED. So some hosts get updated, some not, and the cluster view does not get updated. If you patch this particular issue, you will still get double processing for duped hosts, which can result in them erroneouly being reported as down (for example). phew. long mail. - richard -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Martin Knoblauch Sent: 30 March 2006 08:05 To: Eli Stair Cc: Ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML Eli, yup. That could definitely cause problems. Do you see anything in the /var/log/messages of the gmetad host? Hmm. You may have to restart *all* gmonds, as well as the gmetad. This is something that I usually do when my ganglia setup was hosed somehow. Definitely the case for multicast clusters. Not really sure about unicast. And yes - this is not optimal. --- Eli Stair [EMAIL PROTECTED] wrote: The only issue I can find at all with this config is that the new hosts have been deployed by someone with two PTR records, both the proper one pointing to the A hostname, as well as all having an improper PTR - linux.FQDN. Is there a potential that gmetad is doing a lookup of both the forward and reverse entries for a host before populating it? Unfortunately removing the invalid entry for a host and restarting gmetad as well as the gmond aggregator and the host did not resolve it. /eli Eli Stair wrote: My installation started having an issue yesterday afternoon that I have yet to explain or remedy. One cluster that I have unicasting, has started losing hosts... 
the directory entries on disk never get created for newly deployed hosts, and gmond reports receiving messages for the host (and outputs metrics) but gmetad does not report an updating host message, and never creates the RRD's even though the host is up. The critical problem is that the report graphs for this cluster have stopped being updated as well, which nix'es my ability to view cluster load/job level... in addition to not being able to alert on the RRD values for the individual hosts that are malfunctioning. Those hosts that are good continue to update their metric RRD's properly, their host reports are populated etc. The bad ones I cannot explain... The two questions, if anyone has insight: 1) What is causing
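The duplicate-hostname condition Richard describes (several HOST entries with different IP addresses but the same name) can be looked for directly in a saved copy of the gmond XML. The Python sketch below is only an illustration: it relies on the HOST element with its NAME and IP attributes as they appear in the dumps earlier in this thread, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def duplicate_hostnames(xml_text):
    """Map each host NAME that appears with more than one IP to its sorted IPs."""
    ips_by_name = defaultdict(set)
    root = ET.fromstring(xml_text)
    for host in root.iter("HOST"):
        ips_by_name[host.get("NAME")].add(host.get("IP"))
    return {name: sorted(ips)
            for name, ips in ips_by_name.items()
            if len(ips) > 1}
```

An empty result means no name is shared between addresses in that report; any entry that does show up is a candidate for the double RRD update within the same second that the thread identifies as the trigger.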
RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML
Eli, if the patch helps, I tend to put it into 3.0.3 (if CVS is working again :-( Martin --- Eli Stair [EMAIL PROTECTED] wrote: Richard, Martin, et al: Thanks for all your assistance describing the workings and why it is going wrong... the glomming together of all the host XML and the organization to disk of it has been quite a black box to me. You explain how this is can occur on an unpatched 3.0.2; is the recommended patch that which martin posted or is there something else suggested? I'll give his a shot, and if it isn't successfull try the CVS build. I've been trying to wait for 3.0.3 before making any more changes than just PHP interface stuff. Cheers, /eli -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Sent: Thu 3/30/2006 1:35 AM To: [EMAIL PROTECTED]; Eli Stair Cc: Ganglia-general@lists.sourceforge.net Subject: RE: [Ganglia-general] Re: gmetad not updating RRD's/hosts that areproper in gmond XML Eli, Martin is most surely right. If you are running an unpatched 3.0.2, let me share with you the many ways it can all go wrong. gmond generates the hostnames found in the XML stream by reverse DNS lookup only. Its internal structures treat every different IP address it sees as a different host, regardless of what the reverse DNS entry is. So, if you have 1) Incorrect reverse DNS entries such that 2 different hosts reverse map to the same hostname, 2) Or 2 NICs on a host that are not teamed (i.e. 2 different addresses) and the routing allows packets to exit either NIC, hence either source address may be used. 3) Or a DHCP lease renewal that results in a host changing IP addresses. Then what will happen is that the XML stream from the cluster will contain 2 (or more) entries with different IP addrs, but the same name. Even in the DHCP case when only 1 source address is used at a time, gmond will keep the old IP address entry until a timeout, even though it is not being updated. So dups arise again. 
Now unfortunately, gmetad only uses the HOSTNAME for the RRD files and its own processing. So if there is a duplicated hostname in the XML stream, it will update the RRDs after parsing the first entry, and then again after parsing the second. As these 2 updates to the same RRD files will occur in less than one second, this results in an RRD update error. On unpatched 3.0.2, this then causes THE ENTIRE PROCESSING OF THE CLUSTER TO BE ABANDONED. So some hosts get updated, some not, and the cluster view does not get updated. If you patch this particular issue, you will still get double processing for duped hosts, which can result in them erroneouly being reported as down (for example). phew. long mail. - richard -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Martin Knoblauch Sent: 30 March 2006 08:05 To: Eli Stair Cc: Ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Re: gmetad not updating RRD's/hosts that are proper in gmond XML Eli, yup. That could definitely cause problems. Do you see anything in the /var/log/messages of the gmetad host? Hmm. You may have to restart *all* gmonds, as well as the gmetad. This is something that I usually do when my ganglia setup was hosed somehow. Definitely the case for multicast clusters. Not really sure about unicast. And yes - this is not optimal. --- Eli Stair [EMAIL PROTECTED] wrote: The only issue I can find at all with this config is that the new hosts have been deployed by someone with two PTR records, both the proper one pointing to the A hostname, as well as all having an improper PTR - linux.FQDN. Is there a potential that gmetad is doing a lookup of both the forward and reverse entries for a host before populating it? Unfortunately removing the invalid entry for a host and restarting gmetad as well as the gmond aggregator and the host did not resolve it. 
/eli Eli Stair wrote: My installation started having an issue yesterday afternoon that I have yet to explain or remedy. One cluster that I have unicasting, has started losing hosts... the directory entries on disk never get created for newly deployed hosts, and gmond reports receiving messages for the host (and outputs metrics) but gmetad does not report an updating host message, and never creates the RRD's even though the host is up. The critical problem is that the report graphs for this cluster have stopped being updated as well, which nix'es my ability to view cluster load/job level... in addition to not being able to alert on the RRD values for the individual hosts that are malfunctioning. Those hosts that are good continue to update their metric RRD's properly, their host reports are populated etc. The bad ones I cannot explain... The two
[Ganglia-general] Ganglia 3.0.3 released
Hi, for those who do not track the SF release system, I have today released version 3.0.3 of Ganglia. The home page will be changed accordingly. Files can be downloaded from the SourceForge site. Source is available as tarball and SRPM. Binary RPMs have been built for RedHat FC4/i386. Development of version 3.0.4 is now open. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Ganglia 3.0.2 on Solaris 9
Hi Aravindh, a few questions:
- which gcc are you using (gcc --version)? I can build 3.0.2 and 3.0.3 on Solaris 8 with gcc-3.3.1.
- which make (gnu-make is recommended)?
- On 64-bit platforms configure using: CC="gcc -m64" ./configure
Oh yes, please try 3.0.3. Released yesterday :-) Martin
--- [EMAIL PROTECTED] wrote: Hi all, I am getting the following error while installing Ganglia 3.0.2 on a Solaris 9 machine. The error is as follows: #PosixConnector#make make all-recursive Making all in srclib Making all in libmetrics make all-recursive Making all in solaris if /bin/bash ../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I. -I.. -I/tmp/rrdtool/lb/include -I/tmp/rrdtool/lb/include/libart-2.0 -I/tmp/rrdtool/lb/include/freetype2 -I/tmp/rrdtool/lb/include/libpng -I.. -I../lib -O3 -D__EXTENSIONS__ -D_POSIX_C_SOURCE=199506L -DHAVE_STRERROR -MT metrics.lo -MD -MP -MF .deps/metrics.Tpo -c -o metrics.lo metrics.c; \ then mv -f .deps/metrics.Tpo .deps/metrics.Plo; else rm -f .deps/metrics.Tpo; exit 1; fi mkdir .libs gcc -DHAVE_CONFIG_H -I. -I. -I.. -I/tmp/rrdtool/lb/include -I/tmp/rrdtool/lb/include/libart-2.0 -I/tmp/rrdtool/lb/include/freetype2 -I/tmp/rrdtool/lb/include/libpng -I.. 
-I../lib -O3 -D__EXTENSIONS__ -D_POSIX_C_SOURCE=199506L -DHAVE_STRERROR -MT metrics.lo -MD -MP -MF .deps/metrics.Tpo -c metrics.c -fPIC -DPIC -o .libs/metrics.o metrics.c:167: error: static declaration of 'ncpus' follows non-static declaration /usr/include/sys/cpuvar.h:351: error: previous declaration of 'ncpus' was here metrics.c: In function 'percentages': metrics.c:306: warning: pointer targets in assignment differ in signedness *** Error code 1 make: Fatal error: Command failed for target `metrics.lo' Current working directory /opt/ganglia/ganglia-3.0.2/srclib/libmetrics/solaris *** Error code 1 make: Fatal error: Command failed for target `all-recursive' Current working directory /opt/ganglia/ganglia-3.0.2/srclib/libmetrics *** Error code 1 make: Fatal error: Command failed for target `all' Current working directory /opt/ganglia/ganglia-3.0.2/srclib/libmetrics *** Error code 1 make: Fatal error: Command failed for target `all-recursive' Current working directory /opt/ganglia/ganglia-3.0.2/srclib *** Error code 1 make: Fatal error: Command failed for target `all-recursive' Current working directory /opt/ganglia/ganglia-3.0.2 *** Error code 1 make: Fatal error: Command failed for target `all' #PosixConnector#pwd /opt/ganglia/ganglia-3.0.2 But the same thing is working fine on LinuxAny idea of how to solve this or any hints that would take me out of this... Thanks and Regards ARAVINDH VARADHARAJU (WTO1 - E-Enabling) Project Engineer Tel : +91- 80- 2852 0408 Extn.1053 Mobile : +91- 99860 17606 A Smile can take you MILES..! Keep Smiling and Have a Nice Day -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
RE: [Ganglia-general] Nodes Reported as Dead
as dead, even though they are not. Doing a `telnet computer 8649` gives the appropriate data. Get Fresh Data will usually change out which nodes are dead and given a 30min cycle most will have switched.
2) Even though this has been running for many hours, some of the alive nodes report inaccuracies. Like one node for example:
Last heartbeat received -209998 seconds ago
Uptime -975 days, 16:27:49
Swap: Using 0.0 of -100Mb
Booted: January 1, 1970
The inaccuracies change every so often and it will report correctly for a while. Most of those I don't care about but I think it may be a related problem.
3) The dead nodes are almost all spot on with their stats, and if you go to the node view and click the Get Fresh Data the Load and CPU Utilization do update in sync even though it's reported as dead.
Maybe I missed the keywords, but I was not able to find anything quite like this in the email archive. I would be very grateful if anyone has any clues as to what may be going on. Thank you for your time, Chris Stackpole -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] unable to write XML tree info
James, cool. No need to be sorry. This is actually valuable information, as this may hit others as well. How did you find out, and where exactly is the php_value located in the config files? Thanks Martin --- James Trater [EMAIL PROTECTED] wrote: I figured it out. I had assumed that it was a gmetad problem, but it turned out to be a problem with PHP - specifically the amount of memory that PHP is allowed to allocate. I put this in my apache config file for ganglia: php_value memory_limit 32M and it works fine now. Sorry! Jim -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Fwd: compile ganglia 3.0.3 on SLES9 x86_64
Hi Bernard, --- Bernd Wenger [EMAIL PROTECTED] wrote: /usr/lib64/gcc-lib/x86_64-suse-linux/3.3.3/../../../../x86_64-suse-linux/bin/ld: cannot find -lpng collect2: ld returned 1 exit status make[2]: *** [gmetad] Error 1 make[2]: Leaving directory `/tmp/ganglia-3.0.3/gmetad' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/tmp/ganglia-3.0.3' make: *** [all] Error 2 you need to check whether you have libpng-devel installed. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
RE: [Ganglia-general] Fwd: compile ganglia 3.0.3 on SLES9 x86_64
Bernard, in addition to making libpng-devel a requirement for RPM builds, we also should/need-to check in configure. Apparently configure is a bit sloppy wrt. building gmetad. Any autoconf takers? :-) Martin --- Bernard Li [EMAIL PROTECTED] wrote: Hi Bernd: I had issues building on SLES9 x64 due to an issue with lib vs lib64 but I don't think that's your problem. It said that it cannot find -lpng - do you need to install something like libpng-devel on SLES? If that's a requirement to build on SLES, I could update the spec file after we migrate our code repository from CVS to SVN. Cheers, Bernard -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Multicast issue on systems with multiple interfaces
Vladimir, I am afraid this is broken since 3.0.0 (or when we moved to apr). Matt wanted to look into it. Martin --- Vladimir Vuksan [EMAIL PROTECTED] wrote: We just upgraded our Ganglia cluster to 3.0.3 from 2.5.7. All of the systems have dual network interfaces. Most of the network traffic goes over eth1 interface whereas the control messages etc. go over eth0. In 2.5.7 we specified mcast_if to be eth0 and that works well. In 3.0.3 even though eth0 is specified multicast traffic goes over eth1. Only way we have been able to resolve it is to add a manual route for 239.0.0.0. Any clues about this? Vladimir -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Ganglia 3.0.3 compilation on AIX 5.2
Hi Knut, there is supposed to be a README.AIX file in the 3.0.3 distribution. This explains a few things. Basically, building with xlc is not supported. There are a few hints on how to do it under 2.) And you absolutely need to build non-shared. That is most likely where your core-dump comes from. Explained under 1) Cheers Martin
--- Knut Hellebø [EMAIL PROTECTED] wrote: Regards, I'm trying to compile Ganglia 3.0.3 on an AIX 5.2 box using the native IBM compiler and have encountered two problems compiling and one fatal when running gmond.
Compilation problems:
1. The compilation breaks on the file ./srclib/confuse/src/lexer.c at line 786 which stems from the lex file lexer.l line 82:
#line 82 "lexer.l"
cfg->line++; /* keep track of line number */
YY_BREAK
saying undeclared identifier cfg. I put in a cfg_t *cfg; declaration in line 696 and then the compilation proceeds.
2. Also, I need to use the -qcpluscmt switch allowing C++ comment style or else the compilation bombs in gmond.c
3. Running gmond always crashes with a SIGSEGV. The trace shows that the crash occurs when opening the /etc/gmond.conf file. A dbx session on the core file shows the crash seems to be related to the parser file fix I did in section 1. above. Here's the backtrace:
(dbx) where
cfg_yylex() at 0x1000af28
cfg_parse_internal() at 0x1000821c
cfg_parse_fp() at 0x1000a5a0
cfg_parse() at 0x1000a684
Ganglia_gmond_config_create() at 0x10006d58
process_configuration_file() at 0x100036dc
main() at 0x14b4
What's up here ? -- Knut Hellebø, Hydro IS Partner ESI (Unix) Team, E-mail: [EMAIL PROTECTED] -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] What's the meaning of Cached memory and Buffered memory
Hi Yongsheng, the meaning of cached/buffered depends on the architecture. If you are on Linux, cached describes the amount of memory that is used for the page cache, which usually means the pages used to speed up IO operations. It will not go down unless there is pressure for memory from other applications. buffered (in Linux) counts the pages used for filesystem meta-data (like directories). Cheers Martin --- Zhao, Yongsheng [EMAIL PROTECTED] wrote: Hello, When my application is running, the Memory cached is going up all the way to the top. And it does not return when the application is done. Does anyone know what the Memory cached is exactly, and also what Memory buffered is? Thanks. Yongsheng -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
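For anyone who wants to watch these two values outside of Ganglia, here is a minimal C sketch. It assumes the usual Linux /proc/meminfo layout of "Name: value kB" lines; the function names meminfo_field_kb and read_meminfo are made up for this example and are not part of libmetrics.

```c
#include <stdio.h>
#include <string.h>

/* Look up one field (e.g. "Cached" or "Buffers") in text laid out like
 * Linux /proc/meminfo, where each line reads "Name:   12345 kB".
 * Returns the value in kB, or -1 if the field is not present.
 * (Hypothetical helper for illustration only.) */
long meminfo_field_kb(const char *text, const char *name)
{
    size_t len = strlen(name);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        /* The name must start the line and be followed directly by ':',
         * so "Cached" does not match the "SwapCached" line. */
        if (strncmp(line, name, len) == 0 && line[len] == ':') {
            long kb;
            if (sscanf(line + len + 1, "%ld", &kb) == 1)
                return kb;
            return -1;
        }
        line = strchr(line, '\n');
        if (line != NULL)
            line++;            /* step past the newline */
    }
    return -1;
}

/* Read /proc/meminfo into buf; returns 0 on success, -1 on failure. */
int read_meminfo(char *buf, size_t size)
{
    FILE *fp = fopen("/proc/meminfo", "r");
    size_t n;

    if (fp == NULL || size == 0) {
        if (fp != NULL)
            fclose(fp);
        return -1;
    }
    n = fread(buf, 1, size - 1, fp);
    buf[n] = '\0';
    fclose(fp);
    return 0;
}
```

Typical use would be to call read_meminfo() into a buffer of a few kilobytes and then ask meminfo_field_kb() for "Cached" and "Buffers" before and after the application runs.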
RE: [Ganglia-general] What's the meaning of Cached memory and Buffered memory
Hi Yongsheng, the only sure way to get it down is a reboot. Another way may be to unmount/mount all filesystems (which does not work for / :-) But there is no need to worry about cached. It will go away automatically if an application wants the memory. Oh - you could write an application that mallocs lots of memory (as much as you have). This will push away the cache. On exit, the application memory is freed. But as I said, everything is fine. Cheers Martin --- Zhao, Yongsheng [EMAIL PROTECTED] wrote: Hello, Martin: Thanks for the answer. We are on Linux. Are there commands or utilities which can reset the cached memory to its original value? Thanks. Yongsheng -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
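The "application that mallocs lots of memory" could look roughly like the C sketch below. The function name touch_memory is invented for this example. It writes to one byte per page through a volatile pointer, on the assumption that the kernel only backs pages with real memory once they are touched. Asking for as much memory as the machine has may well trigger swapping or the OOM killer, so treat it as an experiment, not a maintenance tool.

```c
#include <stdlib.h>
#include <unistd.h>

/* Allocate `bytes` of memory and write to every page so that the kernel
 * has to back the allocation with real RAM; if free memory is short this
 * pushes page-cache pages out. The memory is released before returning.
 * Returns 0 on success, -1 if the allocation failed.
 * (Hypothetical helper for illustration only.) */
int touch_memory(size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);
    volatile char *p;
    size_t i;

    if (bytes == 0)
        return 0;
    if (page <= 0)
        page = 4096;           /* fall back to a common page size */

    p = malloc(bytes);
    if (p == NULL)
        return -1;

    /* The volatile writes keep the compiler from optimizing the loop away. */
    for (i = 0; i < bytes; i += (size_t)page)
        p[i] = 1;

    free((void *)p);
    return 0;
}
```

A call such as touch_memory(512UL * 1024 * 1024) on a test box, followed by a look at the cached value, shows the effect Martin describes.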
Re: [Ganglia-general] New issue with hosts reporting
Hi Mark, have you configured a tcp_accept_channel for each of your two clusters' master gmonds? Then you may need to define an acl for your gmetad server. Something like:
tcp_accept_channel {
  port = 8649
  acl {
    default = deny
    access {
      ip = ip-of-the-gmetad-server
      mask = 32
      action = allow
    }
  }
}
Cheers Martin
--- Mark Haney [EMAIL PROTECTED] wrote: David Zaltron wrote: Probably you have a gmond configuration on each node that multicasts the cluster status to every node. For example, if you have a configuration like this in the nodes:
cluster {
  name = dummy_cluster
}
udp_send_channel {
  mcast_join = 239.2.11.71
  port = 8649
}
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
  bind = 239.2.11.71
}
This means that every node knows it belongs to the dummy_cluster, and every gmond can return the status of the entire cluster when asked on the default 8649 TCP port, because it knows about every other node (they all talk to each other on the same multicast channel). You can find the solution unicasting the traffic between the node itself:
udp_send_channel {
  host = hostname of 127.0.0.1
  port = 8649
}
udp_recv_channel {
  port = 8649
}
In this way you can simulate a cluster of a single node, monitoring in reality the single node.
Okay, I did that and that /sort of/ fixed it, except for now I do not see the nodes in my web interface. Keep in mind the web interface is running on a completely separate box that's not either newton or winterstar. So, how do I get the node showing up in the web interface now? (And David, I apologize for sending to you and not the list, my fingers got ahead of me today.) -- Fere libenter homines id quod volunt credunt. Mark Haney Sr. Systems Administrator ERC Broadband (828) 350-2415 -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
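To illustrate what an ip plus mask = 32 entry means in such an acl: a mask of 32 bits matches exactly one IPv4 address, while a shorter mask matches a whole network. The C sketch below shows the usual prefix comparison. It is only an illustration of the semantics under that assumption, not gmond's actual code, and the function name ipv4_in_network is made up.

```c
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Return 1 if `addr` falls inside `net`/`mask_bits`, 0 if it does not,
 * and -1 if an argument cannot be parsed. IPv4 only.
 * (Hypothetical helper for illustration only.) */
int ipv4_in_network(const char *addr, const char *net, int mask_bits)
{
    struct in_addr a, n;
    uint32_t mask;

    if (mask_bits < 0 || mask_bits > 32)
        return -1;
    if (inet_pton(AF_INET, addr, &a) != 1 ||
        inet_pton(AF_INET, net, &n) != 1)
        return -1;

    /* A shift by 32 is undefined for a 32-bit value, so handle 0 bits apart. */
    mask = (mask_bits == 0) ? 0 : (uint32_t)(0xFFFFFFFFu << (32 - mask_bits));

    /* Compare only the leading mask_bits bits, in host byte order. */
    return ((ntohl(a.s_addr) ^ ntohl(n.s_addr)) & mask) == 0;
}
```

With mask_bits = 32 only the gmetad server's own address passes, so every other client falls through to the default action (deny in the example above).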
Re: [Ganglia-general] Ganglia History
Adam, that is unexpected. The RRDs are supposed to keep one year (the default) of history. Martin --- Adam Brust [EMAIL PROTECTED] wrote: I recently had to reboot the Front End of my cluster... upon the reboot, my Ganglia history is gone... Ganglia is only keeping data from the time of the reboot... it was nearly a year's worth of history... can anyone offer any suggestions? thanks, -adam -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Ganglia History
Adam, do you still have those error messages? And: which version of the web-frontend are you using? We fixed quite a few of the php messages in 3.0.3. Martin
--- Adam Brust [EMAIL PROTECTED] wrote: At the beginning of the month, ganglia/php were producing massive amounts of httpd errors which filled up my / partition causing the machine to crash... since then, I believe my ganglia history has been affected... I tried to restore from the three tar files located in /var/lib/ganglia/archives/ and each one only had about a week's worth of history... I was able to restore from an earlier backup, which has my previous history, although now I am missing roughly these last three weeks. Also, I'm not certain if the problem is corrected now... I don't know if I'll lose this history again upon a reboot. -adam
Martin Knoblauch wrote: Adam, that sounds OK. Do you see any messages in either /var/log/messages or in your webserver's log files? Martin
--- Adam Brust [EMAIL PROTECTED] wrote: Ian, Thanks for your reply. My rrd files appear to be in the default /var/lib/ganglia directory, I could not find any other instances of them. gmetad is running as nobody and the rrds are owned by nobody... do you know if that's the correct user/permissions? thanks, adam
Ian Cunningham wrote: Look at where gmetad is storing the rrd files now. You can find it in your gmetad.conf under rrd_rootdir. Maybe you didn't specify it for -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] not showing all hosts
--- Ian Cunningham [EMAIL PROTECTED] wrote: Solution B: increase the Time To Live or ttl on the gmond multicast packets. This assumes that multicast packets can get from one vlan to the other. The configuration option used to be available in the 2.x codebase, but I don't see it in 3.0.x code. I think it would be mcast_ttl but I can't say if that will work or not. it is ttl in the udp_send_channel section. It will be used, if mcast_join is set. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] changed ip
Hi Toney, my first guess would be that you are: a) using multicast and b) your default gateway goes via eth0 c) your compute nodes are on the 192.168.180.x network. After the change the MC packets are still expected via eth0, but come in from eth1. Try adding this from the documentation: mcast_if=eth1 in your head node's gmond.conf, and add a host route: route add -host 239.2.11.71 dev eth1 Hope this helps Martin
--- toney samuel [EMAIL PROTECTED] wrote: I have a 4 node cluster. My head node has two gigabit cards and an infiniband card. My cluster IPs are:
eth0 192.168.180.17/255.255.252.0
ipoib0 192.168.0.1/255.255.255.0
I have installed ganglia with this configuration. ganglia was working properly. Later I changed my network configuration to this:
eth0 192.168.1.1/255.255.255.0
eth1 192.168.180.17/255.255.252.0
ipoib0 192.168.0.1/255.255.255.0
Now I can't see any information in my web page. Please advise how to resolve this issue. Regards. -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Troubles linking: Linux (SUSE 9.3) on Itanium (ia64, Altix)
On a RedHat-ish distro you would need to check that the RPMs for libpng *and* libpng-devel are installed. Not sure about SuSE though. Martin --- Ryurick Marius Hristev [EMAIL PROTECTED] wrote: Hello, I was trying to compile the ganglia package (rpm version) on the following system: SuSE 9.3 (Linux) running on Itaniums (ia64, SGI Altix ) and I am getting this error: gcc -O0 -I../lib -I../gmond -I../srclib/expat/lib/ -g -O2 -Wall -D_REENTRANT -o gmetad gmetad.o cmdline.o data_thread.o server.o process_xml.o rrd_helpers.o conf.o type_hash.o xml_hash.o cleanup.o ../lib/.libs/libganglia.a /usr/lib/librrd.a -lpng -lz -lm ../srclib/expat/lib/.libs/libexpat.a -ldl -lresolv -lnsl -lpthread /usr/lib/gcc-lib/ia64-suse-linux/3.3.3/../../../../ia64-suse-linux/bin/ld: cannot find -lpng but I do have a /usr/lib/libpng.so.3 Are there any known quirks with respect to my OS/Distro and CPU/Machine ? (I am new to this one, apologies if I missed something obvious). TIA Cheers, -- Ryurick M. Hristev -- Systems Administrator (Unix) University of Queensland -- ITS Dept. mailto: [EMAIL PROTECTED] the greatest hacking experience: hack your own mind -- me -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Obtaining Immediate Interval Data From Ganglia
Correct. The code below limits the sampling rate for the cpu*, load*, mem* and net* graphs. Setting them to 0 will give you 1 second accuracy. Or nice furry graphs as Richard said (actually the furriness is what the original authors wanted to prevent :-). Personally I doubt that sampling load* and mem* at that rate makes sense. cpu* and net* may make sense. Richard, yes please file a report. Unfortunately I spoke too soon when I mentioned that we should get rid of the intervals altogether. The reason is that we need to compute differences for the cpu* and net* metrics (they are rates after all). If we want to have sub-second sampling rates, we need to use gettimeofday instead of time.
--- [EMAIL PROTECTED] wrote: If you do want to do fast polling on the Linux or cygwin gmond, I found some hardwired code in there which effectively limits the polling rate for some metrics no matter what you put in the config files. (Sorry martin, have not raised a bug report yet). Anyway: the code below is in the cygwin and linux metric.c files.
  typedef struct {
    uint32_t last_read;
    uint32_t thresh;
    char *name;
    char buffer[BUFFSIZE];
  } timely_file;

  timely_file proc_stat    = { 0, 15, "/proc/stat" };
  timely_file proc_loadavg = { 0, 15, "/proc/loadavg" };
  timely_file proc_meminfo = { 0, 30, "/proc/meminfo" };
  timely_file proc_net_dev = { 0, 30, "/proc/net/dev" };

  char *update_file(timely_file *tf)
  {
    int now,rval;
    now = time(0);
    if(now - tf->last_read > tf->thresh) {
      rval = slurpfile(tf->name, tf->buffer, BUFFSIZE);
      if(rval == SYNAPSE_FAILURE) {
        err_msg("update_file() got an error from slurpfile() reading %s", tf->name);
        return (char *)SYNAPSE_FAILURE;
      }
      else tf->last_read = now;
    }
    return tf->buffer;
  }
I have set those timeout values to zero, which works well and gives me nice spiky furry graphs. - richard -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
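A sub-second variant of the file cache along the lines Martin mentions (gettimeofday instead of time) might look like the C sketch below. This is not the change that went into Ganglia. The names timely_file_us, update_file_us and elapsed_seconds are invented, the BUFFSIZE value is assumed, and plain stdio stands in for gmond's own slurpfile() and err_msg() helpers.

```c
#include <stdio.h>
#include <sys/time.h>

#define BUFFSIZE 8192          /* assumed size; the real value is not shown in the thread */

typedef struct {
    struct timeval last_read;  /* time of the last successful read */
    double thresh;             /* minimum age in seconds before re-reading */
    const char *name;          /* file to read, e.g. "/proc/stat" */
    char buffer[BUFFSIZE];
} timely_file_us;

/* Seconds elapsed between two timevals, with microsecond resolution. */
double elapsed_seconds(const struct timeval *from, const struct timeval *to)
{
    return (double)(to->tv_sec - from->tv_sec)
         + (double)(to->tv_usec - from->tv_usec) / 1e6;
}

/* Re-read the file only if the cached copy is older than tf->thresh seconds.
 * Returns the buffer, or NULL if the file could not be opened. */
char *update_file_us(timely_file_us *tf)
{
    struct timeval now;

    gettimeofday(&now, NULL);
    if (elapsed_seconds(&tf->last_read, &now) > tf->thresh) {
        FILE *fp = fopen(tf->name, "r");
        size_t n;

        if (fp == NULL)
            return NULL;
        n = fread(tf->buffer, 1, BUFFSIZE - 1, fp);
        tf->buffer[n] = '\0';
        fclose(fp);
        tf->last_read = now;   /* only advance the timestamp on success */
    }
    return tf->buffer;
}
```

Keeping the timestamp as a struct timeval also gives the rate metrics (cpu*, net*) an accurate time difference to divide by, which is the reason Martin gives for not dropping the intervals entirely.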
Re: [Ganglia-general] monitoring
Nagios? Cheers Martin --- Dirk Roessler [EMAIL PROTECTED] wrote: Does someone know an easy to install and easy to use solution for monitoring and sending email notifications of down nodes and health state on a Linux HPC cluster? Dirk -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Ganglia scaling testing?
-- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] First Snapshot for 3.0.4
--- Bernard Li [EMAIL PROTECTED] wrote: It is the first release after moving from CVS to SVN. Changes compared to 3.0.3 are:
- Fix bz #110 by allowing higher sampling rates for cpu/net/load/mem in Linux/Cygwin. Likely needs similar changes in other platforms.
- Add Yemis Host-Spoofing patch (bz #99)
- Fix bz #77 (Diskless NFS Root not treated correctly)
- Compile fixes for IRIX (bz #73/79)
- Fix locking problems in gmetad (bz #56)
- Fix incorrect writing of RRDs (bz #105)
- Increases the number of rows in newly created RRAs (bz #33)
- Better handling of bonding interfaces in Linux (bz #102/104)
- Fix for network metrics overrun by Andreas Schoenfeld in AIX
- SVN related cleanups in distribution targets
- Take some of the proposed AIX changes from Michael Perzl. The real stuff will come in 3.1.x
I would also add:
- Better RPM support for SUSE Linux 10.0/10.1 x86 and x86_64
Cheers, Bernard
Oops. Sorry. Yes, the list is not necessarily complete. I should also have mentioned the generated ChangeLog, which gives some more info. Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Problem with metrics
--- Ben Hartshorne [EMAIL PROTECTED] wrote: On Tue, Sep 19, 2006 at 03:11:26PM +0200, Rafal Masztalerz wrote: Hi I added some new metrics for my ganglia software using the gmetric command. When I run the webpage without parameters: http://computer/ganglia/ everything seems to be ok and I can choose my new metrics. But when I try to do other things on this page, for example, when I choose some metric (bytes_out), then my new metrics are missing from the new/refreshed page. http://computer/ganglia/?m=bytes_outr=hours=descendingc=comph=sh=1hc=4
Rafael, Be careful that your metric only sends numbers. In some versions of ganglia, if your script that reports the gmetric accidentally sends letters instead, Bad Things(tm) happen. I wrote a script to parse the output of 'who' to count the number of logged in users, but I did it wrong. Occasionally it got a word instead of a number. This caused unexplained metric-loss throughout my ganglia installation. A newer version of gmetric fixed this problem, but it is a good place to start looking. I'm sorry, but I don't remember what versions are affected. -ben -- Ben Hartshorne email: [EMAIL PROTECTED] http://ben.hartshorne.net
The fix for the gmetric bug went in on 25-Jan-2006. So, it should be in 3.0.3. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
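On installations that still run an affected gmetric, a script can guard against the problem Ben describes by checking the value before it is sent. A minimal C sketch of such a check follows. The function name is_numeric_value is made up, and it accepts anything strtod parses completely, which includes forms such as "1e3" or "nan" that a stricter check might want to reject.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Return 1 if `s` parses completely as a number with strtod (leading and
 * trailing whitespace allowed), 0 otherwise.
 * (Hypothetical helper for illustration only.) */
int is_numeric_value(const char *s)
{
    char *end;

    if (s == NULL)
        return 0;

    errno = 0;
    (void)strtod(s, &end);
    if (end == s || errno == ERANGE)
        return 0;               /* nothing parsed, or value out of range */

    while (isspace((unsigned char)*end))
        end++;                  /* tolerate a trailing newline from a pipe */

    return *end == '\0';        /* reject trailing junk such as "3 users" */
}
```

A wrapper around the 'who' parsing would call this on the computed value and skip the gmetric call (or log a warning) when it returns 0.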
[Ganglia-general] New/Last Snapshot for 3.0.4
Hi, please have a look at the 2nd 3.0.4 snapshot located at: http://www.knobisoft.de/ganglia/ganglia-3.0.4.200609241751.tar.gz This snapshot contains the following changes compared to the last one:
- fixup of the corrupted JPG images
- move libmetrics to top-level in order to prepare removal of external sources in 3.1
- fix a stray debug message going to STDOUT instead of STDERR
- fix two stupid HP-UX syntax errors reported ages ago
The full list of changes is in the ChangeLog. There has not been a lot of feedback since the first snapshot. If nothing serious comes up during the next week, I will push out 3.0.4. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] [Ganglia-developers] Ganglia 3.0.4 released
--- Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote: On Mon, Dec 25, 2006 at 02:32:30AM -0800, Martin Knoblauch wrote: Ho ho ho, Santa just released version 3.0.4 of Ganglia. This is mainly a bugfix release. See the ChangeLog in the tarball for a complete list of changes. thanks Santa, and I got to be the first kid that went to the sourceforge tree for the nicely wrapped package :) which was far nicer than that Wii that Matt is probably still waiting to get a hold of. since I was running tests on the last SVN anyway, I got some more platforms where gmond/gmetric (and therefore libmetrics) were tested (*): * Gentoo Linux 2006.1 (amd64), Fedora Core 6 (i386) * Solaris 9 (sparc), Solaris 10 (i386, amd64 and sparc) * NetBSD 2.0.2 (i386), NetBSD 3.0 (i386), NetBSD 3.1 (i386, amd64) * FreeBSD 6.1 (amd64) Hi Carlo, thanks for the feedback. Could you just tell us which toolchains were used on the non-Linux platforms? Especially which compiler? Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Ganglia+OpenBSD?
Jason, apparently configure fails to realize that you are on OpenBSD, which is not supported currently. The unknown part is telling. In order to support OpenBSD one needs to fix the recognition process in configure and add OpenBSD-specific metrics code to libmetrics. So I am afraid that it is not as easy as you believe. Btw. what is the output of config/config.guess? Cheers Martin --- Jason Faulkner [EMAIL PROTECTED] wrote: Anybody have even a direction to point me in? I'm at my wits end. Jason Faulkner wrote: I've been trying all morning (about 5 hours now, heh) to get Ganglia 3.0.3 to compile on OpenBSD to no avail. Here's the error it spits at me: ./configure --prefix=/opt ran without a hitch, but when I said make... /bin/sh ../libtool --tag=CC --mode=link /usr/bin/gcc -I.. -I. -I../srclib/expat/lib/ -I../srclib/libmetrics/ -I../srclib/apr/include/ -I../srclib/apr/include/arch/unix/ -I../srclib/confuse/src -g -O2 -Wall-o libganglia.la -rpath /opt/lib -version-info 0:0:0 -release 3.0.3 -export-dynamic become_a_nobody.lo debug_msg.lo daemon_init.lo file.lo dotconf.lo error.lo ganglia.lo hash.lo inetaddr.lo llist.lo my_inet_ntop.lo rdwr.lo readdir.lo tcp.lo protocol_xdr.lo apr_net.lo libgmond.lo -lkvm -lresolv -lpthread *** Warning: linker path does not have real file for library -lresolv. *** I have the capability to make that library automatically link in when *** you link to this library. But I can only do this if you have a *** shared version of the library, which you do not appear to have *** because I did check the linker path looking for a file starting *** with libresolv and none of the candidates passed a file format test *** using a regex pattern. Last file checked: /usr/lib//libresolv.a *** The inter-library dependencies that have been dropped here will be *** automatically added whenever a program is linked with this library *** or is declared to -dlopen it. 
/usr/bin/gcc -shared -fPIC -DPIC -o .libs/libganglia-3.0.3.so.0.0 .libs/become_a_nobody.o .libs/debug_msg.o .libs/daemon_init.o .libs/file.o .libs/dotconf.o .libs/error.o .libs/ganglia.o .libs/hash.o .libs/inetaddr.o .libs/llist.o .libs/my_inet_ntop.o .libs/rdwr.o .libs/readdir.o .libs/tcp.o .libs/protocol_xdr.o .libs/apr_net.o .libs/libgmond.o -lkvm -lpthread (cd .libs rm -f libganglia.so.0.0 ln -s libganglia-3.0.3.so.0.0 libganglia.so.0.0) ar cru .libs/libganglia.a become_a_nobody.o debug_msg.o daemon_init.o file.o dotconf.o error.o ganglia.o hash.o inetaddr.o llist.o my_inet_ntop.o rdwr.o readdir.o tcp.o protocol_xdr.o apr_net.o libgmond.o ranlib .libs/libganglia.a creating libganglia.la (cd .libs rm -f libganglia.la ln -s ../libganglia.la libganglia.la) Making all in srclib Making all in libmetrics make all-recursive Making all in unknown /bin/sh: cd: /usr/src/ganglia-3.0.3/srclib/libmetrics/unknown - No such file or directory *** Error code 1 Stop in /usr/src/ganglia-3.0.3/srclib/libmetrics (line 342 of Makefile). *** Error code 1 Stop in /usr/src/ganglia-3.0.3/srclib/libmetrics (line 204 of Makefile). *** Error code 1 Stop in /usr/src/ganglia-3.0.3/srclib (line 243 of Makefile). *** Error code 1 Stop in /usr/src/ganglia-3.0.3 (line 332 of Makefile). *** Error code 1 Stop in /usr/src/ganglia-3.0.3 (line 214 of Makefile). This is on OpenBSD 3.8. -- Jason Faulkner Systems Manager Broadwick Corporation (919) 459-2509 -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Ganglia+OpenBSD?
--- Jason Faulkner [EMAIL PROTECTED] wrote: http://j.oldos.org/configguess.txt I feel less than smart. You wanted this, didn't you: :-) [EMAIL PROTECTED]:/usr/src/ganglia-3.0.3/config$ ./config.guess i386-unknown-openbsd3.8 guess this explains the unknown. But from the other follow-ups there seems to be hope for you. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Ganglia+OpenBSD?
--- Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote: On Tue, Dec 26, 2006 at 02:38:01PM -0500, Jason Faulkner wrote: Ooops -- sent first email directly to Martin instead of list. Martin Knoblauch wrote: Jason, apparently configure fails to realize that you are on OpenBSD, which is not supported currently. The unknown part is telling. I thought that might be the case. In order to support OpenBSD one needs to fix the recognition process in configure and add OpenBSD-specific metrics code to libmetrics. I'm confused though, according to this page: http://sourceforge.net/projects/ganglia/ ganglia runs on all openbsd platforms. I was going on the, apparently false, presumption that this meant the libmetrics code already existed for openbsd. not in 3.0.4, but I have a rough version that will hopefully be merged for 3.0.5 and that so far compiles and works (not all metrics though) on the hosts I have to test: OpenBSD 3.7 (i386) OpenBSD 4.0 (i386 and amd64) IANAP, but if there's anything I can do to help get this working on OpenBSD, let me know. what versions/arch are you interested in? Would you be able to deploy test snapshots of ganglia on them? Carlo Carlo, I see no problem adding OpenBSD support in 3.0.5. Just go on and check it in once you are satisfied with your stuff. Just out of curiosity: how similar are the BSD flavours? We already have NetBSD and FreeBSD support in. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] gmond problem on SLES 10 x64 with floats
Hi Ludovic, do you happen to have some stange/unusual setting of your locale (LANG variable and friends) when you start the gmond executable? The output definitely looks broken. Could you please file a bug on bugzilla? Cheers Martin --- Ludovic Drolez [EMAIL PROTECTED] wrote: Hi ! I installed the official Ganglian RPM on a SLES 10 x64. My graphs are really strange, and the percentage values show random characters. I've just found that the problem is in gmond, which sends random strings in the XML dialog. I've tried to recompile gmond, but I have still the same problem. Here's some of the strace output: = accept(6, {sa_family=AF_INET, sin_port=htons(43998), sin_addr=inet_addr(127.0.0.1)}, [17179869200]) = 9 write(9, ?xml version=\1.0\ encoding=\ISO-8859-1\ standalone=\yes\?\n!DOCTYPE GANGLIA_XML [\n !ELEMENT G..., 2328) = 2328 write(9, GANGLIA_XML VERSION=\3.0.3\ SOURCE=\gmond\\n, 45) = 45 write(9, CLUSTER NAME=\cluster\ LOCALTIME=\1166087533\ OWNER=\unspecified\ LATLONG=\unspecified\ URL=\unspe..., 108) = 108 write(9, HOST NAME=\master.localdomain\ IP=\192.168.0.106\ REPORTED=\1166087527\ TN=\5\ TMAX=\20\ DMAX=\0\ ..., 150) = 150 write(9, METRIC NAME=\disk_total\ VAL=\1A.\332\326\260\ TYPE=\double\ UNITS=\GB\ TN=\1500\ TMAX=\1200\ DMAX=\0\ SLOP..., 125) = 125 write(9, METRIC NAME=\cpu_speed\ VAL=\2993\ TYPE=\uint32\ UNITS=\MHz\ TN=\300\ TMAX=\1200\ DMAX=\0\ SLOPE=\..., 122) = 122 write(9, METRIC NAME=\part_max_used\ VAL=\7y.\n\ TYPE=\float\ UNITS=\\ TN=\60\ TMAX=\180\ DMAX=\0\ SLOPE=\bo..., 120) = 120 write(9, METRIC NAME=\swap_total\ VAL=\4194296\ TYPE=\uint32\ UNITS=\KB\ TN=\300\ TMAX=\1200\ DMAX=\0\ SLOP..., 125) = 125 write(9, METRIC NAME=\os_name\ VAL=\Linux\ TYPE=\string\ UNITS=\\ TN=\300\ TMAX=\1200\ DMAX=\0\ SLOPE=\zero..., 118) = 118 write(9, METRIC NAME=\cpu_user\ VAL=\2.F\ TYPE=\float\ UNITS=\%\ TN=\20\ TMAX=\90\ DMAX=\0\ SLOPE=\both\ SO..., 114) = 114 write(9, METRIC NAME=\cpu_system\ VAL=\3.0\ TYPE=\float\ UNITS=\%\ TN=\20\ TMAX=\90\ DMAX=\0\ 
SLOPE=\both\ ..., 116) = 116 = As you can see, there's garbage for disk_total, part_max_used, cpu_user... So all values of type float or double, are not properly converted. The SLES runs under Qemu. I've also added some printfs in the host_metric_value and here's what I get: On the left the float converted by apr_* and on the right the prinf(%f) !!! VALUE =2.G= =2.343750= VALUE =2.G= =2.343750= VALUE =9.Ö= =93.487236= VALUE =0.6o= =0.64= VALUE =0.1;= =0.119600= VALUE =0.00= =0.000311= VALUE =0.0= =0.00= VALUE =0.0= =0.00= VALUE =9.ê= =95.312500= VALUE =0.9= =0.94= VALUE =0.4Y= =0.42= VALUE =0.1;= =0.113054= VALUE =0.00= =0.000536= Any ideas ? Cheers, -- Ludovic DROLEZ Linbox / FreeALter Soft www.linbox.com www.linbox.org - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] [Ganglia-developers] Correct counting of CPUs, Cores, Siblings (bz #84)
Hi Jarod, thanks. Your and Ben's input was really useful for detecting patterns in 2.6 based configurations. What I now need is the output from 2.4 based configs. Only multi-core and/or HT-enabled systems actually. Thanks and have a Good New Year 2007 Martin --- Jarod Wilson [EMAIL PROTECTED] wrote: On Friday 22 December 2006 11:05, Martin Knoblauch wrote: Hi Folks, in order to fix bz#84 for Linux, I would like to collect some data from different system configurations. Could you please create the file cpu.grep and execute the cat/grep chain below. Please report the results together with uname -a output and which distro you are running. # more cpu.grep processor vendor model name physical id siblings core id cpu cores # cat /proc/cpuinfo | grep -f cpu.grep Here's the data from my Fedora Core 6 workstation in the office, since it's fairly interesting for this specific topic. It's a dual-socket, dual-core Xeon system with hyperthreading turned on, so two sockets, four cores, eight logical cpus... 
Linux xavier.boston.redhat.com 2.6.18-1.2849.fc6 #1 SMP Fri Nov 10 12:34:46 EST 2006 x86_64 x86_64 x86_64 GNU/Linux processor : 0 vendor_id : GenuineIntel model name : Intel(R) Xeon(TM) CPU 3.00GHz physical id : 0 siblings: 4 core id : 0 cpu cores : 2 processor : 1 vendor_id : GenuineIntel model name : Intel(R) Xeon(TM) CPU 3.00GHz physical id : 1 siblings: 4 core id : 0 cpu cores : 2 processor : 2 vendor_id : GenuineIntel model name : Intel(R) Xeon(TM) CPU 3.00GHz physical id : 0 siblings: 4 core id : 1 cpu cores : 2 processor : 3 vendor_id : GenuineIntel model name : Intel(R) Xeon(TM) CPU 3.00GHz physical id : 1 siblings: 4 core id : 1 cpu cores : 2 processor : 4 vendor_id : GenuineIntel model name : Intel(R) Xeon(TM) CPU 3.00GHz physical id : 0 siblings: 4 core id : 0 cpu cores : 2 processor : 5 vendor_id : GenuineIntel model name : Intel(R) Xeon(TM) CPU 3.00GHz physical id : 1 siblings: 4 core id : 0 cpu cores : 2 processor : 6 vendor_id : GenuineIntel model name : Intel(R) Xeon(TM) CPU 3.00GHz physical id : 0 siblings: 4 core id : 1 cpu cores : 2 processor : 7 vendor_id : GenuineIntel model name : Intel(R) Xeon(TM) CPU 3.00GHz physical id : 1 siblings: 4 core id : 1 cpu cores : 2 -- Jarod Wilson [EMAIL PROTECTED] - Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV ___ Ganglia-developers mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/ganglia-developers -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
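The counting that bz #84 asks for can be illustrated with a small sketch based on the fields Martin's cpu.grep collects. This is not the gmond metrics code; the function name count_cpu_topology and the fallback for kernels that do not report "physical id" are assumptions made only for illustration. It counts distinct "physical id" values as sockets, distinct (physical id, core id) pairs as cores, and "processor" entries as logical CPUs.

```python
def count_cpu_topology(cpuinfo_text):
    """Return (sockets, cores, logical_cpus) from /proc/cpuinfo-style text.

    Hypothetical helper for illustration only, not part of ganglia.
    """
    logical = 0
    sockets = set()
    cores = set()
    physical_id = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank line or a line without "key : value"
        key = key.strip()
        value = value.strip()
        if key == "processor":
            logical += 1
            physical_id = None  # a new logical CPU block starts here
        elif key == "physical id":
            physical_id = value
            sockets.add(value)
        elif key == "core id" and physical_id is not None:
            cores.add((physical_id, value))
    # Assumption: if the kernel reports no topology fields at all,
    # treat every logical CPU as its own socket and core.
    n_sockets = len(sockets) or logical
    n_cores = len(cores) or logical
    return n_sockets, n_cores, logical
```

Fed with Jarod's output above, this should give two sockets, four cores and eight logical CPUs, matching his description. On a live system the input would be the contents of /proc/cpuinfo.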
Re: [Ganglia-general] Ganglia+OpenBSD?
--- Carlo Marcelo Arenas Belon [EMAIL PROTECTED] wrote: On Wed, Dec 27, 2006 at 12:38:00AM -0800, Martin Knoblauch wrote: I see no problem to add OpenBSD support in 3.0.5. Just go on and check it in once you are satisfied with your stuff. checked it in already in revision 697. saw it. Just out of curiosity: how similar are the BSD flavours? We already have NetBSD and FreeBSD support in. I used NetBSD as a base for my port (as it is the closest), sadly they are not similar enough to just work with the other source, as you can see by the diff. Understand. Btw. you should check the use of the strings NetBSD / FreeBSD in your patch :-) DragonflyBSD will be most likely closer to FreeBSD and the same for MacOS X (AKA Darwin), but I have no interest in adding those yet (DragonFlyBSD could be an interesting option for clusters, but I've heard of no one using it in a cluster yet). You realize that we already have a Darwin port, although I do not know the quality/completeness of the metrics code. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Compatibility mode for gmetad?
--- Jason Faulkner [EMAIL PROTECTED] wrote: I'm curious about how possible or difficult it would be to make gmetad backwards compatible -- i.e. where I could leave my 2.5.x gmond installations alone, and install 3.x gmetad on my main server (and be able to collect stats despite having a heterogeneous 2.5.x and 3.x environment). This would allow me to (hopefully) live-migrate my ganglia install up to the new version. -- Jason Faulkner Systems Manager Broadwick Corporation (919) 459-2509 Hi Jason, although we bumped the major number in the 2.5.x - 3.0 transition, we took care to not introduce incompatible changes to the core metrics framework. In short, I see no reason why a 3.0.4 gmetad should not be able to query 2.5.x gmond data. It should even be possible to have a 3.0.4 gmond listen to older gmonds. Of course, you are limited to multicast until you have replaced all gmonds. Just try it out. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Windows port issues
--- Vladimir Vuksan [EMAIL PROTECTED] wrote: matt massie wrote: you need to install the cygwin sunrpc package which is not installed by default during the cygwin install... That was it. I still wasn't able to compile 3.0.4 (xdr_create? can't be found) however 3.0.3 compiles with no problem. could you be more specific on the error message? Is it compile time, or link time? There is no such thing as xdr_create. Maybe xdrmem_create. Who is the person that packaged it initially, since 3.0.3 corrects the Wait CPU issue, i.e. instead of showing 100% idle it shows 100% Wait CPU. Also it may be nice to include gmetric. Hmm. What package are you referring to? There is no official windows (cygwin) binary distribution. Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Compatibility mode for gmetad?
--- Jason Faulkner [EMAIL PROTECTED] wrote: Martin Knoblauch wrote: --- Jason Faulkner [EMAIL PROTECTED] wrote: I'm curious about how possible or difficult it would be to make gmetad backwards compatible -- i.e. where I could leave my 2.5.x gmond installations alone, and install 3.x gmetad on my main server (and be able to collect stats despite having a heterogeneous 2.5.x and 3.x environment). This would allow me to (hopefully) live-migrate my ganglia install up to the new version. -- Jason Faulkner Systems Manager Broadwick Corporation (919) 459-2509 Hi Jason, although we bumped the major number in the 2.5.x - 3.0 transition, we took care to not introduce incompatible changes to the core metrics framework. In short, I see no reason why a 3.0.4 gmetad should not be able to query 2.5.x gmond data. It should even be possible to have a 3.0.4 gmond listen to older gmonds. Of course, you are limited to multicast until you have replaced all gmonds. Jan 3 23:12:07 intranet1 ./gmetad[25006]: RRD_update (/var/lib/ganglia/rrds/Dev Login Servers/__SummaryInfo__/part_max_used.rrd): illegal attempt to update using time 1167883927 when last update time is 1167883927 (minimum one second step) I've been receiving repeated errors like this attempting to use a 3.0.x gmetad with a 2.5.7 gmond. The times are synced perfectly to a local NTP server, so I'm sure that's not the issue. Not an NTP issue, you are most likely right. The message tells that the current timestamp for the metric in question did not change from the previous invocation of the call. Does this only happen on part_max_used, or are other metrics showing up as well? part_max_used is likely changing very slowly, this might be an indicator. Also interesting to note that in your example the metric is not a host metric, but a summary metric. Does it prevent useful operation of the 3.0.x gmetad together with 2.5.7 gmonds? Or is it just annoying? 
Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
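The RRD error quoted above says that an update was attempted with the same timestamp as the previous update for that file. As an illustration of the rule only, here is a minimal sketch of a guard that skips such an update. It is not how gmetad works internally; the class StepGuard and its method should_update are hypothetical names invented for this example.

```python
class StepGuard:
    """Remember the last timestamp written per RRD file (hypothetical helper)."""

    def __init__(self):
        self._last = {}  # maps rrd path -> last timestamp accepted

    def should_update(self, rrd_path, timestamp):
        """Return True only if timestamp is strictly newer than the last one.

        RRD files reject an update whose time is not at least one second
        later than the previous update, which is the error quoted above.
        """
        last = self._last.get(rrd_path)
        if last is not None and timestamp <= last:
            return False
        self._last[rrd_path] = timestamp
        return True
```

With the timestamp from Jason's log, a second call for the same file and the same second would return False, so the duplicate update would be dropped instead of logged as an error.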
Re: [Ganglia-general] Windows port issues
--- Vladimir [EMAIL PROTECTED] wrote: Martin Knoblauch wrote: could you be more specific on the error message? Is it compile time, or link time? There is no such thing as xdr_create. Maybe xdrmem_create. Sorry I should have been more precise. It is a linking error. Here is the log gmond.o: In function `Ganglia_collection_group_send': /ganglia-3.0.4/gmond/gmond.c:1633: undefined reference to `_xdrmem_create' gmond.o: In function `main': /ganglia-3.0.4/gmond/gmond.c:897: undefined reference to `_xdrmem_create' /ganglia-3.0.4/gmond/gmond.c:828: undefined reference to `_xdr_free' /ganglia-3.0.4/gmond/gmond.c:912: undefined reference to `_xdr_free' ../lib/.libs/libganglia.a(libgmond.o): In function `Ganglia_gmetric_send': /ganglia-3.0.4/lib/libgmond.c:695: undefined reference to `_xdrmem_create' ../lib/.libs/libganglia.a(libgmond.o): In function `Ganglia_gmetric_send_spoof': /ganglia-3.0.4/lib/libgmond.c:748: undefined reference to `_xdrmem_create' ../lib/.libs/libganglia.a(protocol_xdr.o): In function `xdr_Ganglia_value_types': /ganglia-3.0.4/lib/protocol_xdr.c:13: undefined reference to `_xdr_enum' ../lib/.libs/libganglia.a(protocol_xdr.o): In function `xdr_Ganglia_gmetric_message': /ganglia-3.0.4/lib/protocol_xdr.c:23: undefined reference to `_xdr_string' /ganglia-3.0.4/lib/protocol_xdr.c:25: undefined reference to `_xdr_string' /ganglia-3.0.4/lib/protocol_xdr.c:27: undefined reference to `_xdr_string' /ganglia-3.0.4/lib/protocol_xdr.c:29: undefined reference to `_xdr_string' /ganglia-3.0.4/lib/protocol_xdr.c:31: undefined reference to `_xdr_u_int' /ganglia-3.0.4/lib/protocol_xdr.c:33: undefined reference to `_xdr_u_int' /ganglia-3.0.4/lib/protocol_xdr.c:35: undefined reference to `_xdr_u_int' ../lib/.libs/libganglia.a(protocol_xdr.o): In function `xdr_Ganglia_spoof_header': /ganglia-3.0.4/lib/protocol_xdr.c:45: undefined reference to `_xdr_string' /ganglia-3.0.4/lib/protocol_xdr.c:47: undefined reference to `_xdr_string' ../lib/.libs/libganglia.a(protocol_xdr.o): In 
function `xdr_Ganglia_message_formats': /ganglia-3.0.4/lib/protocol_xdr.c:69: undefined reference to `_xdr_enum' ../lib/.libs/libganglia.a(protocol_xdr.o): In function `xdr_Ganglia_message': /ganglia-3.0.4/lib/protocol_xdr.c:116: undefined reference to `_xdr_u_int' /ganglia-3.0.4/lib/protocol_xdr.c:124: undefined reference to `_xdr_string' /ganglia-3.0.4/lib/protocol_xdr.c:151: undefined reference to `_xdr_float' /ganglia-3.0.4/lib/protocol_xdr.c:156: undefined reference to `_xdr_double' /ganglia-3.0.4/lib/protocol_xdr.c:95: undefined reference to `_xdr_u_short' ../lib/.libs/libganglia.a(protocol_xdr.o): In function `xdr_Ganglia_25metric': /ganglia-3.0.4/lib/protocol_xdr.c:170: undefined reference to `_xdr_int' /ganglia-3.0.4/lib/protocol_xdr.c:172: undefined reference to `_xdr_string' /ganglia-3.0.4/lib/protocol_xdr.c:174: undefined reference to `_xdr_int' /ganglia-3.0.4/lib/protocol_xdr.c:178: undefined reference to `_xdr_string' /ganglia-3.0.4/lib/protocol_xdr.c:180: undefined reference to `_xdr_string' /ganglia-3.0.4/lib/protocol_xdr.c:182: undefined reference to `_xdr_string' /ganglia-3.0.4/lib/protocol_xdr.c:184: undefined reference to `_xdr_int' collect2: ld returned 1 exit status make[3]: *** [gmond.exe] Error 1 make[2]: *** [all-recursive] Error 1 make[1]: *** [all-recursive] Error 1 make: *** [all] Error 2 OK, seems ld is unable to find all of the xdr functions. Maybe someone removed a library from the library list. Although under Linux those functions are in libc. Hmm. What package are you refering to? There is no official windows (cygwin) binary distribution. Perhaps it is unofficial but it is on SourceForge e.g. http://downloads.sourceforge.net/ganglia/ganglia-3.0.0-setup.exe?modtime=1107790662big_mirror=0 Ah. I forgot about this one. And I do not recall who donated the work. I am adding the developers list. Apparently, the installer was never updated after the initial release. 
Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Two similar linux hosts provides different metrics
Hi Vitaly, what does ps axl show on both hosts, as that is basically what gmond looks at? If it is already different there, the problem is not ganglia related. (OK, I see you already checked ...) What are the load averages according to uptime? Cheers Martin --- Vitaly Karasik [EMAIL PROTECTED] wrote: Hi, I have a weird problem - two linux hosts with similar configuration provide very different metrics about number of running processes - one shows about 2, and second about 20-40 (I speak about concentrated load graph at top right.) proc_total is different too - 171 vs. 350 (BTW, ps -ef |wc == 61 on both boxes) Both machines are RHEL3 kernel 2.4.21-37.ELsmp with ganglia-gmond-3.0.3-1 installed from RPM. Any ideas? Thanks, Vitaly -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Two similar linux hosts provides different metrics
Hi Vitaly, where do you see the invalid numbers: a) in the gmond XML Stream (telnet/nc to the gmond XML port) b) in the XML Stream from gmetad (telnet/nc to the gmetad XML port) c) only in the web-frontend Cheers Martin --- Vitaly Karasik [EMAIL PROTECTED] wrote: NON-BUSY HOST: # ps axl|wc 61 8625865 # uptime 08:54:55 up 204 days, 2:00, 1 user, load average: 0.00, 0.00, 0.00 BUSY HOST ]# ps axl|wc 62 8775977 ]# uptime 08:55:18 up 31 days, 16:30, 1 user, load average: 0.04, 0.01, 0.00 -Original Message- From: Martin Knoblauch [mailto:[EMAIL PROTECTED] Sent: Thursday, January 11, 2007 10:54 AM To: Vitaly Karasik; ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Two similar linux hosts provides different metrics Hi Vitaly, what does ps axl show on both hosts, as that is basically what gmond looks at? If it is already different there, the problem is not ganglia related. (OK, I see you already checked ...) What are the load averages according to uptime? Cheers Martin --- Vitaly Karasik [EMAIL PROTECTED] wrote: Hi, I have a weird problem - two linux hosts with similar configuration provide very different metrics about number of running processes - one shows about 2, and second about 20-40 (I speak about concentrated load graph at top right.) proc_total is different too - 171 vs. 350 (BTW, ps -ef |wc == 61 on both boxes) Both machines are RHEL3 kernel 2.4.21-37.ELsmp with ganglia-gmond-3.0.3-1 installed from RPM. Any ideas? Thanks, Vitaly -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] PBS Queue visualisation
Adam, look at the report/compound graphs in web/graph.php They should basically do what you want. Cheers Martin --- Adam Gray [EMAIL PROTECTED] wrote: I'm running ganglia on a cluster managed with OpenPBS. I have made a few extra metrics for monitoring CPU temp and batch system jobs on each node. I was wondering how I could go about making a sort of cluster queue usage graph. Each queue would pile on top of each other the number of nodes it is using. E.g. if queue1 was using 24 of 124 available nodes, and queue2 was using 96, there would be a section at the bottom 20% and a different colored section on the next 75%, and the top 5% would be empty. -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] XML error: no element found at 1
Ashutosh, you need to do a query if you use port 8562 (the web interface does). What happens if you do telnet localhost 8561. That should give you the complete gmetad XML stream. Is the rrdroot directory writable to the owner of the gmetad process? It should belong to e.g. nobody. This is a common mistake. cheers Martin --- Ashutosh Mahajan [EMAIL PROTECTED] wrote: hello everyone, We are having problems installing ganglia version 3.0.4 with rrdtool-1.2.15. we can successfully do make, make install. gstat -a also seems to work. telnet localhost 8649 seems to throw out correct XML file. However, gmetad seems to be having some problems. telnet localhost 8652 seems to hang forever with the message: Trying 127.0.0.1... Connected to localhost. Escape character is '^]'. if i access ganglia through the web, i get this message after a long long time: There was an error collecting ganglia data (192.168.1.1:8652): XML error: no element found at 1 rrd_rootdir also remains empty. what could be wrong? i can provide more details if necessary. thanks in advance. -- Regards Ashutosh www.lehigh.edu/~asm4 -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Two similar linux hosts provides different metrics
Vitaly, in this case try to run gmond with a debug level higher that 2. Maybe this sheds some light on it. Or, you could add debug statements to the proc_run_func and proc_total_func code. But: first of all show us the output of cat /proc/loadavg on both nodes. cheers Martin --- Vitaly Karasik [EMAIL PROTECTED] wrote: It seems like we have different numbers in gmond: HOST NAME=5.5.5.5 IP=5.5.5.5 REPORTED=1168934873 TN=2 TMAX=20 DMAX=0 LOCATION=unspecified GMOND_STARTED=1166534354 .. METRIC NAME=proc_total VAL=185 TYPE=uint32 UNITS= TN=229 TMAX=950 DMAX=0 SLOPE=both SOURCE=gmond/ .. METRIC NAME=proc_run VAL=0 TYPE=uint32 UNITS= TN=229 TMAX=950 DMAX=0 SLOPE=both SOURCE=gmond/ HOST NAME=5.5.5.6 IP=5.5.5.6 REPORTED=1168934871 TN=3 TMAX=20 DMAX=0 LOCATION=unspecified GMOND_STARTED=1166534349 METRIC NAME=proc_run VAL=15 TYPE=uint32 UNITS= TN=68 TMAX=950 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=proc_total VAL=439 TYPE=uint32 UNITS= TN=68 TMAX=950 DMAX=0 SLOPE=both SOURCE=gmond/ Thanks, Vitaly -Original Message- From: Martin Knoblauch [mailto:[EMAIL PROTECTED] Sent: Monday, January 15, 2007 12:30 PM To: Vitaly Karasik; ganglia-general@lists.sourceforge.net Subject: RE: [Ganglia-general] Two similar linux hosts provides different metrics Hi Vitaly, where do you see the invalid numbers: a) in the gmond XML Stream (telnet/nc to the gmond XML port) b) in the XML Stream from gmetad (telnet/nc to the gmetad XML port) c) only in the web-frontend Cheers Martin --- Vitaly Karasik [EMAIL PROTECTED] wrote: NON-BUSY HOST: # ps axl|wc 61 8625865 # uptime 08:54:55 up 204 days, 2:00, 1 user, load average: 0.00, 0.00, 0.00 BUSY HOST ]# ps axl|wc 62 8775977 ]# uptime 08:55:18 up 31 days, 16:30, 1 user, load average: 0.04, 0.01, 0.00 -Original Message- From: Martin Knoblauch [mailto:[EMAIL PROTECTED] Sent: Thursday, January 11, 2007 10:54 AM To: Vitaly Karasik; ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Two similar linux hosts provides different metrics Hi 
Vitaly, what does ps axl show on both hosts, as that is basically what gmond looks at? If it is already different there, the problem is not ganglia related. (OK, I see you already checked ...) What are the load averages according to uptime? Cheers Martin --- Vitaly Karasik [EMAIL PROTECTED] wrote: Hi, I have a weird problem - two linux hosts with similar configuration provide very different metrics about number of running processes - one shows about 2, and second about 20-40 (I speak about concentrated load graph at top right.) proc_total is different too - 171 vs. 350 (BTW, ps -ef |wc == 61 on both boxes) Both machines are RHEL3 kernel 2.4.21-37.ELsmp with ganglia-gmond-3.0.3-1 installed from RPM. Any ideas? Thanks, Vitaly -- --- Take Surveys. Earn Cash. Influence the Future of IT Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys - and earn cash http://www.techsay.com/default.php?page=join.phpp=sourceforge CID=DEVDEV ___ Ganglia-general mailing list Ganglia-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/ganglia-general -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
Re: [Ganglia-general] Two similar linux hosts provides different metrics
Vitaly, gmond on Linux just interprets the fourth field of /proc/loadavg. The number in front of the slash is the number of running processes, the number following the slash is the total number of processes. Cheers Martin --- Vitaly Karasik [EMAIL PROTECTED] wrote: .5: cat /proc/loadavg 0.04 0.06 0.01 1/185 10512 .6: cat /proc/loadavg 1.03 1.01 1.00 2/441 19965 Oops! I think I'm starting to understand - the number of processes on both machines is the same, but the number of threads is different. probably gmond counts threads, not processes: .5: ps -ef|wc 64 ps -efm|wc 187 .6: ps -ef|wc 62 ps -efm|wc 441 -Original Message- From: Martin Knoblauch [mailto:[EMAIL PROTECTED] Sent: Tuesday, January 16, 2007 11:59 AM To: Vitaly Karasik; [EMAIL PROTECTED]; ganglia-general@lists.sourceforge.net Subject: RE: [Ganglia-general] Two similar linux hosts provides different metrics Vitaly, in this case try to run gmond with a debug level higher than 2. Maybe this sheds some light on it. Or, you could add debug statements to the proc_run_func and proc_total_func code. But: first of all show us the output of cat /proc/loadavg on both nodes. cheers Martin --- Vitaly Karasik [EMAIL PROTECTED] wrote: It seems like we have different numbers in gmond: HOST NAME=5.5.5.5 IP=5.5.5.5 REPORTED=1168934873 TN=2 TMAX=20 DMAX=0 LOCATION=unspecified GMOND_STARTED=1166534354 .. METRIC NAME=proc_total VAL=185 TYPE=uint32 UNITS= TN=229 TMAX=950 DMAX=0 SLOPE=both SOURCE=gmond/ .. 
METRIC NAME=proc_run VAL=0 TYPE=uint32 UNITS= TN=229 TMAX=950 DMAX=0 SLOPE=both SOURCE=gmond/ HOST NAME=5.5.5.6 IP=5.5.5.6 REPORTED=1168934871 TN=3 TMAX=20 DMAX=0 LOCATION=unspecified GMOND_STARTED=1166534349 METRIC NAME=proc_run VAL=15 TYPE=uint32 UNITS= TN=68 TMAX=950 DMAX=0 SLOPE=both SOURCE=gmond/ METRIC NAME=proc_total VAL=439 TYPE=uint32 UNITS= TN=68 TMAX=950 DMAX=0 SLOPE=both SOURCE=gmond/ Thanks, Vitaly -Original Message- From: Martin Knoblauch [mailto:[EMAIL PROTECTED] Sent: Monday, January 15, 2007 12:30 PM To: Vitaly Karasik; ganglia-general@lists.sourceforge.net Subject: RE: [Ganglia-general] Two similar linux hosts provides different metrics Hi Vitaly, where do you see the invalid numbers: a) in the gmond XML Stream (telnet/nc to the gmond XML port) b) in the XML Stream from gmetad (telnet/nc to the gmetad XML port) c) only in the web-frontend Cheers Martin --- Vitaly Karasik [EMAIL PROTECTED] wrote: NON-BUSY HOST: # ps axl|wc 61 8625865 # uptime 08:54:55 up 204 days, 2:00, 1 user, load average: 0.00, 0.00, 0.00 BUSY HOST ]# ps axl|wc 62 8775977 ]# uptime 08:55:18 up 31 days, 16:30, 1 user, load average: 0.04, 0.01, 0.00 -Original Message- From: Martin Knoblauch [mailto:[EMAIL PROTECTED] Sent: Thursday, January 11, 2007 10:54 AM To: Vitaly Karasik; ganglia-general@lists.sourceforge.net Subject: Re: [Ganglia-general] Two similar linux hosts provides different metrics Hi Vitaly, what does ps axl show on both hosts, as that is basically what gmond looks at? If it is already different there, the problem is not ganglia related. (OK, I see you already checked ...) What are the load averages according to uptime? Cheers Martin --- Vitaly Karasik [EMAIL PROTECTED] wrote: Hi, I have a weird problem - two linux hosts with similar configuration provide very different metrics about number of running processes - one shows about 2, and second about 20-40 (I speak about concentrated load graph at top right.) proc_total is different too - 171 vs. 
350 (BTW, ps -ef |wc == 61 on both boxes) Both machines are RHEL3 kernel 2.4.21-37.ELsmp with ganglia-gmond-3.0.3-1 installed from RPM. Any ideas? Thanks, Vitaly
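Martin's explanation of /proc/loadavg in this thread can be sketched in a few lines. This is only an illustration of the field layout, not the gmond source; the function name parse_loadavg is made up for the example. The fourth whitespace-separated field has the form running/total.

```python
def parse_loadavg(text):
    """Split the fourth field of /proc/loadavg into (running, total).

    Hypothetical helper for illustration; on Linux the second number
    counts scheduling entities, which is consistent with Vitaly's
    observation that threads are included.
    """
    fields = text.split()
    running, total = fields[3].split("/")
    return int(running), int(total)
```

Applied to the two lines Vitaly posted, the first host yields 1 running out of 185 and the second 2 running out of 441, which is close to the proc_total values gmond reported for those hosts (185 and 439).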
Re: [Ganglia-general] XML error: no element found at 1
Hi Ashutosh, sorry for the wrong port. I meant of course 8651. You could try to run gmetad with a high debug level. This could help to track down the problem. Also, could you please post the gmetad.conf file? Cheers Martin --- Ashutosh Mahajan [EMAIL PROTECTED] wrote: Quoting Martin Knoblauch [EMAIL PROTECTED]: Ashutok, you need to do a query if you use port 8562 (the web interface does). What happens if you do telnet localhost 8561. That should give you the complete gmetad XML stream. thanks for the prompt reply. you meant 8651, rather than 8561? [EMAIL PROTECTED] ~]$ telnet localhost 8651 Trying 127.0.0.1... Connected to localhost. Escape character is '^]'. seems to hang forever there. Is the rrdroot directory writable to the owner of the gmetad process? It should belong to e.g. nobody. This is a common mistake. yeah. it is writable. cheers Martin --- Ashutosh Mahajan [EMAIL PROTECTED] wrote: hello everyone, We are having problems installing ganglia version 3.0.4 with rrdtool-1.2.15. we can successfully do make, make install. gstat -a also seems to work. telnet localhost 8649 seems to throw out correct XML file. However, gmetad seems to be having some problems. telnet localhost 8652 seems to hang forever with the message: Trying 127.0.0.1... Connected to localhost. Escape character is '^]'. if i access ganglia through the web, i get this message after a long long time: There was an error collecting ganglia data (192.168.1.1:8652): XML error: no element found at 1 rrd_rootdir also remains empty. what could be wrong? i can provide more details if necessary. thanks in advance. -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
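For readers following this thread, the settings discussed above (the two gmetad ports, the RRD directory and its owner, the debug level) all live in gmetad.conf. The fragment below is only a sketch: the cluster name, the host and the rrd_rootdir path are placeholders, not values taken from Ashutosh's setup.

```
# gmetad.conf sketch; "my cluster", localhost and the path are placeholders
data_source "my cluster" localhost:8649

# Directory where gmetad writes the RRDs; it must be writable by the
# user named in setuid_username (e.g. nobody).
rrd_rootdir "/var/lib/ganglia/rrds"
setuid_username "nobody"

# 8651 dumps the complete XML stream on connect;
# 8652 waits for a query before it answers (the web frontend uses it).
xml_port 8651
interactive_port 8652

# Raise this while troubleshooting.
debug_level 1
```

With this layout, a plain telnet to 8652 sitting silent is expected until a query is sent, while a silent 8651 points at gmetad itself.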
Re: [Ganglia-general] mcast_ttl in 3.0 gmond.conf
--- Ian Cunningham [EMAIL PROTECTED] wrote: Gil, Gilad Raphaelli wrote: Hello, I'm having a problem increasing gmond's multicast packet ttl. I've tried putting mcast_ttl on a line of its own and inside the global { } and udp_send_channel {} directives and always get gmond.conf parsing errors when trying to start gmond-3.0.4. Any pointers on where mcast_ttl can be set? The error message is: gmond.conf:200: no such option 'mcast_ttl' Finally, mcast_ttl doesn't appear in gmond -t - has this functionality been removed altogether? Thanks, Gil I no longer use multicast so I am not sure it works, but from looking at the source code, it looks like it was changed to 'ttl' under 'udp_send_channel'. which is even correctly documented in the shipping tarball. We should update the stuff on the web page though ... Cheers Martin -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
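As a sketch of what Ian and Martin describe, the TTL would go inside the send channel rather than in a global mcast_ttl line. The multicast group shown is assumed to be the stock default and the TTL value is only an example; check both against the gmond.conf shipped in your tarball.

```
/* gmond.conf (3.0.x syntax): the multicast TTL is set per send channel */
udp_send_channel {
  mcast_join = 239.2.11.71   /* assumed: the stock multicast group */
  port = 8649
  ttl = 4                    /* example value, pick what your network needs */
}
```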
Re: [Ganglia-general] Help! I have a petabyte/s network
David, as far as I remember, the AIX metrics code had an overflow/wrap-around problem prior to 3.0.4. Maybe the fixes are not thorough enough. The packets/sec are of course less affected. Cheers Martin --- David Wong [EMAIL PROTECTED] wrote: Ganglia is reporting that I'm pushing up to 200 Petabytes/s through my network. Nobody tell the network admin! I'm running Ganglia 3.0.4 with the Power5 add-ons on AIX5.3 Bytes in and out statistics generally appear to have the right values. However at random times, I get spikes in the petabytes/s range. Here's a dump of the bytes_in database. At first, I suspected perhaps these coincide with some counters getting reset, but they don't occur at regular intervals.
<!-- 2007-03-27 20:42:00 GMT / 1175028120 --> <row><v> 1.9268390706e+05 </v></row>
<!-- 2007-03-27 20:48:00 GMT / 1175028480 --> <row><v> 1.5833184975e+05 </v></row>
<!-- 2007-03-27 20:54:00 GMT / 1175028840 --> <row><v> 1.6838302753e+05 </v></row>
<!-- 2007-03-27 21:00:00 GMT / 1175029200 --> <row><v> 1.3766069592e+05 </v></row>
<!-- 2007-03-27 21:06:00 GMT / 1175029560 --> <row><v> 2.1711888414e+05 </v></row>
<!-- 2007-03-27 21:12:00 GMT / 1175029920 --> <row><v> 4.9959709273e+16 </v></row>
<!-- 2007-03-27 21:18:00 GMT / 1175030280 --> <row><v> 1.7401339783e+05 </v></row>
<!-- 2007-03-27 21:24:00 GMT / 1175030640 --> <row><v> 2.0955720861e+05 </v></row>
<!-- 2007-03-27 21:30:00 GMT / 1175031000 --> <row><v> 1.9032255300e+05 </v></row>
<!-- 2007-03-27 21:36:00 GMT / 1175031360 --> <row><v> 1.9162727036e+05 </v></row>
<!-- 2007-03-27 21:42:00 GMT / 1175031720 --> <row><v> 1.2703790825e+05 </v></row>
Can anyone shed light on what might be happening? Any pointers for debugging? David Wong Senior Systems Engineer Management Dynamics, Inc. Phone: 201-804-6127 [EMAIL PROTECTED] -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
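To illustrate the kind of wrap-around Martin mentions, here is a minimal C sketch of turning two samples of a cumulative byte counter into a rate. It is not the Ganglia AIX code; the function names are made up for this example, and whether this mechanism is what produces David's spikes has not been established in the thread. The point is only that an unsigned 64-bit subtraction wraps modulo 2^64 when the counter moves backwards, which yields an enormous rate unless it is guarded.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of Ganglia: bytes/sec from two samples of a
 * cumulative counter. Returns 0.0 when the counter moved backwards (reset or
 * wrap) or when the interval is not positive, instead of reporting a spike. */
static double counter_rate(uint64_t last, uint64_t cur, double seconds)
{
    if (seconds <= 0.0 || cur < last)
        return 0.0;
    return (double)(cur - last) / seconds;
}

/* Unguarded version for comparison: if cur < last, the unsigned subtraction
 * wraps modulo 2^64 and the result is a huge positive number. */
static double naive_rate(uint64_t last, uint64_t cur, double seconds)
{
    return (double)(cur - last) / seconds;
}
```

Calling naive_rate(5000000, 1000, 20.0) gives a value on the order of 1e17 to 1e18 bytes/s, while counter_rate with the same arguments returns 0.0. Dropping the sample is just one possible policy; a real fix might instead add the wrap amount when a genuine 32-bit or 64-bit rollover is known to have happened.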
Re: [Ganglia-general] Help! I have a petabyte/s network
David, good catch. I will have to look at it for a bit. Cheers Martin --- David Wong [EMAIL PROTECTED] wrote: I don't write much code nowadays, so I'm going to need a lot of help with this. I dug through the ganglia code and I found this interesting tidbit in libmetrics/aix/metrics.c which may be indicative of the problem. There's an assignment from cur_ninfo.ibytes to cur_net_stat.ibytes, but the types of the two variables are different. net_stat::ibytes is a double:
struct net_stat{
  double ipackets;
  double opackets;
  double ibytes;
  double obytes;
} cur_net_stat;
and we have *ninfo declared here:
perfstat_netinterface_total_t ninfo[2],*last_ninfo, *cur_ninfo ;
libperfstat.h has perfstat_netinterface_total_t::ibytes as u_longlong_t. Does this code try to do what I think it is doing, i.e. assign an unsigned 64-bit integer to a signed 64-bit integer? I'm willing to test the code if someone who's more adept at coding and building will take on the challenge. It looks to me that the type mismatch will have to be fixed in a few places, such as CALC_NETSTAT, and we'll have to add an unsigned long long to g_val_t too. Those are the ones I can see so far. David Wong Senior Systems Engineer Management Dynamics, Inc. Phone: 201-804-6127 [EMAIL PROTECTED] -- Martin Knoblauch email: k n o b
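On David's type question: going by the struct he quotes, the target fields are double, so the assignment converts an unsigned 64-bit integer to floating point rather than to a signed 64-bit integer. The C sketch below shows what that conversion does and does not preserve. counter_t is a stand-in for the AIX u_longlong_t, and both helper names are invented for this example; nothing here is taken from libmetrics.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the AIX u_longlong_t counter type. */
typedef uint64_t counter_t;

/* Non-zero when a counter value survives a round trip through double
 * unchanged. A double stores integers exactly only up to 2^53. */
static int exact_as_double(counter_t v)
{
    double d = (double)v;
    /* Guard: values near UINT64_MAX round up to 2^64, which does not fit
     * back into a 64-bit unsigned integer. */
    if (d >= 18446744073709551616.0)
        return 0;
    return (counter_t)d == v;
}

/* Difference of two samples computed in floating point: a counter that
 * moved backwards yields a negative number instead of wrapping. */
static double delta_as_double(counter_t last, counter_t cur)
{
    return (double)cur - (double)last;
}
```

So the conversion to double by itself loses low-order precision for very large counters but does not flip the sign; where the subtraction happens (on the integers or on the doubles) matters more for how a counter that went backwards shows up.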
Re: [Ganglia-general] gmetad patch to contact random data_source hosts
Tim, your diff command looks a bit surprising to me. The revision number looks like CVS to me and we have been on SVN for quite some time. Which version of Ganglia have you checked out? Cheers Martin --- Witham, Timothy D [EMAIL PROTECTED] wrote: Hi, I just had a situation where the first host in a gmetad data_source accepts the connection but offers no data, like this: poll() timeout for [clustername] data source after 0 bytes read Gmetad always tries the sources in order and so it just keeps getting stuck on this first one, and losing the data for the entire cluster. Here is a quick patch that tries random hosts from the list instead, and solved my problem. It is not careful to make sure it tried them all, but if it fails it will just try again next time. If someone wants to fix it to try all the sources in a random order, that would be fine. Perhaps this could be included in the next release unless someone knows a good reason to always try the sources in order. Thanks!
--8<--
diff -c -r1.1.1.1 data_thread.c
*** data_thread.c 19 Mar 2007 18:52:32 - 1.1.1.1
--- data_thread.c 28 Mar 2007 18:12:08 -
***************
*** 18,24 ****
  void *
  data_thread ( void *arg )
  {
!    int i, sleep_time, bytes_read, rval;
     data_source_list_t *d = (data_source_list_t *)arg;
     g_inet_addr *addr;
     g_tcp_socket *sock=0;
--- 18,24 ----
  void *
  data_thread ( void *arg )
  {
!    int i, j, sleep_time, bytes_read, rval;
     data_source_list_t *d = (data_source_list_t *)arg;
     g_inet_addr *addr;
     g_tcp_socket *sock=0;
***************
*** 60,75 ****
        if(d->last_good_index >= 0)
          sock = g_tcp_socket_new ( d->sources[d->last_good_index] );

!       /* If there was no good connection last time or the above connect failed then try each host in the list. */
        if(!sock)
          {
!           for(i=0; i < d->num_sources; i++)
              {
!               /* Find first viable source in list. */
!               sock = g_tcp_socket_new ( d->sources[i] );
                if( sock )
                  {
!                   d->last_good_index = i;
                    break;
                  }
              }
--- 60,80 ----
        if(d->last_good_index >= 0)
          sock = g_tcp_socket_new ( d->sources[d->last_good_index] );

!       /* If there was no good connection last time or the above
!          connect failed then try random hosts in the list. We try
!          random ones in case someone is accepting the connection
!          but refusing to provide any data; we don't want to get
!          stuck with a non-working host. */
        if(!sock)
          {
!           for(i=0; i < d->num_sources * 2; i++)
              {
!               /* Find random viable source in list. */
!               j = d->num_sources * (rand() / (RAND_MAX - 1.0));
!               sock = g_tcp_socket_new ( d->sources[j] );
                if( sock )
                  {
!                   d->last_good_index = j;
                    break;
                  }
              }
--8<--
-- [EMAIL PROTECTED]; I don't speak for Intel or anyone. -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
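Tim invites someone to make the patch try all the sources in random order. The following C sketch shows one way that could look. It is not gmetad code: the callback type, accepts_only and pick_source are hypothetical names for illustration, and in gmetad the callback would presumably wrap the g_tcp_socket_new call on the chosen source. One detail worth noting from the posted patch: rand() / (RAND_MAX - 1.0) reaches 1.0 when rand() returns RAND_MAX - 1 or more, so j can land one past the end of the list; dividing by RAND_MAX + 1.0 keeps the ratio strictly below 1.0.

```c
#include <assert.h>
#include <stdlib.h>

/* Random index in [0, n); n must be > 0. Dividing by RAND_MAX + 1.0 keeps
 * the ratio strictly below 1.0, so the result never reaches n. */
static int random_index(int n)
{
    assert(n > 0);
    return (int)(n * (rand() / (RAND_MAX + 1.0)));
}

/* Fill order[0..n-1] with 0..n-1 in random order (Fisher-Yates shuffle). */
static void shuffled_order(int *order, int n)
{
    int i;
    for (i = 0; i < n; i++)
        order[i] = i;
    for (i = n - 1; i > 0; i--) {
        int j = random_index(i + 1);
        int tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }
}

/* Hypothetical connect callback: non-zero if source `idx` is usable. */
typedef int (*try_source_fn)(int idx, void *ctx);

/* Demonstration callback: only the index stored in *ctx "connects". */
static int accepts_only(int idx, void *ctx)
{
    return idx == *(int *)ctx;
}

/* Try every source exactly once, in random order.
 * Returns the index that worked, or -1 if none did. */
static int pick_source(int n, try_source_fn try_source, void *ctx)
{
    int *order, i, found = -1;

    if (n <= 0)
        return -1;
    order = malloc((size_t)n * sizeof *order);
    if (!order)
        return -1;
    shuffled_order(order, n);
    for (i = 0; i < n && found < 0; i++)
        if (try_source(order[i], ctx))
            found = order[i];
    free(order);
    return found;
}
```

Compared with drawing 2 * num_sources random indices, the shuffle visits each source once per pass, so a single working host is always found if it exists. This still does not address Tim's original symptom of a host that accepts the connection but sends no data; that would need the caller to clear last_good_index after a read timeout.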
Re: [Ganglia-general] Gmetad and web frontend on different machines.
Richard, depending on the cluster size, writing the RRDs via NFS might turn out to be a huge bottleneck. Cheers Martin --- [EMAIL PROTECTED] wrote: Saundry, It sort of looks like you can, but actually you can't. gmetad writes to rrd databases as local files, and the web and php read rrd databases as local (actually it invokes rrdtool itself). I imagine you could separate the two using NFS filesystems, but I have not tried this. kind regards, Richard Grevis Production Architecture Barclays Capital, Canary Wharf, London, E14 4BB -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of saundrya mishra Sent: 29 March 2007 14:30 To: ganglia-general@lists.sourceforge.net Subject: [Ganglia-general] Gmetad and web frontend on different machines. Hi There, I am new to Ganglia. Can we have gmetad and web frontend for a cluster to be running on two different machines?? If yes, then how is it possible since i read in the configuration file of the web frontend that the RRDTool databases need to be local to be read? Greetings, Saundrya. -- Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de