[Ganglia-general] Views: self-defined cluster reports in JSON in views
Hello,

we run Ganglia on a couple of AIX systems. Each physical box is a cluster in Ganglia, and each LPAR is a host. Some LPARs are in CPU pools; each physical box can have none, one, or more CPU pools. A JSON report for a pool looks like this:

{
  "report_name": "box1_pool_0_report",
  "report_type": "standard",
  "title": "pool 0 report",
  "vertical_label": "CPU Uses",
  "series": [
    { "hostname": "lpar1", "clustername": "box1", "metric": "cpu_used",
      "color": "00ff00", "label": "lpar1", "type": "stack" },
    { "hostname": "lpar2", "clustername": "box1", "metric": "cpu_used",
      "color": "ff", "label": "lpar2", "type": "stack" },
    { "hostname": "lpar1", "clustername": "box1", "metric": "cpu_in_pool",
      "color": "00", "label": "CPU in Pool", "line_width": 2, "type": "line" }
  ]
}

When the report is displayed in the cluster context, the title includes the cluster name. Since we have lots of boxes and lots of LPARs, I tried to add all pool reports into a view:

{
  "view_name": "Pools",
  "view_type": "standard",
  "items": [
    { "graph": "box1_pool_0_report" },
    { "graph": "box1_pool_1_report" },
    { "graph": "box2_pool_0_report" },
    { "graph": "box2_pool_1_report" },
    ...
  ]
}

When I display this view, the title only shows the grid name and the "pool N report" title from the report definition. It would be nice to have the cluster name displayed in the view, but I don't want it displayed twice in the cluster context. Any idea how this can be solved?

Thanks in advance
Jochen
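Not from the original message, but one workaround that stays within the report format shown above: since a view references reports by name, you could define a second report per pool whose "title" embeds the cluster name, point the view at those, and keep the original short-titled reports for the cluster pages. A sketch:

{
  "report_name": "box1_pool_0_view_report",
  "report_type": "standard",
  "title": "box1 pool 0 report",
  "vertical_label": "CPU Uses",
  "series": [ ...the same three series as in box1_pool_0_report... ]
}

The view would then list { "graph": "box1_pool_0_view_report" } instead. The obvious cost is maintaining two report definitions per pool, so this is mainly attractive if the reports are generated by a script anyway.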
Re: [Ganglia-general] Impact of gmond polling on data collection
Hi Chris,

I've discovered there are two contributing factors to problems like this:

1. The number of metrics being sent (possibly in short bursts) can overflow the UDP receive buffer.
2. The time it takes to process metrics in the UDP receive buffer causes TCP connections from the gmetads to time out (currently hard-coded to 10 seconds).

In your case, you are probably dropping UDP packets because gmond can't keep up. Gmond was enhanced back in April to allow you to increase the UDP buffer size. I suggest you upgrade to the latest version and set this to a sensible value for your environment:

udp_recv_channel {
  port = 1234
  buffer = 1024000
}

Determining what is sensible is a bit of trial and error. Run "netstat -su" and keep increasing the value until you no longer see the number of packet receive errors going up:

$ netstat -su
Udp:
    7941393 packets received
    23 packets to unknown port received.
    0 packet receive errors
    10079118 packets sent

The other possibility is that it takes so long for a gmetad to pull back all the metrics you are collecting for a cluster that you are preventing the gmond from processing metric data received via UDP. Again, this can cause the UDP receive buffer to overflow.

The problem we had at my work is related to all of the above but manifested itself in a slightly different way. We were seeing gaps in all our graphs because at times none of the servers in a cluster would respond to a gmetad poll within 10 seconds. I used to think that the gmond was completely hung, but realised that it would respond normally most of the time; every minute or so, though, it would take about 20-25 seconds. This happened to coincide with the UDP receive queue growing (Recv-Q column below), and I realised that it took this long for the gmond to process the metric data it had received via UDP from all the other servers in the cluster.

$ netstat -ua
Active Internet connections (servers and established)
Proto Recv-Q  Send-Q Local Address
udp   1920032      0 *:8649        *:*

The solution was to modify gmond and move the TCP request handler into a separate thread, so that gmond could take as long as it needed to process incoming metric data (from a UDP receive buffer that is large enough not to overflow) without blocking on the TCP requests for the XML data.

The patched gmond is running without a problem in our environment, so I have submitted a pull request [1] for it to be included in trunk. I can't be 100% sure that this patch will fix your problem, but it would be worth a try.

Regards,
Nick

[1] https://github.com/ganglia/monitor-core/pull/50

On Sat, Sep 15, 2012 at 12:16 AM, Chris Burroughs chris.burrou...@gmail.com wrote:

We use ganglia to monitor 500 hosts in multiple datacenters with about 90k unique host:metric pairs per DC. We use this data for all of the cool graphs in the web UI and for passive alerting. One of our checks is to measure TN of load_one on every box (we want to make sure gmond is working and correctly updating metrics; otherwise we could be blind and not know it). We consider it a failure if TN is > 600. This is an arbitrary number, but 10 minutes seemed plenty long. Unfortunately we are seeing this check fail far too often.

We set up two parallel gmetad instances (monitoring identical gmonds) per DC and have broken our problem into two classes:

* (A) Only one of the gmetads stops updating for an entire cluster, and must be restarted to recover. Since the gmetads disagree, we know the problem is there. [1]
* (B) Both gmetads say an individual host has not reported (gmond aggregation or sending must be at fault). This issue is usually transient (that is, it recovers after some period of time greater than 10 minutes).

While attempting to reproduce (A) we ran several additional gmetad instances (again polling the same gmonds) around 2012-09-07. Failures per day are below [2]. The act of testing seems to have significantly increased the number of failures. This led us to consider whether the act of polling a gmond aggregator could impact its ability to concurrently collect metrics. We looked at the code but are not experienced with concurrent programming in C. Could someone with more familiarity with the gmond code comment on whether this is likely to be a worthwhile avenue of investigation? We are also looking for suggestions for an empirical test to rule this out. (Of course, other comments on the root "TN goes up, metrics stop updating" sporadic problem are also welcome!)

Thank you,
Chris Burroughs

[1] https://github.com/ganglia/monitor-core/issues/47
[2] Failures per day (YYMMDD, count):
120827  89
120828   6
120829   3
120830   4
120831   5
120901   1
120902   6
120903   2
120904   9
120905   4
120906  70
120907 523
120908  85
120909   4
120910   6
120911   2
120912   5
120913   5
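Regarding Nick's advice on sizing the UDP buffer, one practical detail not mentioned in his message: on Linux the kernel silently caps a requested socket receive buffer at net.core.rmem_max, so - assuming gmond's "buffer" option sets SO_RCVBUF on the socket, which is my reading of the feature - values above that limit will not take effect until the sysctl is raised:

# Raise the kernel cap so a large udp_recv_channel buffer can take effect.
$ sysctl -w net.core.rmem_max=8388608

# Then watch for drops: if this counter keeps climbing, keep increasing
# the buffer (and rmem_max) until it stops.
$ netstat -su | grep 'packet receive errors'
    0 packet receive errors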
Re: [Ganglia-general] Java/JMX plugin for Ganglia 3.1.x
Hi Daniel,

JMXetric is one of the options I am considering. The other is JMXtrans. Both are now using gmetric4j.

- JMXetric has the advantage that I can instrument the Tomcat directly and send to the local gmond, without any spoofing. The disadvantage is that it changes the application and needs a lot of testing for production use.
- JMXtrans has the advantage that it is external to the application. The beauty is that one *could* have a central JMX aggregator which would spoof the data to the aggregating gmonds. Unfortunately there seems to be a problem with spoofing, gmetric4j and the 3.1 wire format. It seems this is just not supported. Alternatively one could of course run local JMXtrans instances on every Tomcat host. Not that nice...

Which brings me back to my question on the developers list: what is the story of gmetric4j vs. spoofing?

Cheers
Martin
--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de

From: Daniel Pocock dan...@pocock.com.au
To: ganglia-general@lists.sourceforge.net
Sent: Sunday, September 16, 2012 8:51 PM
Subject: Re: [Ganglia-general] Java/JMX plugin for Ganglia 3.1.x

Have you looked at JMXetric? The latest code is in the main community github now: https://github.com/ganglia/jmxetric

It originated here: http://code.google.com/p/jmxetric/ but I have recently split out the JMX stuff, so that non-JMX users can just use it as gmetric4j. So for JMX, you use gmetric4j + jmxetric together.

On 16/09/12 15:02, Martin Knoblauch wrote:

Hi Peter,

thanks. Unfortunately, due to the situation at the customer site I am bound to 3.1.x. But I will remember this.

Cheers
Martin

From: Peter Phaal peter.ph...@gmail.com
To: Martin Knoblauch kn...@knobisoft.de
Cc: ganglia general ganglia-general@lists.sourceforge.net
Sent: Saturday, September 15, 2012 12:57 AM
Subject: Re: [Ganglia-general] Java/JMX plugin for Ganglia 3.1.x

Martin,

If you can upgrade to the latest Ganglia release you could use sFlow to monitor your Tomcat servers: the jmx-sflow-agent exports standard JVM metrics, and the tomcat-sflow-valve can export the JVM metrics as well as HTTP counters and transactions. http://host-sflow.sourceforge.net/relatedlinks.php

Cheers,
Peter

On Thu, Sep 13, 2012 at 5:43 AM, Martin Knoblauch kn...@knobisoft.de wrote:

Hi,

as part of a larger Tomcat deployment I need to monitor several Tomcat instances and want to add the measured data to a Ganglia setup. I already found JMXtrans, which seems a cool solution, but it uses host spoofing and I am not sure it is what I really want. Needs some real investigating.

What I would love to have would be a gmond plugin that can just add the measured metrics to the system metrics. Has anybody already done such a plugin, or is working on it? I could provide testing, feedback and maybe help.

Cheers
Martin
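For Martin's first option - instrumenting the JVM directly and sending to the local gmond - here is a minimal sketch of what that looks like with plain gmetric4j. This is not from the thread, and the GMetric constructor and announce() signature are written from memory of the gmetric4j API, so verify them against the version you build with:

import info.ganglia.gmetric4j.gmetric.GMetric;
import info.ganglia.gmetric4j.gmetric.GMetric.UDPAddressingMode;
import info.ganglia.gmetric4j.gmetric.GMetricSlope;
import info.ganglia.gmetric4j.gmetric.GMetricType;

import java.lang.management.ManagementFactory;

public class HeapToGmond {
    public static void main(String[] args) throws Exception {
        // Unicast to the gmond on this host; the final "true" selects
        // the 3.1.x wire format discussed in this thread.
        GMetric gm = new GMetric("localhost", 8649, UDPAddressingMode.UNICAST, 1, true);
        while (true) {
            long used = ManagementFactory.getMemoryMXBean()
                                         .getHeapMemoryUsage().getUsed();
            // tmax=60: a fresh value is expected at least every 60s;
            // dmax=0: the metric never expires.
            gm.announce("jvm_heap_used", String.valueOf(used), GMetricType.DOUBLE,
                        "bytes", GMetricSlope.BOTH, 60, 0, "jvm");
            Thread.sleep(10000);
        }
    }
}

Because the packets originate from the monitored host itself, no spoofing is involved - which is exactly the trade-off Martin describes: no spoofing headaches, but the application (or its JVM command line) has to change.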
[Ganglia-general] gmetad collecting from other gmetad's
I've done lots of googling and seen documentation like this that speaks to how to do it:

  Assuming "gmetad1" is the one with "Webserver + Frontend" and "gmetad2" is the other one with IP 192.168.100.100, gmetad1's gmetad.conf should look like:

    data_source "gmetad2" 192.168.100.100:8651

  Make sure gmetad2's gmetad.conf has "gmetad1" in "trusted_hosts", otherwise "gmetad1" won't be able to poll "gmetad2". That's all you should need to configure federation. Also make sure you take care of any firewall rules that might block communications between the hosts.

But I am not able to get this to work myself. Is there any "official" documentation on how to get this to work that someone can point me to?

It seems like my configuration is set up correctly. Here's why I say that. I've got two Ganglia "grids" - gmetads running on two systems. Each one monitors multiple "clusters", and each cluster group (on average 6 systems) runs on a different port. Let's call my grids primary and secondary. The primary grid is where I want to display (in the Ganglia web page) the secondary grid. My assumption is that when I open the primary grid's web page I should see two grids listed, and be able to drill down into either one.

So on my secondary grid, I tweaked the gmetad.conf file in this way:

# The port gmetad will answer requests for XML
# default: 8651
###xml_port 8651
xml_port 8655

and I added the IP of my primary grid to the 'trusted_hosts' line. Then on my primary grid's gmetad.conf file I did this:

data_source "secondary" secondary.ip.address:8655

I then restarted the gmetad daemon on both the primary and secondary grids. From my primary grid's host I can even do a 'telnet secondary.ip.address 8655' and have it come back with all the cluster/host information for each host in the secondary grid, including all the XML metrics. So it *sure* seems like my secondary grid is able to communicate with my primary grid.

The question is what am I missing, because when I view the web interface for my primary grid, I still only see the primary grid's cluster information. Nothing's changed. I'm also not seeing anything logged in my /var/log/messages file that would indicate any errors.

I can't believe there isn't somebody out there who's successfully done this. I would also love to see a public Ganglia web page showing an example of this working :)

Thanks in advance,
-chris
--
Chris Jones
SSAI - ASDC Senior Systems Administrator
Note to self: Insert cool signature here.
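For reference, the two halves of the setup described above, side by side - a sketch assembled from the snippets in this post, with placeholder addresses:

# secondary grid's gmetad.conf
# Answer XML requests on the non-default port, and allow the primary to poll us.
xml_port 8655
trusted_hosts primary.ip.address

# primary grid's gmetad.conf
# Poll the secondary gmetad like any other data source.
data_source "secondary" secondary.ip.address:8655

One way to sanity-check the data you already pull via telnet: XML served by a gmetad should wrap its clusters in a <GRID> element, whereas a plain gmond source emits only <CLUSTER>. If the <GRID> tag is present in the dump from port 8655, the data path is working and the problem is more likely in the web front-end's handling of nested grids.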
Re: [Ganglia-general] Impact of gmond polling on data collection
Nicholas,

It makes sense to multi-thread gmond, but looking at your patch, I don't see any locking associated with the hosts hashtable. Isn't there a possible race if new hosts/metrics are added to the hashtable by the UDP thread at the same time the hashtable is being walked by the TCP thread?

Peter

On Mon, Sep 17, 2012 at 6:03 AM, Nicholas Satterly nfsatte...@gmail.com wrote:

[quoted message snipped - it appears in full earlier in this digest]
[Ganglia-general] gmetad collecting from other gmetad's (take 2)
(reading the digest, it looks like my post had HTML in it... didn't mean for that. Hopefully this one reads better)

[duplicate of the message above - snipped]