[Ganglia-general] Views: self-defined cluster reports in JSON in views

2012-09-17 Thread Jochen Hein

Hello,

We run ganglia on a couple of AIX systems. Each physical box is a cluster
in ganglia, and each LPAR is a host. Some LPARs are in CPU pools; each
physical box can have zero, one, or more CPU pools.

A JSON report for a pool looks like this:

{
  "report_name": "box1_pool_0_report",
  "report_type": "standard",
  "title": "pool 0 report",
  "vertical_label": "CPU Uses",
  "series": [
    { "hostname": "lpar1", "clustername": "box1", "metric": "cpu_used",
      "color": "00ff00", "label": "lpar1", "type": "stack" },
    { "hostname": "lpar2", "clustername": "box1", "metric": "cpu_used",
      "color": "ff", "label": "lpar2", "type": "stack" },
    { "hostname": "lpar1", "clustername": "box1", "metric": "cpu_in_pool",
      "color": "00", "label": "CPU in Pool", "line_width": 2, "type": "line" }
  ]
}

When displayed in the cluster context, the title includes the cluster name.

Since we have lots of boxes and lots of LPARs, I tried to add all the pool
reports to a view:

{
  "view_name": "Pools",
  "items": [
    { "graph": "box1_pool_0_report" },
    { "graph": "box1_pool_1_report" },
    { "graph": "box2_pool_0_report" },
    { "graph": "box2_pool_1_report" },
    ...
  ],
  "view_type": "standard"
}

When I display this view, the title only shows the grid and the "pool <nr>
report" title from the report definition. It would be nice to have the
cluster name displayed in the view, but I don't want it shown twice in the
cluster context. Any idea how this can be solved?

Thanks in advance
Jochen








Re: [Ganglia-general] Impact of gmond polling on data collection

2012-09-17 Thread Nicholas Satterly
Hi Chris,

I've discovered there are two contributing factors to problems like this.

1. The number of metrics being sent (possibly in short bursts) can overflow
the UDP receive buffer.
2. The time it takes to process metrics in the UDP receive buffer causes
TCP connections from the gmetads to time out (currently hard-coded to 10
seconds).

In your case, you are probably dropping UDP packets because gmond can't
keep up. Gmond was enhanced back in April to allow you to increase the UDP
buffer size. I suggest you upgrade to the latest version and set this to a
sensible value for your environment.

udp_recv_channel {
  port = 1234
  buffer = 1024000
}

Determining what is sensible is a bit of trial and error. Run "netstat -su"
and keep increasing the value until you no longer see the number of packet
receive errors going up.

$ netstat -su
Udp:
7941393 packets received
23 packets to unknown port received.
0 packet receive errors
10079118 packets sent
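
One caveat (a rough note, and the exact figure is up to you): on Linux the
kernel silently caps socket receive buffers at net.core.rmem_max, so asking
for a buffer bigger than that cap will not actually get you one. Raising the
cap looks something like:

$ sysctl -w net.core.rmem_max=10485760

and gmond then needs a restart so the larger udp_recv_channel buffer really
takes effect.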

The other possibility is that it takes so long for a gmetad to pull back
all the metrics you are collecting for a cluster that you are preventing
the gmond from processing metric data received via UDP. Again this can
cause the UDP receive buffer to overflow.

The problem we had at my work is related to all of the above but manifested
itself in a slightly different way. We were seeing gaps in all our graphs
because at times none of the servers in a cluster would respond to a gmetad
poll within 10 seconds. I used to think that the gmond was completely hung,
but realised that they would respond normally most of the time but every
minute or so it would take about 20-25 seconds. This happened to coincide
with the UDP receive queue growing (Recv-Q column below) and I realised
that it took this long for the gmond to process the metric data it had
received via UDP from all the other servers in the cluster.

$ netstat -ua
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
udp   1920032      0 *:8649                  *:*
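
In case it is useful, a quick way to watch that queue in near real time
(assuming GNU watch is installed and gmond is on the default 8649 port) is
something like:

$ watch -n 1 'netstat -uan | grep 8649'

which is how we spotted the 20-25 second stalls lining up with the Recv-Q
growth.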


The solution was to modify gmond and move the TCP request handler into a
separate thread so that gmond could take as long as it needed to process
incoming metric data (from a UDP receive buffer that is large enough not to
overflow) without blocking on the TCP requests for the XML data.

The patched gmond is running without a problem in our environment so I have
submitted a pull request[1] for it to be included in trunk.

I can't be 100% sure that this patch will fix your problem but it would be
worth a try.

Regards,
Nick

[1] https://github.com/ganglia/monitor-core/pull/50

On Sat, Sep 15, 2012 at 12:16 AM, Chris Burroughs chris.burrou...@gmail.com
 wrote:

 We use ganglia to monitor > 500 hosts in multiple datacenters with about
 90k unique host:metric pairs per DC.  We use this data for all of the
 cool graphs in the web UI and for passive alerting.

 One of our checks is to measure TN of load_one on every box (we want to
 make sure gmond is working and correctly updating metrics; otherwise we
 could be blind and not know it).  We consider it a failure if TN is >
 600.  This is an arbitrary number, but 10 minutes seemed plenty long.

 Unfortunately we are seeing this check fail far too often.  We set up
 two parallel gmetad instances (monitoring identical gmonds) per DC and
 have broken our problem into two classes:
  * (A) only one of the gmetads stops updating for an entire cluster, and
 must be restarted to recover.  Since the gmetads disagree we know the
 problem is there. [1]
  * (B) Both gmetad's say an individual host has not reported (gmond
 aggregation or sending must be at fault).  This issue is usually
 transient (that is it recovers after some period of time greater than 10
 minutes).

 While attempting to reproduce (A) we ran several additional gmetad
 instances (again polling the same gmonds) around 2012-09-07.  Failures
 per day are below [2].  The act of testing seems to have significantly
 increased the number of failures.

 This led us to consider whether the act of polling a gmond aggregator could
 impact its ability to concurrently collect metrics.  We looked at the code
 but are not experienced with concurrent programming in C.  Could someone
 with more familiarity with the gmond code comment on whether this is likely
 to be a worthwhile avenue of investigation?  We are also looking for
 suggestions for an empirical test to rule this out.

 (Of course, other comments on the root "TN goes up, metrics stop
 updating" sporadic problem are also welcome!)

 Thank you,
 Chris Burroughs


 [1] https://github.com/ganglia/monitor-core/issues/47

 [2]
 120827  89
 120828  6
 120829  3
 120830  4
 120831  5
 120901  1
 120902  6
 120903  2
 120904  9
 120905  4
 120906  70
 120907  523
 120908  85
 120909  4
 120910  6
 120911  2
 120912  5
 120913  5



Re: [Ganglia-general] Java/JMX plugin for Ganglia 3.1.x

2012-09-17 Thread Martin Knoblauch
Hi Daniel,

 JMXetric is one of the options I am considering. The other is JMXtrans. Both
now use gmetric4j.

- JMXetric has the advantage that I can instrument the Tomcat directly and send
to the local gmond, without any spoofing. The disadvantage is that it changes
the application and needs a lot of testing for production use.
- JMXtrans has the advantage that it is external to the application. The beauty
is that one *could* have a central JMX aggregator which would spoof the data to
the aggregating gmonds. Unfortunately there seems to be a problem with spoofing,
gmetric4j and the 3.1 wire format. It seems this is just not supported.
Alternatively one could of course run local JMXtrans instances on every Tomcat
host. Not that nice ...

 That brings me back to my question on the developers list: what is the story
with gmetric4j and spoofing?
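
For comparison, spoofing with the command-line gmetric looks roughly like
this (metric name, value and the IP:hostname pair are made up):

gmetric -n jvm_heap_used -v 123456789 -t uint32 -u bytes -S 10.1.2.3:tomcat01

so the real question is whether gmetric4j can emit the same kind of spoofed
packets over the 3.1 wire format.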

Cheers

Martin 

--
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de



 From: Daniel Pocock dan...@pocock.com.au
To: ganglia-general@lists.sourceforge.net 
Sent: Sunday, September 16, 2012 8:51 PM
Subject: Re: [Ganglia-general] Java/JMX plugin for Ganglia 3.1.x
 


Have you looked at JMXetric?

The latest code is in the main community github now

  https://github.com/ganglia/jmxetric

It originated here:

  http://code.google.com/p/jmxetric/

but I have recently split the JMX stuff, so that non-JMX users can just
use it as gmetric4j.  So for JMX, you use gmetric4j + jmxetric together.




On 16/09/12 15:02, Martin Knoblauch wrote:
 Hi Peter,
 
  thanks. Unfortunately, due to the situation at the customer site I am bound
to 3.1.x. But I will remember this.
 
 Cheers
 
 Martin 
 
 --
 Martin Knoblauch
 email: k n o b i AT knobisoft DOT de
 www:  http://www.knobisoft.de
 
 
 
 From: Peter Phaal peter.ph...@gmail.com
 To: Martin Knoblauch kn...@knobisoft.de 
 Cc: ganglia general ganglia-general@lists.sourceforge.net 
 Sent: Saturday, September 15, 2012 12:57 AM
 Subject: Re: [Ganglia-general] Java/JMX plugin for Ganglia 3.1.x

 Martin,

 If you can upgrade to the latest Ganglia release you could use sFlow
 to monitor your Tomcat servers: the jmx-sflow-agent exports standard
 JVM metrics, and the tomcat-sflow-valve can export the JVM metrics as
 well as HTTP counters and transactions.

 http://host-sflow.sourceforge.net/relatedlinks.php

 Cheers,
 Peter

 On Thu, Sep 13, 2012 at 5:43 AM, Martin Knoblauch kn...@knobisoft.de 
 wrote:
 Hi,

   as part of a larger Tomcat deployment I need to monitor several Tomcat
 instances and want to add the measured data to a Ganglia setup. I already
 found JMXtrans, which seems like a cool solution, but it uses host spoofing
 and I am not sure it is what I really want. Needs some real investigating.

   What I would love to have is a gmond plugin that can just add the
 measured metrics to the system metrics. Has anybody already done such a
 plugin or is working on it? I could provide testing, feedback and maybe
 help.

 Cheers
 Martin
 --
 Martin Knoblauch
 email: k n o b i AT knobisoft DOT de
 www: http://www.knobisoft.de












[Ganglia-general] gmetad collecting from other gmetad's

2012-09-17 Thread Chris Jones

  
  

I've done lots of googling and seen documentation like the following that
explains how to do it:

  Assuming "gmetad1" is the one with "Webserver + Frontend" and
  "gmetad2" is the other one with IP 192.168.100.100.  gmetad1's
  gmetad.conf should look like:

  data_source "gmetad2" 192.168.100.100:8651

  Make sure gmetad2's gmetad.conf has "gmetad1" in "trusted_hosts",
  otherwise "gmetad1" won't be able to poll "gmetad2".

  That's all you should need to configure federation.  Also make sure
  you take care of any firewall rules that might block communications
  between the hosts.

But I am not able to get this to work myself.  Is there any "official"
documentation on how to get this to work that can be pointed out to me?
It seems like my configuration is set up right.  Here's why I say that.
I've got two ganglia 'grids' - gmetads running on two systems.  Each one
has multiple 'clusters' that it monitors... and each cluster group (on
average 6 systems...) is running on a different port.  Let's call my
grids primary and secondary.  My primary grid is where I want to display
(in the ganglia web page) the secondary grid.  My assumption then would
be that when I click on the primary grid's web page I'd see two grids
listed... and that I could drill down into either one.

So on my secondary grid, I tweaked the gmetad.conf file in this way:

  # The port gmetad will answer requests for XML
  # default: 8651
  ###xml_port 8651
  xml_port 8655

and I added the IP of my primary grid to the 'trusted_hosts' line.

Then on my primary grid's gmetad.conf file I did this:

  data_source "secondary" secondary.ip.address:8655

I then restarted the gmetad daemon on both the primary and secondary
grids.  From my primary grid's host I can even do a 'telnet
secondary.ip.address 8655' and have it come back with all the
cluster/host information for each host in the secondary grid...
including all the XML metrics.  So it *sure* seems like my primary grid
is able to communicate with my secondary grid.  The question is what am
I missing... because when I view the web interface for my primary grid,
I still only see the primary grid's cluster information.  Nothing's
changed.  I'm also not seeing anything logged in my /var/log/messages
file that would indicate any errors.

I can't believe there isn't somebody out there who's successfully done
this.  I would also love to see a public ganglia web page that could
show an example of this working too :)

Thanks in advance,
-chris


-- 
Chris Jones
SSAI - ASDC Senior Systems Administrator

Note to self: Insert cool signature here.

  




Re: [Ganglia-general] Impact of gmond polling on data collection

2012-09-17 Thread Peter Phaal
Nicholas,

It makes sense to multi-thread gmond, but looking at your patch, I
don't see any locking associated with the hosts hashtable. Isn't there
a possible race if new hosts/metrics are added to the hashtable by the
UDP thread at the same time the hashtable is being walked by the TCP
thread?

Peter

On Mon, Sep 17, 2012 at 6:03 AM, Nicholas Satterly nfsatte...@gmail.com wrote:
 Hi Chris,

 I've discovered there are two contributing factors to problems like this.

 1. The number of metrics being sent (possibly in short bursts) can overflow
 the UDP receive buffer.
 2. The time it takes to process metrics in the UDP receive buffer causes TCP
 connections from the gmetads to time out (currently hard-coded to 10
 seconds).

 In your case, you are probably dropping UDP packets because gmond can't keep
 up. Gmond was enhanced back in April to allow you to increase the UDP buffer
 size. I suggest you upgrade to the latest version and set this to a sensible
 value for your environment.

 udp_recv_channel {
   port = 1234
   buffer = 1024000
 }

 Determining what is sensible is a bit of trial and error. Run "netstat -su"
 and keep increasing the value until you no longer see the number of packet
 receive errors going up.

 $ netstat -su
 Udp:
 7941393 packets received
 23 packets to unknown port received.
 0 packet receive errors
 10079118 packets sent

 The other possibility is that it takes so long for a gmetad to pull back all
 the metrics you are collecting for a cluster that you are preventing the
 gmond from processing metric data received via UDP. Again this can cause the
 UDP receive buffer to overflow.

 The problem we had at my work is related to all of the above but manifested
 itself in a slightly different way. We were seeing gaps in all our graphs
 because at times none of the servers in a cluster would respond to a gmetad
 poll within 10 seconds. I used to think that the gmond was completely hung,
 but realised that they would respond normally most of the time but every
 minute or so it would take about 20-25 seconds. This happened to coincide
 with the UDP receive queue growing (Recv-Q column below) and I realised
 that it took this long for the gmond to process the metric data it had
 received via UDP from all the other servers in the cluster.

 $ netstat -ua
 Active Internet connections (servers and established)
 Proto Recv-Q Send-Q Local Address           Foreign Address         State
 udp   1920032      0 *:8649                  *:*

 The solution was to modify gmond and move the TCP request handler into a
 separate thread so that gmond could take as long as it needed to process
 incoming metric data (from a UDP receive buffer that is large enough not to
 overflow) without blocking on the TCP requests for the XML data.

 The patched gmond is running without a problem in our environment so I have
 submitted a pull request[1] for it to be included in trunk.

 I can't be 100% sure that this patch will fix your problem but it would be
 worth a try.

 Regards,
 Nick

 [1] https://github.com/ganglia/monitor-core/pull/50


 On Sat, Sep 15, 2012 at 12:16 AM, Chris Burroughs
 chris.burrou...@gmail.com wrote:

 We use ganglia to monitor > 500 hosts in multiple datacenters with about
 90k unique host:metric pairs per DC.  We use this data for all of the
 cool graphs in the web UI and for passive alerting.

 One of our checks is to measure TN of load_one on every box (we want to
 make sure gmond is working and correctly updating metrics; otherwise we
 could be blind and not know it).  We consider it a failure if TN is >
 600.  This is an arbitrary number, but 10 minutes seemed plenty long.

 Unfortunately we are seeing this check fail far too often.  We set up
 two parallel gmetad instances (monitoring identical gmonds) per DC and
 have broken our problem into two classes:
  * (A) only one of the gmetads stops updating for an entire cluster, and
 must be restarted to recover.  Since the gmetads disagree we know the
 problem is there. [1]
  * (B) Both gmetad's say an individual host has not reported (gmond
 aggregation or sending must be at fault).  This issue is usually
 transient (that is it recovers after some period of time greater than 10
 minutes).

 While attempting to reproduce (A) we ran several additional gmetad
 instances (again polling the same gmonds) around 2012-09-07.  Failures
 per day are below [2].  The act of testing seems to have significantly
 increased the number of failures.

 This led us to consider whether the act of polling a gmond aggregator could
 impact its ability to concurrently collect metrics.  We looked at the code
 but are not experienced with concurrent programming in C.  Could someone
 with more familiarity with the gmond code comment on whether this is likely
 to be a worthwhile avenue of investigation?  We are also looking for
 suggestions for an empirical test to rule this out.

 (Of course, other comments on the root "TN goes up, metrics stop
 updating" sporadic problem are also welcome!)

 

[Ganglia-general] gmetad collecting from other gmetad's (take 2)

2012-09-17 Thread Chris Jones

(Reading the digest, it looks like my post had HTML in it... I didn't mean
for that.  Hopefully this one will read better.)

I've done lots of googling and seen documentation like the following that
explains how to do it:

  Assuming "gmetad1" is the one with "Webserver + Frontend" and
  "gmetad2" is the other one with IP 192.168.100.100.  gmetad1's
  gmetad.conf should look like:

  data_source "gmetad2" 192.168.100.100:8651

  Make sure gmetad2's gmetad.conf has "gmetad1" in "trusted_hosts",
  otherwise "gmetad1" won't be able to poll "gmetad2".

  That's all you should need to configure federation.  Also make sure
  you take care of any firewall rules that might block communications
  between the hosts.

But I am not able to get this to work myself.  Is there any "official"
documentation on how to get this to work that can be pointed out to me?
It seems like my configuration is set up right.  Here's why I say that.
I've got two ganglia 'grids' - gmetads running on two systems.  Each one
has multiple 'clusters' that it monitors... and each cluster group (on
average 6 systems...) is running on a different port.  Let's call my
grids primary and secondary.  My primary grid is where I want to display
(in the ganglia web page) the secondary grid.  My assumption then would
be that when I click on the primary grid's web page I'd see two grids
listed... and that I could drill down into either one.

So on my secondary grid, I tweaked the gmetad.conf file in this way:

# The port gmetad will answer requests for XML
# default: 8651
  ###xml_port 8651
  xml_port 8655

and I added the IP of my primary grid to the 'trusted_hosts' line.

Then on my primary grid's gmetad.conf file I did this:

data_source "secondary" secondary.ip.address:8655

I then restarted the gmetad daemon on both the primary and secondary
grids.  From my primary grid's host I can even do a 'telnet
secondary.ip.address 8655' and have it come back with all the
cluster/host information for each host in the secondary grid...
including all the XML metrics.  So it *sure* seems like my primary grid
is able to communicate with my secondary grid.  The question is what am
I missing... because when I view the web interface for my primary grid,
I still only see the primary grid's cluster information.  Nothing's
changed.  I'm also not seeing anything logged in my /var/log/messages
file that would indicate any errors.
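
(For reference, the nc equivalent of that telnet test, assuming nc is
available - the grep just picks out the wrapper tags - would be something
like:

$ nc secondary.ip.address 8655 < /dev/null | grep -E '<(GRID|CLUSTER) '

If a GRID element for the secondary shows up there but never in the web
frontend, then the XML is making it across and the problem is on the
gmetad/frontend side rather than the network.)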

I can't believe there isn't somebody out there who's successfully done 
this.  I would also love to see a public ganglia web page that could 
show an example of this working too :)

Thanks in advance,
-chris

-- 
Chris Jones
SSAI - ASDC Senior Systems Administrator

Note to self: Insert cool signature here.

