Bryan,

Since you want each of the nodes in the cluster to have access to the state of 
its peers, implementing a full gmond-equivalent peer sounds like the right 
call. However, I think you might want to consider adding sFlow export 
functionality as well. It helps to have a clear understanding of the goals and 
architectural choices behind sFlow.

The sFlow architecture is asymmetric with agents sending but never receiving 
data. Once you have made that choice, you can further simplify the agent by 
making it stateless - for example, you will see that sFlow exports raw counters 
and leaves it up to the receiver to compute deltas. With gmond the deltas are 
computed at the sender, requiring it to maintain state (which gmond is doing 
anyway when it receives metrics, so it isn't an unreasonable choice). Removing 
all state from the agent means that its memory requirements are minimal and it 
doesn't need to allocate memory - both properties are very useful when you want 
to embed the measurements in hardware devices like network switches.
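As a rough sketch of what "deltas at the receiver" means in practice (the class 
and method names below are made up for illustration, not part of any sFlow 
library), the receiver only has to remember the previous raw value per metric 
and divide by the elapsed time:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    class CounterReceiver {
        // last raw value and timestamp (ms) seen for each host/metric key
        private final Map<String, long[]> last = new ConcurrentHashMap<>();

        /** Returns a per-second rate, or -1 on the first sample or a counter reset. */
        double update(String key, long rawValue, long timestampMillis) {
            long[] prev = last.put(key, new long[] { rawValue, timestampMillis });
            if (prev == null || rawValue < prev[0] || timestampMillis <= prev[1]) {
                return -1.0; // no history yet, counter wrapped/reset, or clock went backwards
            }
            return (rawValue - prev[0]) * 1000.0 / (timestampMillis - prev[1]);
        }
    }

All of the bookkeeping lives at the receiver; the agent just reads and sends 
the raw counter.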

As you point out, another difference is that sFlow exports standard sets of 
metrics rather than ad-hoc measurements. The benefit is that you can focus on 
optimizing the collection of the standard metrics (even implementing some 
in hardware), tightly pack the data in a single datagram, eliminate the 
overhead of exchanging metadata and simplify multivendor monitoring since the 
same measurements will be sent by every device. Standardizing the metrics also 
helps reduce operational complexity - eliminating the configuration options 
that are needed for a more flexible solution.
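To make the packing point concrete, here is a toy Java illustration (the field 
layout is invented for this example and is not the actual sFlow XDR encoding): 
because both ends agree on a fixed structure, only the values go on the wire, 
with no field names or metadata:

    import java.nio.ByteBuffer;

    class FixedMetricsEncoder {
        // Hypothetical fixed record: cpu_user, cpu_system, mem_free, mem_total
        static byte[] encode(long cpuUser, long cpuSystem, long memFree, long memTotal) {
            ByteBuffer buf = ByteBuffer.allocate(4 * 8);  // four 64-bit counters
            buf.putLong(cpuUser).putLong(cpuSystem)
               .putLong(memFree).putLong(memTotal);
            return buf.array();                           // 32 bytes, no names, no metadata
        }
    }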

A goal with sFlow is to instrument every switch port, server, virtual machine 
and service to provide a comprehensive view of performance across the data 
center. I think there would be great value in having bigdata export metrics so 
that they can be combined with data from network, load-balancer, web, memcache 
and application server tiers. 

It's also worth mentioning that sFlow doesn't just export counters. As an 
example, the sFlow Memcache metrics are probably most similar to the kinds of 
data you might want to export for bigdata. In addition to exporting a standard 
set of counters, the sFlow agent also randomly samples Memcache operations, 
exporting the command (GET, SET, ...), status (OK, ERROR, NOT_FOUND, ...), 
value size, and duration of the sampled operation. Random sampling has very 
low overhead 
(about the cost of maintaining one counter) making it suitable for continuous 
monitoring of high transaction rate environments like a large Memcached 
cluster. The counters and the transaction samples complement one another. For 
example, you might be using Ganglia to track the cache hit rate using the sFlow 
counters and notice an increase in cache misses. Looking at the transaction 
samples you can identify the cluster-wide top missed keys - the information you 
need to actually fix the problem. In one case I am aware of, the misses were 
caused by a typo in a client-side script and were easily fixed - it's hard to 
see how you would have spotted the problem any other way.
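If it helps to see the sampling idea in code, here is a minimal Java sketch of 
the "cost of one counter" approach (class and field names are hypothetical, 
not taken from any sFlow agent): every operation increments a raw counter, and 
only when a skip counter reaches zero are the details of that one operation 
recorded:

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.atomic.AtomicLong;

    class TransactionSampler {
        private final int samplingRate;                     // e.g. 1000 for 1-in-1000
        private final AtomicLong skip;                      // operations left until next sample
        private final AtomicLong total = new AtomicLong();  // raw counter, exported as-is

        TransactionSampler(int samplingRate) {
            this.samplingRate = samplingRate;
            this.skip = new AtomicLong(nextSkip());
        }

        // Randomized skip with mean samplingRate, so periodic traffic isn't missed
        private long nextSkip() {
            return 1 + ThreadLocalRandom.current().nextLong(2L * samplingRate);
        }

        /** Call for every operation; returns true if this one should be recorded. */
        boolean sample() {
            total.incrementAndGet();
            if (skip.decrementAndGet() <= 0) {
                skip.set(nextSkip());  // a real agent would handle this reset race more carefully
                return true;           // caller records command, status, value size, duration
            }
            return false;
        }
    }

The hot path is one increment and one decrement per operation, which is why 
the technique scales to very high transaction rates.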

In the web tier, sFlow agents sample HTTP operations, so you might notice an 
increase in response time for a particular URL and trace it back to the missed 
key in the cache, for example.

Getting back to bigdata - you could usefully export the JVM metrics using 
sFlow - take a look at the jmx-sflow-agent or tomcat-sflow-valve for examples:
http://jmx-sflow-agent.googlecode.com/
http://tomcat-sflow-valve.googlecode.com/

There isn't much to the code, so you could easily incorporate it as an option 
in your Java library.
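For a sense of the kind of JVM data involved, the standard 
java.lang.management MXBeans already expose most of it; a minimal sketch (just 
illustrative - the actual jmx-sflow-agent defines its own sFlow counter 
structures) might look like:

    import java.lang.management.ClassLoadingMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.ThreadMXBean;

    class JvmMetricsSnapshot {
        public static void main(String[] args) {
            MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();

            // The sort of values a JVM sFlow agent would pack into its counter records
            System.out.println("heap used      = " + mem.getHeapMemoryUsage().getUsed());
            System.out.println("heap committed = " + mem.getHeapMemoryUsage().getCommitted());
            System.out.println("live threads   = " + threads.getThreadCount());
            System.out.println("loaded classes = " + classes.getLoadedClassCount());
        }
    }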

There is currently an effort underway to generalize sFlow's application layer 
monitoring:

https://groups.google.com/forum/?fromgroups#!topic/sflow/e2sLb_3hyDI

I would be very interested in any comments you might have about the 
applicability to instrumenting bigdata transactions.

Cheers,
Peter

On Feb 3, 2012, at 10:19 AM, Bryan Thompson wrote:

> Peter,
> 
> I put together a ganglia listener / sending library in Java [1] which builds 
> up soft state in a concurrent hash map to support a ganglia integration for 
> bigdata [2].  The library makes it easy to turn a Java application into a 
> ganglia peer.  I also plan to migrate some of our existing per-host, 
> per-process, and JVM specific counters that we have into this library where 
> they might be useful to a broader audience.
> 
> Some of the benefits of this library for us are that we can:
> - leverage the existing ganglia ecosystem;
> - obtain fast load balanced reports from the soft state inside of the JVM; and
> - extend the metric collection and reporting trivially to application 
> specific counters.
> 
> I understand that sFlow is available for a variety of environments and that 
> it provides a tighter, though fixed, datagram encoding for metric messages.  
> Can you expand on whether sFlow might have been an alternative for the 
> integration that we did, and if so, why I might want to use sFlow instead?  I am 
> just trying to get a better sense of where sFlow fits in the ganglia and Java 
> ecosystem.
> 
> Thanks,
> Bryan
> 
> [1] http://www.bigdata.com/bigdata/blog/?p=359
> [2] https://sourceforge.net/projects/bigdata/

