Bryan,

Since you want each of the nodes in the cluster to have access to the state of its peers, implementing a full gmond-equivalent peer sounds like the right call. However, I think you might also want to consider adding sFlow export functionality. It's helpful to have a clear understanding of the goals and architectural choices behind sFlow.
The sFlow architecture is asymmetric, with agents sending but never receiving data. Once you have made that choice, you can further simplify the agent by making it stateless - for example, you will see that sFlow exports raw counters and leaves it up to the receiver to compute deltas. With gmond the deltas are computed at the sender, requiring it to maintain state (which gmond is doing anyway when it receives metrics, so it isn't an unreasonable choice). Removing all state from the agent means that its memory requirements are minimal and it doesn't need to allocate memory - both properties are very useful when you want to embed the measurements in hardware devices like network switches.

As you point out, another difference is that sFlow exports standard sets of metrics rather than ad-hoc measurements. The benefit is that you can focus on optimizing the collection of the standard metrics (even implementing some in hardware), tightly pack the data in a single datagram, eliminate the overhead of exchanging metadata, and simplify multi-vendor monitoring since the same measurements will be sent by every device. Standardizing the metrics also helps reduce operational complexity - eliminating the configuration options that are needed for a more flexible solution.

A goal with sFlow is to instrument every switch port, server, virtual machine, and service to provide a comprehensive view of performance across the data center. I think there would be great value in having bigdata export metrics so that they can be combined with data from network, load-balancer, web, memcache, and application server tiers. It's also worth mentioning that sFlow doesn't just export counters. As an example, the sFlow Memcache metrics are probably most similar to the kinds of data you might want to export for bigdata.
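To make the stateless-agent point concrete, here is a minimal sketch of what receiver-side delta computation from raw counters looks like. This is hypothetical illustration code, not part of any sFlow library or the bigdata integration - the class name and key scheme are made up - but it shows why the agent can stay stateless: all the per-metric memory lives at the receiver.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical receiver: the agent just sends raw, monotonically
// increasing counter values; the receiver remembers the previous
// sample per (host, metric) key and computes the delta itself.
public class CounterReceiver {

    // Previous raw counter value, keyed by "host/metric".
    private final Map<String, Long> last = new HashMap<>();

    /**
     * Record a new raw counter sample and return the delta since the
     * previous sample, or -1 if this is the first sample for the key.
     * A decrease is treated as an agent restart / counter reset, in
     * which case the raw value itself is the best available delta.
     */
    public long delta(String key, long raw) {
        Long prev = last.put(key, raw);   // put() returns the old value
        if (prev == null) return -1;      // first observation: no delta yet
        if (raw < prev) return raw;       // counter reset detected
        return raw - prev;
    }

    public static void main(String[] args) {
        CounterReceiver rx = new CounterReceiver();
        System.out.println(rx.delta("web1/cache_misses", 1000)); // -1 (first sample)
        System.out.println(rx.delta("web1/cache_misses", 1250)); // 250
        System.out.println(rx.delta("web1/cache_misses", 40));   // 40 (reset)
    }
}
```

Note that the agent side of this design needs no map at all - it reads the counter and sends it, which is what makes a hardware implementation practical.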
In addition to exporting a standard set of counters, the sFlow agent also randomly samples Memcache operations, exporting the command (GET, SET, ...), status (OK, ERROR, NOT_FOUND, ...), value size, and duration of each sampled operation. Random sampling has very low overhead (about the cost of maintaining one counter), making it suitable for continuous monitoring of high-transaction-rate environments like a large Memcached cluster.

The counters and the transaction samples complement one another. For example, you might be using Ganglia to track the cache hit rate using the sFlow counters and notice an increase in cache misses. Looking at the transaction samples, you can identify the cluster-wide top missed keys - the information you need to actually fix the problem. In one case I am aware of, the misses were caused by a typo in a client-side script and were easily fixed - it's hard to see how you would spot this problem any other way. In the web tier, sFlow agents sample HTTP operations, so you might notice an increase in response time for a particular URL and trace it back to a missed key in the cache, for example.

Getting back to bigdata - you could usefully export the JVM metrics using sFlow. Take a look at the jmx-sflow-agent or tomcat-sflow-valve projects for examples:

http://jmx-sflow-agent.googlecode.com/
http://tomcat-sflow-valve.googlecode.com/

There isn't much to the code, so you could easily incorporate it as an option in your Java library.

There is currently an effort underway to generalize sFlow's application layer monitoring:

https://groups.google.com/forum/?fromgroups#!topic/sflow/e2sLb_3hyDI

I would be very interested in any comments you might have about its applicability to instrumenting bigdata transactions.

Cheers,
Peter

On Feb 3, 2012, at 10:19 AM, Bryan Thompson wrote:

> Peter,
>
> I put together a ganglia listener / sending library in Java [1] which builds
> up soft state in a concurrent hash map to support a ganglia integration for
> bigdata [2].
> The library makes it easy to turn a Java application into a
> ganglia peer. I also plan to migrate some of our existing per-host,
> per-process, and JVM-specific counters into this library, where
> they might be useful to a broader audience.
>
> Some of the benefits of this library for us are that we can:
> - leverage the existing ganglia ecosystem;
> - obtain fast load-balanced reports from the soft state inside of the JVM; and
> - extend the metric collection and reporting trivially to
> application-specific counters.
>
> I understand that sFlow is available for a variety of environments and that
> it provides a tighter, though fixed, datagram encoding for metric messages.
> Can you expand on whether sFlow might have been an alternative for the
> integration that we did, and if so, why I might want to use sFlow instead? I am
> just trying to get a better sense of where sFlow fits in the ganglia and Java
> ecosystem.
>
> Thanks,
> Bryan
>
> [1] http://www.bigdata.com/bigdata/blog/?p=359
> [2] https://sourceforge.net/projects/bigdata/

_______________________________________________
Ganglia-general mailing list
Ganglia-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ganglia-general