Thank you Devon and Vladimir for starting this thread.  We (AddThis) 
have been struggling with gmetad performance and stability for a while 
and I'm personally excited to see the focus here.  I'll briefly explain 
how we are using ganglia for context and then comment inline.

We have two data centers, each with over a dozen clusters.  Each of 
these clusters is polled by one gmetad that does not store any rrds (it 
is only used for alerting) and by another that writes out rrds and 
satisfies queries from gweb.  There is also a "grid" gmetad that polls 
DC1's and DC2's gmetads and writes out a copy of the RRDs for all data 
centers at somewhat lower granularity.  This is used instead of 
federation so that when a DC goes down we still have access to its 
metrics and can figure out what happened.  I believe this is called 
"non-scalable" mode or something like that.

Each "grid" gmetad is also forwarding all of it's metrics to a graphite 
instance.  Between the two DCs we currently have about 620k metrics. 
Our primary problems are around gmetad hanging and no longer polling a 
cluster, and gaps in graphs (particularly by the time the metrics get to 
graphite).  Since the gaps show up on large monitors with dashboards 
they cause a lot of complaints.
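
To make that concrete, the grid gmetad's config is roughly this shape 
(hostnames, ports, and intervals below are illustrative rather than our 
real values, and I'm going from memory on the carbon option names):

    # grid gmetad: poll each DC's gmetad, keep full per-host rrds, forward to graphite
    data_source "dc1" 60 gmetad-dc1.example.com:8651
    data_source "dc2" 60 gmetad-dc2.example.com:8651
    scalability off
    rrd_rootdir "/var/lib/ganglia/rrds"
    carbon_server "graphite.example.com"
    carbon_port 2003
    graphite_prefix "ganglia"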

On 12/06/2013 03:36 PM, Vladimir Vuksan wrote:
> The Ganglia core is comprised of two daemons, `gmond` and `gmetad`. `Gmond` is
> primarily responsible for sending and receiving metrics; `gmetad` carries the
> hefty task of summarizing / aggregating the information, writing the metrics
> information to graphing utilities (such as RRD), and reporting summary metrics
> to the web front-end. Due to growth some metrics were never updating, and
> front-end response time was abysmal. These issues are tied directly to 
> `gmetad`.
>

Were these failures totally random, or grouped in some way (same 
cluster, type, etc.)?


> In our Ganglia setup, we run a `gmond` to collect data for every machine and
> several `gmetad` processes:
>
>    * An interactive `gmetad` process is responsible solely for reporting
> summary statistics to the web interface.
>    * Another `gmetad` process is responsible for writing graphs.


Are these two gmetad processes co-located on the same server?  I think 
this is an interesting option that I, at least, was not aware of.

Did you go with this setup to alleviate the problems described above, or 
for other reasons?
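
If they are on the same box, I'm guessing the split looks something like 
the following (ports and paths are pure guesses to illustrate the 
question, not taken from your setup), with the interactive one either 
skipping RRD writes entirely or pointing rrd_rootdir somewhere 
throwaway:

    # gmetad-interactive.conf -- answers gweb only
    xml_port          8651
    interactive_port  8652

    # gmetad-graphs.conf -- writes the rrds
    xml_port          8661
    interactive_port  8662
    rrd_rootdir       "/var/lib/ganglia/rrds"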

> Initially, I spent a large amount of time using the `perf` utility to
> attempt to find the bottleneck in the interactive `gmetad` service. I
> found that the hash table implementation in `gmetad` leaves a lot to be
> desired: apart from very poor behavior in the face of concurrency, it
> also is the wrong datastructure to use for this purpose. Unfortunately,
> fixing this would require rewriting large swaths of Ganglia, so this was
> punted. Instead, Vlad suggested we simply reduce the number of
> summarized metrics by explicitly stating which metrics are summarized.

I strongly suspect a lot of the blame falls on the hash table.  I think 
it's likely the cause of the hangs we have observed in 
https://github.com/ganglia/monitor-core/issues/47, and Daniel Pocock has 
seen problems with it as far back as 2009.

Most of my experience is with Python and Java, which have fancy abstract 
base classes, collection hierarchies, and whatnot.  Even though the 
current hash table does not have the best properties, wouldn't it be 
relatively easy to drop in a better one?
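
To make the "drop in" idea concrete, here is roughly the scale of change 
I'm picturing (a sketch only -- the names below are made up and are not 
gmetad's actual hash API): keep the same lookup/insert shape, but give 
each bucket its own read/write lock so readers stop serializing each 
other.

    /* Hypothetical drop-in: a chained hash with a read/write lock per
     * bucket.  Names are made up for illustration; this is NOT gmetad's
     * existing hash.c API. */
    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 4096

    typedef struct node {
        char *key;
        void *val;
        struct node *next;
    } node_t;

    typedef struct {
        node_t *head;
        pthread_rwlock_t lock;
    } bucket_t;

    typedef struct {
        bucket_t buckets[NBUCKETS];
    } table_t;

    static unsigned hash_str(const char *s)          /* djb2 */
    {
        unsigned h = 5381;
        while (*s) h = h * 33 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    table_t *table_create(void)
    {
        table_t *t = calloc(1, sizeof(*t));
        for (int i = 0; i < NBUCKETS; i++)
            pthread_rwlock_init(&t->buckets[i].lock, NULL);
        return t;
    }

    void *table_lookup(table_t *t, const char *key)  /* many concurrent readers OK */
    {
        bucket_t *b = &t->buckets[hash_str(key)];
        void *val = NULL;
        pthread_rwlock_rdlock(&b->lock);
        for (node_t *n = b->head; n; n = n->next)
            if (strcmp(n->key, key) == 0) { val = n->val; break; }
        pthread_rwlock_unlock(&b->lock);
        return val;
    }

    void table_insert(table_t *t, const char *key, void *val)
    {
        bucket_t *b = &t->buckets[hash_str(key)];
        pthread_rwlock_wrlock(&b->lock);             /* a writer only blocks its own bucket */
        for (node_t *n = b->head; n; n = n->next)
            if (strcmp(n->key, key) == 0) {
                n->val = val;
                pthread_rwlock_unlock(&b->lock);
                return;
            }
        node_t *n = malloc(sizeof(*n));
        n->key = strdup(key);
        n->val = val;
        n->next = b->head;
        b->head = n;
        pthread_rwlock_unlock(&b->lock);
    }

That obviously ignores deletion, resizing, iteration, and memory 
ownership, but if the existing call sites could stay mostly as they are 
it doesn't feel like rewriting large swaths of Ganglia.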


> This improved the performance of the interactive process (and thus of
> the web interface), but didn't address other issues: graphs still
> weren't updating properly (or at all, in some cases). Running `perf` on
> the graphing `gmetad` process revealed that the issue was largely one of
> serialization: although we had thought we had configured `gmetad` to use
> `rrdcached` to improve caching performance, the way that Ganglia calls
> librrd doesn't actually end up using rrdcached -- `gmetad` was writing
> directly to disk every time, forcing us to spin in the kernel.
> Additionally, librrd isn't thread-safe (and its thread-safe API is
> broken). All calls to the RRD API are serialized, and each call to
> create or update not only hit disk, but prevented any other thread from
> calling create or update. We have 47 threads running at any time, all
> generally trying to write data to an RRD file.

Frankly I don't understand rrdcached.  The OS already has a fancy 
subsystem for keeping frequently accessed data in memory.  If we are 
dealing with a lot of files (instead of a database with indexes, where 
the application might have more information than the OS), why fight with 
it? (Canonical rant: https://www.varnish-cache.org/trac/wiki/ArchitectNotes)

Anyway, we have had much better luck with tuning the page cache and 
disabling fsync for gmetad: 
http://oss.oetiker.ch/rrdtool-trac/wiki/TuningRRD  Admittedly, at least 
some of the problems we had with rrdcached could have been due to the 
issues you have identified.
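
Concretely, what helped for us is along the lines of the vm.dirty_* 
sysctls that page talks about; something like this (the numbers here are 
only illustrative, not our exact values -- tune for your own write load 
and RAM):

    # /etc/sysctl.conf -- let the page cache absorb rrd writes for longer
    vm.dirty_ratio = 80                # allow more of RAM to be dirty before writers block
    vm.dirty_background_ratio = 50     # start background writeback later
    vm.dirty_expire_centisecs = 30000  # keep dirty pages ~5 minutes before forced writeback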

> In the process of doing this, I noticed that ganglia used a particularly
> poor method for reading its XML metrics from gmond: It initialized a
> 1024-byte buffer, read into it, and if it would overflow, it would
> realloc the buffer with an additional 1024 bytes and try reading again.
> When dealing with XML files many megabytes in size, this caused many
> unnecessary reallocations. I modified this code to start with a 128KB
> buffer and double the buffer size when it runs out of space. (I made a
> similar change to the code for decompressing gzip'ed data that used a
> similar buffer sizing paradigm).

This sounds like a solid find.  I'm a little worried about the doubling 
though, since as you said the responses can get quite large.  Is there a 
max buffer size?

Does your fix also handle the case of gmetad polling other gmetad?
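
Something like the following is what I'd hope for: geometric growth but 
with a ceiling (a sketch only -- the names and the cap value are mine, 
not from your patch):

    /* Sketch of a capped, doubling read of a gmond/gmetad XML dump.
     * INITIAL_XML_BUF, MAX_XML_BUF, and the helper name are made up for
     * illustration; this is not the actual patch. */
    #include <stdlib.h>
    #include <unistd.h>

    #define INITIAL_XML_BUF (128 * 1024)
    #define MAX_XML_BUF     (256 * 1024 * 1024)    /* refuse to grow without bound */

    static char *read_xml(int fd, size_t *len_out)
    {
        size_t cap = INITIAL_XML_BUF, len = 0;
        char *buf = malloc(cap);
        if (!buf) return NULL;

        for (;;) {
            if (len == cap) {                      /* out of space: double, up to the cap */
                if (cap >= MAX_XML_BUF) { free(buf); return NULL; }
                cap *= 2;
                char *nbuf = realloc(buf, cap);
                if (!nbuf) { free(buf); return NULL; }
                buf = nbuf;
            }
            ssize_t n = read(fd, buf + len, cap - len);
            if (n < 0) { free(buf); return NULL; } /* ignoring EINTR for brevity */
            if (n == 0) break;                     /* EOF: the remote side closed the socket */
            len += (size_t)n;
        }
        *len_out = len;
        return buf;
    }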

> After all these changes, both the interactive and RRD-writing processes
> spend most of their time in the hash table. I can continue improving
> Ganglia performance, but most of the low hanging fruit is now gone; at
> some point it will require:
>
>    * writing a version of librrd (this probably also means changing the
> rrd file format),

I think something got cut off here.

>    * replacing the hash table in Ganglia with one that performs better,

I enthusiastically embrace this change!

>    * changing the data serialization format from XML to one that is easier /
> faster to parse,

A common request is to support multiple formats (json).  I admit I'm a 
little surprised that the actual parsing is a significant cost relative 
to all of the other work that has to be done.
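
For reference, each host's chunk of the dump looks roughly like this 
(from memory, attributes abridged), and between the two DCs there are 
620k METRIC elements' worth of it:

    <HOST NAME="web01.dc1" IP="10.0.0.1" REPORTED="1386370000" TN="12" TMAX="20" DMAX="0">
      <METRIC NAME="load_one" VAL="0.12" TYPE="float" UNITS=" " TN="23"
              TMAX="70" DMAX="0" SLOPE="both" SOURCE="gmond"/>
      <METRIC NAME="bytes_in" VAL="4251.8" TYPE="float" UNITS="bytes/sec"
              TN="41" TMAX="300" DMAX="0" SLOPE="both" SOURCE="gmond"/>
      ...
    </HOST>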

>    * using a different data structure than a hash table for metrics
> hierarchies (probably a tree with metrics stored at each level in
> contiguous memory and an index describing each metric at each level)

As you said this is a large change, but likely a very beneficial one.  I 
think it would be particularly interesting if we could pre-generate 
indexes that would be useful to gweb.

Down the road a new data structure might also make it easier to support 
keeping the last n data points so that we didn't have to worry about the 
polling time interval dance so much.  That's more of a new feature than 
directly relevant to these performance issues though.
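
To sketch what I'm imagining (the types and field names here are mine, 
not from Devon's proposal or the gmetad source): each level of the tree 
keeps its metrics in one contiguous array, with a small per-metric ring 
of recent values bolted on.

    /* Sketch of the tree-with-contiguous-metrics idea, plus a small ring
     * of recent datapoints per metric.  All names are illustrative. */
    #include <stddef.h>
    #include <time.h>

    #define RECENT_POINTS 8

    typedef struct {
        time_t when;
        double value;
    } datapoint_t;

    typedef struct {
        char        name[64];               /* e.g. "load_one" */
        char        units[16];
        datapoint_t recent[RECENT_POINTS];  /* ring buffer of the last n values */
        unsigned    head;                   /* next slot to overwrite */
    } metric_t;

    typedef struct node {
        char         name[128];             /* grid, cluster, or host name */
        struct node *children;              /* contiguous array of child nodes */
        size_t       nchildren;
        metric_t    *metrics;               /* contiguous array: cache friendly, and */
        size_t       nmetrics;              /*   easy to dump as an index for gweb */
    } node_t;

    static void metric_record(metric_t *m, time_t when, double value)
    {
        m->recent[m->head] = (datapoint_t){ .when = when, .value = value };
        m->head = (m->head + 1) % RECENT_POINTS;
    }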

>    * refactoring gmetad and gmond into a single process that shares memory
>

I'm not sure I follow this one.  While the node running gmetad likely 
also has a gmond, most gmonds run on hosts without a gmetad.  The local 
gmond is also not necessarily reporting directly to the co-located 
gmetad.


Thank you again, Devon, for digging into this; I'm excited to try out 
some of the changes.

