Jesper Frank Nemholt wrote:
Hi!

I have a couple of questions related to Ganglia.

I'll say!

I've read some of the Ganglia documentation but haven't tried installing it yet.

Oooh, there's a good start.

What I usually need is a tool that can tell me (and let me tell whoever owns the service, since they're usually the one asking) why some machine or application isn't performing well, either right now or at some point in the past.

If you ever find this tool, be sure and post back to this list. I'm sure we'd all love to get our hands on it. :)

> To do this I need precise information about cpu, memory, disk I/O, NICs, tape drives, cpu usage on a per process & user basis and I need to be able to zoom in on a short period and filter out unwanted data.

Here's the problem:

The current Ganglia DTD doesn't allow for hierarchical metrics. They're all flat. This makes collecting a table's worth of data (e.g. mounted-partition statistics, running-process statistics) problematic at best. I tried putting together something that hacks around it, but it isn't pretty.
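For what it's worth, the usual hack looks something like the sketch below: encode the "row" of the table in the metric name itself and push each value through gmetric. This is an untested illustration that assumes gmetric's usual --name/--value/--type/--units options and a df parse that's good enough for your platforms.

    #!/usr/bin/env python
    # Rough sketch of the flat-namespace hack: encode the hierarchy (here,
    # per-partition free space) in the metric NAME and push each value
    # through the gmetric command-line tool. Assumes gmetric is on the PATH
    # and that your gmond is listening on its default channel.

    import subprocess

    def disk_free_kb():
        """Return {mount_point: free_kb}; the df parsing is illustrative only."""
        stats = {}
        for line in subprocess.check_output(["df", "-k"]).decode().splitlines()[1:]:
            fields = line.split()
            if len(fields) >= 6:
                stats[fields[5]] = int(fields[3])
        return stats

    def publish(name, value, mtype="uint32", units="KB"):
        # One gmetric call per value -- ugly, but it's all the flat DTD allows.
        subprocess.check_call([
            "gmetric",
            "--name", name,
            "--value", str(value),
            "--type", mtype,
            "--units", units,
        ])

    if __name__ == "__main__":
        for mount, free in disk_free_kb().items():
            # /var/tmp -> disk_free_var_tmp ; the "table row" lives in the name.
            suffix = mount.strip("/").replace("/", "_") or "root"
            publish("disk_free_" + suffix, free)

You end up with a pile of metrics named disk_free_<something> that appear and disappear as filesystems come and go, which is exactly the ugliness I was talking about.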

The developers list has been carrying talk about how to improve this in Ganglia 3 for some months now (check the SF archives). As always, the vaporware product addresses most of your needs. :)

In my tool I've done this by making the graphing completely dynamic (controlled by the user) and by storing data in a database. To gather precise data, and a lot of it, without creating unwanted CPU overhead, I decided to use existing operating system tools as performance agents. So on Tru64 I use /usr/sbin/collect, and on Solaris I use the SE Toolkit combined with a custom SE script I've written. The best approach would of course be to develop my own kernel collectors, but lack of time & knowledge has prevented that.

The fact that you're developing on Tru64 doesn't make it any easier.

[you'd be using Ganglia on the two platforms that I initially ported it to...]

Monitoring these things with shell scripts is out of the question due to the CPU usage of such solutions. What I currently do with Collect and the SE Toolkit is run them with a 5-minute interval. Every 5 minutes these tools output average values for the last 5 minutes, and I then store these values in a central database.

gmetad stores data in the RRDs every 15 seconds, although that's a user-configurable interval.

I've reduced the amount of data by eliminating data that doesn't interest me, such as processes without CPU usage, disks without I/O, etc.

There's a "slope" attribute for each metric that notes whether the value has changed significantly. I don't think gmetad pays attention to this yet, although we were just talking about this on developers this or last week. It's a relatively minor code change...

Anyway, given my needs I'd like to know what I can do with Ganglia. First of all, I need much more data than the default operating system agents provide. I can see that I can provide such data with scripts feeding Ganglia, and thus would be able to wrap output from a tool like Collect into Ganglia data, but what do I do with the RRDs when it's dynamic data? Doing disk, user and process statistics I'll generate lots of data that changes all the time, which means RRDs will need to be dynamically created. Secondly, to view selected parts of such data, I assume I'd run into problems with the current Ganglia web interface & RRD, right? I've previously tried using Cacti (the RRD web interface) and it's a no-go. Way too time-consuming to configure and maintain. I need a solution that can maintain itself and adjust itself to my ever-changing systems (disks, NICs, CPUs, filesystems etc. are added and removed daily).

Normally I don't like to respond to this sort of thing and say "RTFM/S" (S == source), but the M's not that up-to-date and the code I contributed is grody.

RRD is a dynamic data format whose resolution degrades over time (in a user-configurable manner). It's attractive because of its fixed "database" size (yet you can still "zoom in" on it to a certain degree). It's ugly because librrdtool is non-reentrant and because each database is a separate file on the filesystem...
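To give you an idea of what "dynamically created" RRDs look like in practice, here's a sketch along the lines of what gmetad does: one file per (host, metric), created the first time the metric shows up, with a resolution ladder that coarsens over time. The step, DS name, retention numbers and paths below are made up; tune them for your site.

    #!/usr/bin/env python
    # Create an RRD per (cluster, host, metric) on first sight, then update it.
    # Shells out to the rrdtool CLI to dodge librrdtool's reentrancy issues.

    import os
    import subprocess

    RRD_ROOT = "/var/lib/ganglia/rrds"   # assumption: wherever you keep them
    STEP = 15                            # seconds, matching gmetad's default

    def rrd_path(cluster, host, metric):
        return os.path.join(RRD_ROOT, cluster, host, metric + ".rrd")

    def ensure_rrd(cluster, host, metric):
        path = rrd_path(cluster, host, metric)
        if os.path.exists(path):
            return path
        os.makedirs(os.path.dirname(path), exist_ok=True)
        subprocess.check_call([
            "rrdtool", "create", path,
            "--step", str(STEP),
            "DS:value:GAUGE:%d:U:U" % (STEP * 2),   # one gauge data source
            "RRA:AVERAGE:0.5:1:5856",     # ~1 day at 15 s resolution
            "RRA:AVERAGE:0.5:20:2016",    # ~1 week at 5 min
            "RRA:AVERAGE:0.5:240:1464",   # ~2 months at 1 h
            "RRA:AVERAGE:0.5:5760:366",   # ~1 year at 1 day
        ])
        return path

    def update_rrd(cluster, host, metric, value, timestamp="N"):
        # "N" means "now" to rrdtool; pass an epoch time to backfill.
        path = ensure_rrd(cluster, host, metric)
        subprocess.check_call(["rrdtool", "update", path,
                               "%s:%s" % (timestamp, value)])

The total on-disk size stays fixed no matter how long it runs; you just lose resolution on the old stuff.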

Hosts and metrics disappear from a cluster after a certain amount of time. The RRDs, however, remain. This will make very little difference to a modern storage device, though (it's, what, 300k per host on the metric-infused Linux platform?).

Another approach I've thought of is using the Ganglia agents & infrastructure, but then offloading the data with a wrapper script into a database and querying the database as I do with my current solution.

I think you'd be best served by a combination approach. You may want to write a nice script/app that polls gmetad every five minutes, parses the relevant results, and inserts the results into a database. You could also configure it to poll the data sources directly but that would involve more work. The trouble is that gmetad gives you the whole XML dump every time - there's no query/response method involved that can trim that down. That kind of sucks in installations like mine where the entire XML feed is up to 3.6MB ... that's a lot of data to parse on every page load of the front-end.
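A bare-bones version of that poller might look like the following. 8651 is gmetad's usual xml_port; the table layout, database file and the WANTED metric list are invented for the example, and there's no error handling.

    #!/usr/bin/env python
    # Every run: grab the full XML dump from gmetad, keep the metrics we care
    # about, and append them to a SQLite table we can actually query later.

    import socket
    import sqlite3
    import time
    import xml.etree.ElementTree as ET

    GMETAD = ("localhost", 8651)
    WANTED = {"cpu_user", "cpu_system", "load_one", "bytes_in", "bytes_out"}

    def fetch_xml(addr=GMETAD):
        # gmetad dumps the whole tree and closes the socket -- there is no
        # query/response protocol to trim it down.
        sock = socket.create_connection(addr)
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
        sock.close()
        return b"".join(chunks)

    def store(db, xml_blob):
        now = int(time.time())
        root = ET.fromstring(xml_blob)
        rows = []
        for cluster in root.iter("CLUSTER"):
            for host in cluster.iter("HOST"):
                for metric in host.findall("METRIC"):
                    if metric.get("NAME") in WANTED:
                        rows.append((now, cluster.get("NAME"), host.get("NAME"),
                                     metric.get("NAME"), metric.get("VAL")))
        db.executemany("INSERT INTO samples VALUES (?,?,?,?,?)", rows)
        db.commit()

    if __name__ == "__main__":
        db = sqlite3.connect("ganglia.db")
        db.execute("CREATE TABLE IF NOT EXISTS samples "
                   "(ts INTEGER, cluster TEXT, host TEXT, metric TEXT, value TEXT)")
        store(db, fetch_xml())

Run something like that out of cron every five minutes and you've got your queryable history without touching the gmetad code at all.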

Also, note that the Solaris platform code already has the necessary (spaghetti) code in it to query via the kstat and kvm interfaces.

Tru64 also has the code in place using whatever freaky crazy interface Digital uses. So, assuming you can actually *find* the right query targets, you can copy and paste the existing code and change a few key words here and there.

I think it's Tru64 where I really got the CPU calculation code to shine. It adds up to 100% every time! No matter how many procs there are!

[or was that IRIX?]

This would give me the nice infrastructure and at the same time allow me the visualisation I need, but it still runs short of comprehensive kernel agents (I'd still have to do some ugly interfacing between custom collectors and Ganglia on each platform). Of course I could also just join the Ganglia development effort and focus on my needs :-)

Wasn't there just an article on Slashdot yesterday about bug-reporting etiquette that featured the line, "Open Source does not mean 'the developers will do my bidding'"?

Now, if you want to start writing multi-platform data source plug-ins for the next version ...

[the idea of data *store* plug-ins is also being bandied about - in other words, this "storing-data-in-RRDs" stuff would be moved to a plug-in, so you'd be able to configure a data-collecting node to store its data in a remote database or even unicast it to another node somewhere else, if you like... which some people apparently do for reasons which I don't really entirely understand ... ]

Hints, ideas, opinions on how to solve these things are welcome.

Two outta three ain't bad...

Good luck, whatever you end up doing. :)

