Jesper Frank Nemholt wrote:
Hi!

I have a couple of questions related to Ganglia.

I'll say!

I've read some of the Ganglia documentation but haven't tried installing it yet.

Oooh, there's a good start.

What I usually need is a tool that can tell me (and let me tell whoever owns the service, since they're usually the one asking) why some machine or application isn't performing well, either right now or at some point in the past.

If you ever find this tool, be sure and post back to this list. I'm sure we'd all love to get our hands on it. :)

> To do this I need precise information about cpu, memory, disk I/O, NICs, tape drives, cpu usage on a per process & user basis and I need to be able to zoom in on a short period and filter out unwanted data.

Here's the problem:

The current Ganglia DTD doesn't allow for hierarchical metrics. They're all flat. This makes collecting a table's worth of data (e.g. mounted-partition statistics, running-process statistics) problematic at best. I tried putting together something that hacks around it, but it isn't pretty.
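For what it's worth, the usual hack looks something like the sketch below: encode the "row" of the table in the metric name itself and push each value through gmetric. This is an untested illustration that assumes gmetric's usual --name/--value/--type/--units options and a df parse that's good enough for your platforms.

    #!/usr/bin/env python
    # Rough sketch of the flat-namespace hack: encode the hierarchy (here,
    # per-partition free space) in the metric NAME and push each value
    # through the gmetric command-line tool. Assumes gmetric is on the PATH
    # and that your gmond is listening on its default channel.

    import subprocess

    def disk_free_kb():
        """Return {mount_point: free_kb}; the df parsing is illustrative only."""
        stats = {}
        for line in subprocess.check_output(["df", "-k"]).decode().splitlines()[1:]:
            fields = line.split()
            if len(fields) >= 6:
                stats[fields[5]] = int(fields[3])
        return stats

    def publish(name, value, mtype="uint32", units="KB"):
        # One gmetric call per value -- ugly, but it's all the flat DTD allows.
        subprocess.check_call([
            "gmetric",
            "--name", name,
            "--value", str(value),
            "--type", mtype,
            "--units", units,
        ])

    if __name__ == "__main__":
        for mount, free in disk_free_kb().items():
            # /var/tmp -> disk_free_var_tmp ; the "table row" lives in the name.
            suffix = mount.strip("/").replace("/", "_") or "root"
            publish("disk_free_" + suffix, free)

You end up with a pile of metrics named disk_free_<something> that appear and disappear as filesystems come and go, which is exactly the ugliness I was talking about.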

The developers list has been carrying talk about how to improve this in Ganglia 3 for some months now (check the SF archives). As always, the vaporware product addresses most of your needs. :)

In my tool I've done this by making the graphing completely dynamic (controlled by the user) and by storing data in a database. To gather precise data, and a lot of it, without creating unwanted CPU overhead, I decided to use existing operating system tools as performance agents. So on Tru64 I use /usr/sbin/collect, and on Solaris I use the SE Toolkit combined with a custom SE script I've written. The best approach would of course be to develop my own kernel collectors, but lack of time & knowledge has prevented that.

The fact that you're developing on Tru64 doesn't make it any easier.

[you'd be using Ganglia on the two platforms that I initially ported it to...]

Monitoring these things with shell scripts is out of the question due to the CPU usage of such solutions. What I currently do with Collect and the SE Toolkit is run them with a 5-minute interval. Every 5 minutes these tools output average values for the last 5 minutes, and I then store these values in a central database.

gmetad stores data in the RRDs every 15 seconds, although that's a user-configurable interval.

I've reduced the amount of data by eliminating data that doesn't interest me, such as processes without CPU usage, disks without I/O, etc.

There's a "slope" attribute for each metric that notes whether the value has changed significantly. I don't think gmetad pays attention to this yet, although we were just talking about this on developers this or last week. It's a relatively minor code change...

Anyway, given my needs I'd like to know what I can do with Ganglia. First of all, I need much more data than the default operating system agents provide. I can see that I can provide such data with scripts feeding Ganglia, and thus would be able to wrap output from a tool like Collect into Ganglia data, but what do I do with the RRDs when it's dynamic data? Doing disk, user and process statistics I'll generate lots of data that changes all the time, which means RRDs will need to be dynamically created. Secondly, to view selected parts of such data, I assume I'd run into problems with the current Ganglia web interface & RRD, right? I've previously tried using Cacti (the RRD web interface) and it's a no-go. Way too time-consuming to configure and maintain. I need a solution that can maintain itself and adjust itself to my ever-changing systems (disks, NICs, CPUs, filesystems etc. are added and removed daily).

Normally I don't like to respond to this sort of thing and say "RTFM/S" (S == source), but the M's not that up-to-date and the code I contributed is grody.

RRD is a dynamic data format whose resolution degrades over time (in a user-configurable manner). It's attractive because of its fixed "database" size (yet you can still "zoom in" on it to a certain degree). It's ugly because librrdtool is non-reentrant and because each database is a separate file on the filesystem...
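To give you an idea of what "dynamically created" RRDs look like in practice, here's a sketch along the lines of what gmetad does: one file per (host, metric), created the first time the metric shows up, with a resolution ladder that coarsens over time. The step, DS name, retention numbers and paths below are made up; tune them for your site.

    #!/usr/bin/env python
    # Create an RRD per (cluster, host, metric) on first sight, then update it.
    # Shells out to the rrdtool CLI to dodge librrdtool's reentrancy issues.

    import os
    import subprocess

    RRD_ROOT = "/var/lib/ganglia/rrds"   # assumption: wherever you keep them
    STEP = 15                            # seconds, matching gmetad's default

    def rrd_path(cluster, host, metric):
        return os.path.join(RRD_ROOT, cluster, host, metric + ".rrd")

    def ensure_rrd(cluster, host, metric):
        path = rrd_path(cluster, host, metric)
        if os.path.exists(path):
            return path
        os.makedirs(os.path.dirname(path), exist_ok=True)
        subprocess.check_call([
            "rrdtool", "create", path,
            "--step", str(STEP),
            "DS:value:GAUGE:%d:U:U" % (STEP * 2),   # one gauge data source
            "RRA:AVERAGE:0.5:1:5856",     # ~1 day at 15 s resolution
            "RRA:AVERAGE:0.5:20:2016",    # ~1 week at 5 min
            "RRA:AVERAGE:0.5:240:1464",   # ~2 months at 1 h
            "RRA:AVERAGE:0.5:5760:366",   # ~1 year at 1 day
        ])
        return path

    def update_rrd(cluster, host, metric, value, timestamp="N"):
        # "N" means "now" to rrdtool; pass an epoch time to backfill.
        path = ensure_rrd(cluster, host, metric)
        subprocess.check_call(["rrdtool", "update", path,
                               "%s:%s" % (timestamp, value)])

The total on-disk size stays fixed no matter how long it runs; you just lose resolution on the old stuff.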

Hosts and metrics disappear from a cluster after a certain amount of time. The RRDs, however, remain. This will make very little difference to a modern storage device, though (it's, what, 300k per host on the metric-infused Linux platform?).

Another approach I've thought of is using the Ganglia agents & infrastructure, but then offloading the data with a wrapper script into a database and querying the database as I do with my current solution.

I think you'd be best served by a combination approach. You may want to write a nice script/app that polls gmetad every five minutes, parses the relevant results, and inserts the results into a database. You could also configure it to poll the data sources directly but that would involve more work. The trouble is that gmetad gives you the whole XML dump every time - there's no query/response method involved that can trim that down. That kind of sucks in installations like mine where the entire XML feed is up to 3.6MB ... that's a lot of data to parse on every page load of the front-end.
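A bare-bones version of that poller might look like the following. 8651 is gmetad's usual xml_port; the table layout, database file and the WANTED metric list are invented for the example, and there's no error handling.

    #!/usr/bin/env python
    # Every run: grab the full XML dump from gmetad, keep the metrics we care
    # about, and append them to a SQLite table we can actually query later.

    import socket
    import sqlite3
    import time
    import xml.etree.ElementTree as ET

    GMETAD = ("localhost", 8651)
    WANTED = {"cpu_user", "cpu_system", "load_one", "bytes_in", "bytes_out"}

    def fetch_xml(addr=GMETAD):
        # gmetad dumps the whole tree and closes the socket -- there is no
        # query/response protocol to trim it down.
        sock = socket.create_connection(addr)
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
        sock.close()
        return b"".join(chunks)

    def store(db, xml_blob):
        now = int(time.time())
        root = ET.fromstring(xml_blob)
        rows = []
        for cluster in root.iter("CLUSTER"):
            for host in cluster.iter("HOST"):
                for metric in host.findall("METRIC"):
                    if metric.get("NAME") in WANTED:
                        rows.append((now, cluster.get("NAME"), host.get("NAME"),
                                     metric.get("NAME"), metric.get("VAL")))
        db.executemany("INSERT INTO samples VALUES (?,?,?,?,?)", rows)
        db.commit()

    if __name__ == "__main__":
        db = sqlite3.connect("ganglia.db")
        db.execute("CREATE TABLE IF NOT EXISTS samples "
                   "(ts INTEGER, cluster TEXT, host TEXT, metric TEXT, value TEXT)")
        store(db, fetch_xml())

Run something like that out of cron every five minutes and you've got your queryable history without touching the gmetad code at all.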

Also, note that the Solaris platform code already has the necessary (spaghetti) code in it to query via the kstat and kvm interfaces.

Tru64 also has the code in place using whatever freaky crazy interface Digital uses. So, assuming you can actually *find* the right query targets, you can copy and paste the existing code and change a few key words here and there.

I think it's Tru64 where I really got the CPU calculation code to shine. It adds up to 100% every time! No matter how many procs there are!

[or was that IRIX?]

This would give me the nice infrastructure and at the same time allow me the visualisation I need, but it still runs short of comprehensive kernel agents (I'd still have to do some ugly interfacing between custom collectors and Ganglia on each platform). Of course I could also just join the Ganglia development effort and focus on my needs :-)

Wasn't there just an article on Slashdot yesterday about bug-reporting etiquette that featured the line, "Open Source does not mean 'the developers will do my bidding'"?

Now, if you want to start writing multi-platform data source plug-ins for the next version ...

[the idea of data *store* plug-ins is also being bandied about - in other words, this "storing-data-in-RRDs" stuff would be moved to a plug-in, so you'd be able to configure a data-collecting node to store its data in a remote database or even unicast it to another node somewhere else, if you like... which some people apparently do for reasons which I don't really entirely understand ... ]

Hints, ideas, opinions on how to solve these things are welcome.

Two outta three ain't bad...

Good luck, whatever you end up doing. :)

