Jesper Frank Nemholt wrote:
> Hi!
> I have a couple of questions related to Ganglia.
I'll say!
> I've read some of the Ganglia documentation but haven't tried installing
> it yet.
Oooh, there's a good start.
> What I usually need is a tool that can tell me (and let me tell whichever
> service owner is usually the one asking) why some machine/application
> isn't performing well, either right now or at some point in the past.
If you ever find this tool, be sure and post back to this list. I'm sure
we'd all love to get our hands on it. :)
> To do this I need precise information about cpu, memory, disk I/O, NICs,
> tape drives, and cpu usage on a per-process & per-user basis, and I need
> to be able to zoom in on a short period and filter out unwanted data.
Here's the problem:
The current Ganglia DTD doesn't allow for hierarchical metrics. They're
all flat. This makes collecting a table's worth of data (e.g. mounted
partition statistics, running process statistics) problematic at best. I
tried putting together something that hacks around it but it isn't pretty.
The developers list has been carrying talk about how to improve this in
Ganglia 3 for some months now (check them SF archives). As always, the
vaporware product addresses most of your needs. :)
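For flavor, the hack looks roughly like this: mangle the row key (the
mount point, the PID, whatever) into the metric name and push one flat
metric per cell through gmetric. Everything below is invented for
illustration (the naming scheme, the crude df scrape); only gmetric's
--name/--value/--type/--units options are the real thing.

    #!/usr/bin/env python
    # Hypothetical sketch: flatten per-partition stats into Ganglia's flat
    # metric namespace by encoding the partition into the metric name.
    import subprocess

    def mounted_partitions():
        # Crude 'df -k' scrape; fine for a sketch, not for production.
        out = subprocess.run(["df", "-k"], capture_output=True,
                             text=True).stdout
        for line in out.splitlines()[1:]:
            fields = line.split()
            if len(fields) >= 6:
                yield fields[5], fields[4].rstrip("%")   # mount, %used

    def publish(name, value):
        subprocess.run(["gmetric", "--name=" + name, "--value=" + str(value),
                        "--type=uint16", "--units=%"], check=True)

    if __name__ == "__main__":
        for mount, pct in mounted_partitions():
            # '/' -> disk_root_pct_used, '/var' -> disk_var_pct_used, ...
            key = "root" if mount == "/" else mount.strip("/").replace("/", "_")
            publish("disk_%s_pct_used" % key, pct)

You end up with a pile of metrics named disk_<whatever>_pct_used, which
works, but nothing downstream knows they belong together. Hence the
Ganglia 3 talk.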
> In my tool I've done this by making the graphing completely dynamic
> (controlled by the user) and by storing data in a database. To allow
> gathering precise data, and a lot of it, without creating any unwanted
> CPU overhead, I decided to pick existing operating system tools as
> performance agents. So on Tru64 I use /usr/sbin/collect, and on Solaris I
> use SE Toolkit combined with a custom SE script I've made.
> The best would of course be to develop my own kernel collectors, but lack
> of time & knowledge prevented me from doing that.
The fact that you're developing on Tru64 doesn't make it any easier.
[you'd be using Ganglia on the two platforms that I initially ported it to...]
> Monitoring these things with shell scripts is out of the question due to
> the CPU usage of such solutions.
> What I currently do with Collect and SE Toolkit is run with a 5 minute
> interval. Every 5 minutes these tools output average values for the last
> 5 minutes, and I then store those values in a central database.
gmetad stores data in the RRDs every 15 seconds, although that's a
user-configurable interval.
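(If you go looking for it, that's the number on the data_source line in
gmetad.conf, i.e. the polling interval in seconds. Hostnames here are
obviously made up:

    data_source "my cluster" 15 node01.example.com node02.example.com:8649

Change the 15 to a 5 and gmetad will poll that source, and update its
RRDs, three times as often.)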
> I've reduced the data volume by eliminating data that doesn't interest
> me, such as processes without CPU usage, disks without I/O, etc.
There's a "slope" attribute for each metric that notes whether the value
has changed significantly. I don't think gmetad pays attention to this
yet, although we were just talking about this on the developers list this
week or last. It's a relatively minor code change...
> Anyway, given my needs I'd like to know what I can do with Ganglia. First
> of all, I need much more data than the default operating system agents
> provide. I can see that I can provide such data with scripts feeding
> Ganglia, and thus would be able to wrap output from a tool like Collect
> into Ganglia data, but what do I do with the RRDs when it's dynamic data?
> Doing disk, user and process statistics, I'll generate lots of data that
> change all the time, which means RRDs will need to be created dynamically.
> Secondly, to view selected parts of such data, I assume I'd run into
> problems with the current Ganglia web interface & RRD, right?
> I've previously tried using Cacti (the RRD web interface) and it's a
> no-go. It's way too time consuming to configure and maintain. I need a
> solution that can maintain itself and adjust itself to my ever-changing
> systems (disks, NICs, CPUs, filesystems etc. are added and removed daily).
Normally I don't like to respond to this sort of thing and say "RTFM/S" (S
== source), but the M's not that up-to-date and the code I contributed is
grody.
RRD is a dynamic data format whose resolution degrades over time (in a
user-configurable manner). It's attractive because of its fixed "database"
size (yet you can still "zoom in" on it to a certain degree). It's ugly
because librrdtool is non-reentrant and because each database is a separate
file on the filesystem...
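If you haven't poked at RRDs yet, here's the flavor of it. The step and
archive sizes below are invented for the example, not gmetad's actual
layout:

    #!/usr/bin/env python
    # Illustration only: a 15-second-step RRD whose resolution degrades
    # through three archives.  The file size is fixed at creation time no
    # matter how long you keep feeding it samples.
    import subprocess

    subprocess.run([
        "rrdtool", "create", "example.rrd",
        "--step", "15",               # expect one sample every 15 seconds
        "DS:value:GAUGE:60:U:U",      # one data source, 60-second heartbeat
        "RRA:AVERAGE:0.5:1:5760",     # 15s resolution kept for 1 day
        "RRA:AVERAGE:0.5:40:4032",    # 10-minute averages kept for 4 weeks
        "RRA:AVERAGE:0.5:960:1460",   # 4-hour averages kept for ~8 months
    ], check=True)

"Zooming in" just means rrdtool hands you the finest-grained archive that
still covers the window you asked for.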
Hosts and metrics disappear from a cluster after a certain amount of time.
The RRDs, however, remain. This will make very little difference to a
modern storage device, though (it's, what, 300k per host on the
metric-infused Linux platform?).
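If the leftovers ever do start to bother you, a cron'd sweep is about all
it takes. The path below is only my guess at the usual rrd_rootdir; check
your gmetad.conf. It prints rather than deletes until you trust it:

    #!/usr/bin/env python
    # Housekeeping sketch: find RRDs that haven't been written in 30 days.
    import os
    import time

    RRD_ROOT = "/var/lib/ganglia/rrds"    # assumption; check rrd_rootdir
    CUTOFF = time.time() - 30 * 86400

    for dirpath, dirnames, filenames in os.walk(RRD_ROOT):
        for name in filenames:
            if name.endswith(".rrd"):
                path = os.path.join(dirpath, name)
                if os.path.getmtime(path) < CUTOFF:
                    print("stale:", path)   # swap in os.remove(path) later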
> Another approach I've thought of is using the Ganglia agents &
> infrastructure, but then offloading the data with a wrapper script into a
> database and querying that database like I do with my current solution.
I think you'd be best served by a combination approach. You may want to
write a nice script/app that polls gmetad every five minutes, parses out
the relevant results, and inserts them into a database. You could also
configure it to poll the data sources directly but that would involve more
work. The trouble is that gmetad gives you the whole XML dump every time -
there's no query/response method involved that can trim that down. That
kind of sucks in installations like mine where the entire XML feed is up to
3.6MB ... that's a lot of data to parse on every page load of the front-end.
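A minimal sketch of that polling script, assuming gmetad is listening on
its default xml_port (8651) and that SQLite is good enough for a first
pass; the table layout and the metric shortlist are made up, adjust to
taste:

    #!/usr/bin/env python
    # Hypothetical poller: grab gmetad's full XML dump, keep a shortlist of
    # metrics, and append them to a local SQLite table.
    import socket
    import sqlite3
    import xml.etree.ElementTree as ET

    GMETAD = ("localhost", 8651)           # gmetad's default xml_port
    WANTED = {"cpu_user", "cpu_system", "load_one", "bytes_in", "bytes_out"}

    def fetch_xml(addr):
        # gmetad just dumps the whole tree and closes the connection.
        sock = socket.create_connection(addr)
        chunks = []
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
        sock.close()
        return b"".join(chunks)

    db = sqlite3.connect("metrics.db")
    db.execute("CREATE TABLE IF NOT EXISTS samples "
               "(cluster TEXT, host TEXT, metric TEXT, value REAL, ts INTEGER)")

    root = ET.fromstring(fetch_xml(GMETAD))
    for cluster in root.iter("CLUSTER"):
        for host in cluster.iter("HOST"):
            for metric in host.iter("METRIC"):
                if metric.get("NAME") not in WANTED:
                    continue
                if metric.get("SLOPE") == "zero":
                    continue               # constant values aren't worth a row
                db.execute("INSERT INTO samples VALUES (?,?,?,?,?)",
                           (cluster.get("NAME"), host.get("NAME"),
                            metric.get("NAME"), float(metric.get("VAL")),
                            int(host.get("REPORTED"))))
    db.commit()

Run it from cron every five minutes and you keep your existing query
habits. The annoying part, as noted, is that you pull and parse the whole
dump just to keep a handful of metrics.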
Also, note that the Solaris platform code already has the necessary
(spaghetti) code in it to query via the kstat and kvm interfaces.
Tru64 also has the code in place using whatever freaky crazy interface
Digital uses. So, assuming you can actually *find* the right query
targets, you can copy and paste the existing code and change a few key
words here and there.
I think it's Tru64 where I really got the CPU calculation code to shine.
It adds up to 100% every time! No matter how many procs there are!
[or was that IRIX?]
> This would give me the nice infrastructure and at the same time allow me
> the visualisation I need, but it still falls short of comprehensive kernel
> agents (I'd still have to do some ugly interfacing between custom
> collectors and Ganglia on each platform).
> Of course I could also just join the Ganglia development effort and focus
> on my needs :-)
Wasn't there just an article on Slashdot yesterday about bug-reporting
etiquette that featured the line, "Open Source does not mean 'The
developers will do my bidding'"?
Now, if you want to start writing multi-platform data source plug-ins for
the next version ...
[the idea of data *store* plug-ins is also being bandied about - in other
words, this "storing-data-in-RRDs" stuff would be moved to a plug-in, so
you'd be able to configure a data-collecting node to store its data in a
remote database or even unicast it to another node somewhere else, if you
like... which some people apparently do for reasons which I don't really
entirely understand ... ]
> Hints, ideas, opinions on how to solve these things are welcome.
Two outta three ain't bad...
Good luck, whatever you end up doing. :)