Hi!

I have a couple of questions related to Ganglia.
I've read some of the Ganglia documentation but haven't tried installing it 
yet.

I'm currently using a home-developed solution to monitor about 100 large 
Tru64 & Solaris machines. Some clustered, some non-clustered. None of them 
are large clusters, just 2-4 nodes, and each machine/cluster is running 
different stuff... not parallel/distributed research stuff but just normal 
business stuff (Oracle databases, Tuxedo, SAP, WebLogic etc.). It's a telco 
company.
The home-developed solution I've made to monitor these machines is 
available at http://statdb.dassic.com/ (screenshots at 
http://statdb.dassic.com/docs/screenshots.php ).
It works fine in the current environment, especially on Tru64, which is my 
primary platform, but I'd like to take it a little further in terms of 
design, platform support and scalability.
This would either require a complete rewrite/redesign of my current 
solution, or adapting another solution such as Ganglia.
The problem is, as I see it, that none of the other tools in their current 
form allow me to do what I want. There are tools out there that do 
exactly what I want, but they all come with a big price tag.

As far as I can see, Ganglia has good platform support, good framework 
design & scalability, but I think it falls short of my needs when it comes 
to what data it collects & presents and how I can choose to view the data.
So that's what I'd like to ask about: what is possible with Ganglia, and 
how, given my needs.

What I usually need is a tool that can tell me, and let me tell the person 
responsible for the service (usually the one asking), why some 
machine/application isn't performing well, either right now or at some point 
in the past. To do this I need precise information about CPU, memory, disk 
I/O, NICs, tape drives, and CPU usage on a per-process & per-user basis, and 
I need to be able to zoom in on a short period and filter out unwanted data.
In my tool I've done this by making the graphing completely dynamic 
(controlled by the user) and by storing data in a database. To allow 
gathering precise data, and a lot of it, without creating any unwanted CPU 
overhead, I decided to pick existing operating system tools as performance 
agents. So on Tru64 I use /usr/sbin/collect, on Solaris I use SE Toolkit 
combined with a custom SE script I've made.
The best approach would of course be to develop my own kernel collectors, 
but lack of time & knowledge prevented that.
Monitoring these things with shell scripts is out of the question due to 
the CPU overhead of such solutions.
What I currently do with Collect and SE Toolkit is to run them at a 5-minute 
interval. Every 5 minutes these tools output average values for the last 5 
minutes, and I then store those values in a central database.
The performance collectors and the central database are interconnected with 
some Perl code to send & receive data.
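To give an idea of what that glue looks like, here's a stripped-down Perl 
sketch (the real schema is different; the table, column and host names here 
are made up for the example):

  #!/usr/bin/perl
  # Toy version of the collector-to-database glue: read one 5-minute
  # average per line from the collector and insert it into the central DB.
  use strict;
  use warnings;
  use DBI;

  my $dbh = DBI->connect("DBI:mysql:database=statdb;host=dbhost",
                         "statdb", "secret", { RaiseError => 1 });
  my $sth = $dbh->prepare(
      "INSERT INTO cpu_samples (host, sampled_at, user_pct, sys_pct, idle_pct)
       VALUES (?, NOW(), ?, ?, ?)");

  # Input is one line per interval, e.g. "12.3 4.5 83.2"
  while (my $line = <STDIN>) {
      my ($user, $sys, $idle) = split ' ', $line;
      $sth->execute($ENV{HOSTNAME} || "unknown", $user, $sys, $idle);
  }
  $dbh->disconnect;
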
My solution is used elsewhere (it's OpenSource) with other intervals and 
usually scales up to a couple of hundred systems, even with shorter intervals 
such as 1 minute, but it won't scale up like Ganglia. I use MySQL (using 
InnoDB tables) or Oracle as database and there's a limit to how much data I 
can insert per second. I saw in the docs that Ganglia started out with 
MySQL too and ran into performance problems. I did too, but I managed to 
overcome the first bottleneck by switching to InnoDB, turning off database 
logging and doing various other tuning. In my own setup with about 100 
machines I don't 
have any problems currently, and CPU usage from the database is low. I have 
around 15 insert queries per second.
I've reduced the data amount by eliminating data that doesn't interest me, 
such as processes without CPU usage, disks without I/O, etc.
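For instance, on the per-process side only entries that actually used CPU in 
the interval make it into the database; something like this toy Perl (the 
data structure is made up, just to show the idea):

  #!/usr/bin/perl
  # Toy example of the data reduction: only active processes are kept.
  use strict;
  use warnings;

  my %cpu_by_pid = ( 123 => 0.0, 456 => 12.5, 789 => 0.0 );  # made-up sample
  my @worth_storing = grep { $cpu_by_pid{$_} > 0 } keys %cpu_by_pid;
  print "keeping pids: @worth_storing\n";                    # keeps only 456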

Anyway, given my needs I'd like to know what I can do with Ganglia. First 
of all, I need much more data than the default operating system agents 
provide. I can see that I can provide such data with scripts feeding 
Ganglia and thus would be able to wrap output from a tool like Collect into 
Ganglia data, but what do I do with the RRDs when the data is dynamic?
Doing disk, user and process statistics, I'll generate lots of metrics that 
change all the time, which means RRDs would need to be created dynamically.
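Just to make that concrete, this is roughly what I imagine such a wrapper 
looking like; the gmetric options and the input format are assumptions on my 
part from a quick read of the docs:

  #!/usr/bin/perl
  # Rough sketch of a Collect-to-Ganglia wrapper: read per-disk throughput
  # from the collector and hand each value to gmetric.  Metric names, input
  # format and gmetric option names are assumed, not taken from real code.
  use strict;
  use warnings;

  while (my $line = <STDIN>) {            # e.g. "dsk0 123.4" = KB/s on dsk0
      my ($disk, $kbps) = split ' ', $line;
      next unless defined $kbps;
      system("gmetric",
             "--name",  "disk_${disk}_kbps",
             "--value", $kbps,
             "--type",  "float",
             "--units", "KB/sec") == 0
          or warn "gmetric failed for $disk\n";
  }

Every new disk would then show up as a brand-new metric, which is exactly 
where the dynamic-RRD question comes from.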
Secondly, to view selected parts of such data, I assume I'd run into 
problems with the current Ganglia web interface & RRD, right?
I've previously tried using Cacti (the RRD web interface) and it's a no-go: 
way too time-consuming to configure and maintain. I need a solution that 
can maintain itself and adjust itself to my ever-changing systems (disks, 
NICs, CPUs, filesystems etc. are added and removed daily).

Another approach I've thought of is using the Ganglia agents & 
infrastructure, but then offload the data with a wrapper script into a 
database and then query the database like I do with my current solution.
This would give me the nice infrastructure and at the same time allow me 
the visualisation I need, but it still falls short on comprehensive kernel 
agents (I'd still have to build some ugly interfacing between custom 
collectors and Ganglia on each platform).
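As a sketch of that offload idea (again with invented table/column names, 
and assuming gmond serves its XML dump on TCP port 8649 as the docs seem to 
say):

  #!/usr/bin/perl
  # Pull the XML that gmond dumps on its TCP port and push the metric
  # values into a database table.  Crude pattern matching instead of a
  # real XML parser, just to show the data flow.
  use strict;
  use warnings;
  use IO::Socket::INET;
  use DBI;

  my $sock = IO::Socket::INET->new(PeerAddr => "gmondhost",
                                   PeerPort => 8649,
                                   Proto    => "tcp")
      or die "connect: $!";
  my $xml = do { local $/; <$sock> };     # gmond closes after the dump
  close $sock;

  my $dbh = DBI->connect("DBI:mysql:database=statdb;host=dbhost",
                         "statdb", "secret", { RaiseError => 1 });
  my $sth = $dbh->prepare(
      "INSERT INTO samples (host, metric, value, sampled_at)
       VALUES (?, ?, ?, NOW())");

  my $host = "";
  while ($xml =~ /<(HOST|METRIC)\s+NAME="([^"]*)"([^>]*)/g) {
      my ($tag, $name, $rest) = ($1, $2, $3);
      if ($tag eq "HOST") { $host = $name; next; }
      my ($val) = $rest =~ /VAL="([^"]*)"/;
      $sth->execute($host, $name, $val) if defined $val;
  }
  $dbh->disconnect;
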
Of course I could also just join the Ganglia development and focus on my 
needs :-)

Hints, ideas, opinions on how to solve these things are welcome.

/Jesper
