Hi! I have a couple of questions related to Ganglia. I've read some of the Ganglia documentation but haven't tried installing it yet.
I'm currently using a home-developed solution to monitor about 100 large Tru64 & Solaris machines. Some clustered, some non-clustered. None of them are large clusters, just 2-4 nodes, and each machine/cluster is running different stuff... not parallel/distributed research stuff but just normal business stuff (Oracle databases, Tuxedo, SAP, WebLogic etc.). It's a telco company. The home-developed solution I've made to monitor these machines is available at http://statdb.dassic.com/ (screenshots at http://statdb.dassic.com/docs/screenshots.php ). It works fine in the current environment, especially on Tru64 which is my primary platform, but I'd like to take it a little further with regard to design, platform support and scalability. This would either require a complete rewrite/redesign of my current solution or adaptation of another solution such as Ganglia. The problem, as I see it, is that none of the other tools in their current form let me do what I want. There are tools out there that do exactly what I want, but they all come with a big price tag. As far as I can see, Ganglia has good platform support and good framework design & scalability, but I think it falls short of my needs when it comes to what data it collects & presents and how I can choose to view that data. ...so that's what I'd like to ask about: what is possible with Ganglia, and how, in relation to my needs.

What I usually need is a tool that can tell me, and let me tell the person responsible for a service (who's usually the one asking), why some machine/application isn't performing well, either right now or at some point in the past. To do this I need precise information about CPU, memory, disk I/O, NICs, tape drives, and CPU usage on a per-process & per-user basis, and I need to be able to zoom in on a short period and filter out unwanted data. In my tool I've done this by making the graphing completely dynamic (controlled by the user) and by storing the data in a database. To allow gathering precise data, and a lot of it, without creating unwanted CPU overhead, I decided to use existing operating system tools as performance agents. So on Tru64 I use /usr/sbin/collect, and on Solaris I use SE Toolkit combined with a custom SE script I've written. The best would of course be to develop my own kernel collectors, but lack of time & knowledge prevented that. Monitoring these things with shell scripts is out of the question due to the CPU usage of such solutions.

What I currently do with Collect and SE Toolkit is run them with a 5-minute interval. Every 5 minutes these tools output average values for the last 5 minutes, and I then store those values in a central database. The performance collectors and the central database are tied together with some Perl code to send & receive the data. My solution is used elsewhere (it's open source) with other intervals and usually scales up to a couple of hundred systems even with shorter intervals such as 1 minute, but it won't scale up like Ganglia. I use MySQL (with InnoDB tables) or Oracle as the database, and there's a limit to how much data I can insert per second. I saw in the docs that Ganglia started out with MySQL too and ran into performance problems. Me too, but I managed to overcome the first bottleneck by changing to InnoDB, turning off database logging and doing various tuning. In my own setup with about 100 machines I don't have any problems currently, and the CPU usage of the database is low. I'm doing around 15 insert queries per second.
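To give an idea of what that Perl glue does, here is a stripped-down, untested sketch in Python. The "metric value" input format is made up for the example (the real collect / SE Toolkit output obviously looks different), and sqlite3 just stands in for the real MySQL/Oracle backend:

#!/usr/bin/env python
# Untested sketch of the collector-to-database glue: read one interval's
# worth of averaged "metric value" lines on stdin and insert them as rows.
# The input format and the sqlite3 backend are stand-ins for the example.
import sqlite3
import sys
import time

def parse_collector_output(lines):
    """Turn 'metric value' lines into (metric, float) pairs, skipping junk."""
    samples = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        try:
            samples.append((parts[0], float(parts[1])))
        except ValueError:
            continue
    return samples

def store_interval(db, hostname, samples):
    """Insert one 5-minute batch of averaged samples for one host."""
    now = int(time.time())
    db.executemany(
        "INSERT INTO samples (host, metric, value, ts) VALUES (?, ?, ?, ?)",
        [(hostname, metric, value, now) for metric, value in samples])
    db.commit()

if __name__ == "__main__":
    db = sqlite3.connect("statdb-example.db")
    db.execute("CREATE TABLE IF NOT EXISTS samples "
               "(host TEXT, metric TEXT, value REAL, ts INTEGER)")
    store_interval(db, "examplehost", parse_collector_output(sys.stdin))

The real code batches per host and per interval, but the principle is the same: one averaged value per metric every 5 minutes, inserted as plain rows that I can later query and graph any way I like.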
I've reduced the data volume by eliminating data that doesn't interest me, such as processes without CPU usage, disks without I/O, etc.

Anyway, given my needs, I'd like to know what I can do with Ganglia. First of all, I need much more data than the default operating system agents provide. I can see that I can feed such data into Ganglia with scripts, and thus would be able to wrap the output from a tool like Collect into Ganglia data (rough sketch below), but what do I do with the RRDs when it's dynamic data? Doing disk, user and process statistics, I'll generate lots of data that changes all the time, which means RRDs will need to be created dynamically. Secondly, to view selected parts of such data, I assume I'd run into problems with the current Ganglia web interface & RRD, right? I've previously tried using Cacti (the RRD web interface) and it's a no-go: way too time-consuming to configure and maintain. I need a solution that can maintain itself and adjust itself to my ever-changing systems (disks, NICs, CPUs, filesystems etc. are added and removed daily).

Another approach I've thought of is using the Ganglia agents & infrastructure, but then offloading the data with a wrapper script into a database and querying the database like I do with my current solution (also sketched below). This would give me the nice infrastructure and at the same time allow me the visualisation I need, but it still falls short of comprehensive kernel agents (I'd still have to do some ugly interfacing between custom collectors and Ganglia on each platform). Of course I could also just join the Ganglia development effort and focus on my needs :-)

Hints, ideas, opinions on how to solve these things are welcome.

/Jesper
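PS: to make the wrapper idea a bit more concrete, here is roughly what I'm thinking of. It's an untested sketch that shells out to gmetric, which I'm assuming is on the PATH and accepts the usual --name/--value/--type/--units options (check against your Ganglia version), and the "name value units" input format is made up; a real wrapper would parse Collect or SE Toolkit output instead. If I understand the docs correctly, gmetad creates a new RRD automatically the first time it sees a new metric name, which would at least take care of dynamically appearing disks/processes on the storage side.

#!/usr/bin/env python
# Untested sketch: wrap collector output into Ganglia custom metrics by
# shelling out to gmetric for each "name value units" line on stdin.
# Assumes gmetric is on PATH and takes --name/--value/--type/--units.
import subprocess
import sys

def publish(name, value, units=""):
    """Push one metric into the local gmond via gmetric."""
    subprocess.check_call([
        "gmetric",
        "--name=%s" % name,
        "--value=%s" % value,
        "--type=double",
        "--units=%s" % units,
    ])

if __name__ == "__main__":
    for line in sys.stdin:
        parts = line.split()
        if len(parts) < 2:
            continue
        units = parts[2] if len(parts) > 2 else ""
        publish(parts[0], parts[1], units)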
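PPS: and this is the offload idea: connect to a gmond, read the XML dump it writes back on its TCP port (8649 by default), and turn the HOST/METRIC elements into database rows I can query like I do today. Again an untested sketch; the element and attribute names (HOST NAME=..., METRIC NAME=... VAL=...) are what I've seen in the gmond XML dump, so double-check them against your version, and sqlite3 again stands in for MySQL/Oracle.

#!/usr/bin/env python
# Untested sketch: poll a gmond, parse its XML dump and store every
# host/metric pair as a row in a database instead of (or alongside) RRDs.
import socket
import sqlite3
import time
import xml.etree.ElementTree as ET

def fetch_gmond_xml(host="localhost", port=8649):
    """Read the full XML dump that gmond emits when you connect to it."""
    sock = socket.create_connection((host, port))
    chunks = []
    while True:
        data = sock.recv(8192)
        if not data:
            break
        chunks.append(data)
    sock.close()
    return b"".join(chunks)

def store_snapshot(db, xml_text):
    """Insert every HOST/METRIC pair from one gmond snapshot."""
    now = int(time.time())
    root = ET.fromstring(xml_text)
    rows = []
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            rows.append((host.get("NAME"), metric.get("NAME"),
                         metric.get("VAL"), now))
    db.executemany(
        "INSERT INTO samples (host, metric, value, ts) VALUES (?, ?, ?, ?)",
        rows)
    db.commit()

if __name__ == "__main__":
    db = sqlite3.connect("ganglia-offload-example.db")
    db.execute("CREATE TABLE IF NOT EXISTS samples "
               "(host TEXT, metric TEXT, value TEXT, ts INTEGER)")
    store_snapshot(db, fetch_gmond_xml())

Something like that could run from cron at whatever interval fits; gmond itself samples much more often, so this would just grab a periodic snapshot of whatever values gmond is holding at that moment.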

