Arjun,
I am glad that you brought this up since I am thinking about similar issues. I
haven't had time to do any real work yet, but I did have some ideas. I thought
perhaps I could pass them along in case they would be helpful to you. (And then
maybe you can weed out my bad ideas for me ;-)
Like Ramon and Martin, I had previously done some calculations to see how many
metrics would be reported per minute. It is quite a few. So my first thought
was that I would decrease the frequency to no more than once per minute for any
metric. That would be sufficient for my needs for the "important" metrics like
load and free memory. For things like cpu_num, etc. (which by default are
collected every 20 min), I would increase the interval to something like once a day.
In that scenario, I would only get about 25 metrics/min/node (assuming the
default metric set is used). That works out to roughly 11 million values per
day for a 300 node cluster, and the number could be decreased even more if I
identified the "required" metrics and eliminated the others.
As far as getting those values into a database, I had four ideas:
1) Hack gmetad - Rewrite the function that puts values into the rrds, and make
it put values into MySQL (this was previously mentioned in this mailing list).
2) Write a gmetad-like program - Write a new program that periodically queries
gmond (just like gmetad does) and writes the values to MySQL. (This too was
mentioned in this mailing list; a rough sketch appears further below.)
3) Hack gmond - Since gmond is written to collect the metrics anyway, why not
use it as a starting point? It would act like a normal mute gmond that just
sits there and listens for metrics, and then you could write an extra bit that
periodically pushes the data out to MySQL. (I have no idea how easy/hard this
would be.)
One good thing about this approach is that you wouldn't have to
record the same data twice. For example, an idle node might not have many
metrics that surpass the configured threshold, so maybe it only retransmits the
values every 10 minutes instead of every 1 minute. The hacked gmond could write
that value once instead of writing the previous value 10 times. Or the "normal"
gmond clients could be configured to always send a value every 1 minute
(ignoring thresholds), and the hacked version could enforce thresholds to
decide when data should be stored in MySQL.
4) Use the rrds - Keep it simple. Use the ability in gmetad.conf to
modify the rrd archive "format". Make it keep one value every minute for a 24
hour period, and then configure gmetad to collect data every minute (instead of
the default 15 secs). The new rrds would keep 1440 values for each metric (as
opposed to the 966 values by default). Then periodically copy the rrds over to
another machine and write a script that just digs through them, pulls out new
values, and puts them into the database. (A sketch of such a script follows
this list.)
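Here is roughly what that script might look like. This is an untested sketch:
it assumes the rrdtool command line tool and the Python MySQLdb module, and
the directory layout, table name, and column names are placeholders, not
anything Ganglia actually mandates.

    import os, subprocess, MySQLdb

    RRD_DIR = "/var/lib/ganglia/rrds/mycluster"    # hypothetical layout
    db = MySQLdb.connect(host="dbhost", user="ganglia",
                         passwd="secret", db="metrics")
    cur = db.cursor()

    for host in os.listdir(RRD_DIR):
        hostdir = os.path.join(RRD_DIR, host)
        for rrdfile in os.listdir(hostdir):
            metric = rrdfile[:-len(".rrd")]
            # Pull the last hour of 1-minute averages out of the rrd.
            out = subprocess.Popen(
                ["rrdtool", "fetch", os.path.join(hostdir, rrdfile),
                 "AVERAGE", "--start", "-3600"],
                stdout=subprocess.PIPE).communicate()[0].decode()
            # fetch prints a DS-name header, a blank line, then "ts: value".
            for line in out.splitlines()[2:]:
                if not line.strip():
                    continue
                ts, val = line.split(":", 1)
                if "nan" in val:       # unfilled slots come back as nan
                    continue
                # REPLACE keeps re-runs idempotent, assuming a unique
                # key on (host, metric, ts).
                cur.execute(
                    "REPLACE INTO samples (host, metric, ts, value) "
                    "VALUES (%s, %s, %s, %s)",
                    (host, metric, int(ts), float(val)))
    db.commit()

You would run that from cron every so often, after the rrds have been copied
over.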
The last idea is probably the simplest to implement, but it does have several
drawbacks. For one, you won't have the most current data in MySQL. It would
still be available through the web interface, but that won't help you if you
need to do a SQL query on it. I don't know how much of an issue that is for
you.
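If the staleness does matter, idea (2) mostly avoids it, since the poller
writes each sample as it is collected. Another untested sketch, assuming
gmond's default XML dump on TCP port 8649 and the same made-up samples table
as above:

    import socket, MySQLdb
    from xml.dom.minidom import parseString

    def poll(gmond_host, port=8649):
        # gmond dumps its current state as XML to anyone who connects.
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((gmond_host, port))
        chunks = []
        while True:
            data = s.recv(8192)
            if not data:
                break
            chunks.append(data)
        s.close()
        return parseString(b"".join(chunks))

    doc = poll("gmond-node.example.com")   # hypothetical host name
    db = MySQLdb.connect(user="ganglia", passwd="secret", db="metrics")
    cur = db.cursor()
    for host in doc.getElementsByTagName("HOST"):
        for m in host.getElementsByTagName("METRIC"):
            # Note: some metrics are strings (TYPE="string"); filter on
            # m.getAttribute("TYPE") if your value column is numeric.
            cur.execute(
                "INSERT INTO samples (host, metric, ts, value) "
                "VALUES (%s, %s, UNIX_TIMESTAMP(), %s)",
                (host.getAttribute("NAME"), m.getAttribute("NAME"),
                 m.getAttribute("VAL")))
    db.commit()

Run from cron every minute, the database would never be more than about a
minute behind.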
So the scoreboard is:
Ideas = 4
Implementations = 0
:-)
Hopefully some of this is useful to you. I look forward to seeing how you
accomplish your task.
-- Rick
--------------------------
Rick Mohr
Systems Developer
Ohio Supercomputer Center
On Thu, 23 Feb 2006, Martin Knoblauch wrote:
Arjun, Ramon,
my numbers look a bit different, but equally disturbing:
Let's assume 300 hosts with 36 metrics. I would not look at the RRD
format, but just store samples as they come from gmond.
That means we have 300 x 36 values per sample, about 11,000.
Now let's assume the same sample rate as the 1-hour resolution in RRD,
i.e. one sample every 15 seconds, or 240 per hour.
That gives 300 x 36 x 240: about 2.6 million values per hour,
about 62 million values per day,
about 22.7e9 values per year.
That is a lot of capacity, and it demands a lot of database performance.
Of course, a lot of the metrics in Ganglia are not that interesting to
most people, or do not need the 15-second resolution.
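To put that in storage terms, a quick sketch (Python; the 30 bytes per row is
only a guess at timestamp + host id + metric id + value plus row overhead):

    hosts, metrics = 300, 36
    samples_per_hour = 3600 // 15                  # one sample per 15 sec
    per_hour = hosts * metrics * samples_per_hour  # 2,592,000 rows/hour
    per_year = per_hour * 24 * 365                 # ~22.7e9 rows/year
    print("%.1fe9 rows/year, ~%.2f TB/year at 30 bytes/row"
          % (per_year / 1e9, per_year * 30.0 / 1e12))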
Cheers
Martin
--- Ramon Bastiaans <[EMAIL PROTECTED]> wrote:
Arjun,
I think performance of this database will be a HUGE issue, depending on
how many metrics/hosts/clusters and what timespan you wish to store.
Now please correct me if I'm wrong, but let's make an estimated calculation
of what is going to be stored in the database.
Let's make a few assumptions: you are only storing the default Ganglia
metrics and no custom/extra metrics, and you have a cluster of 300 hosts.
Also, you don't archive any values and you use the same
resolution/value scheme as used by the RRDs from gmetad (the same
static number of rotating values per resolution: hours, days, weeks,
months, years).
Metrics per host: 36
Hosts per cluster: 300
Now let's make a quick estimate of how many values you are going to
store in MySQL.
Ganglia uses about 240 rows per resolution (hour, day, week, month) and
370 rows for the year summary; that is 4 x 240 + 370 = 1330 rows per
metric, per host.
1330 * 36 * 300 = 14,364,000 values.
This comes down to almost 15 million values in your database, when
using the same style of value storage as currently done by Ganglia in
RRDs.
Now if you add:
- extra hosts: 1330 * 36 = 47,880 values/host
- extra metrics: 1330 * 300 = 399,000 values/metric
I don't know your particular setup, but here at SARA we monitor about
1800 machines in total with more than 50 metrics per host.
A quick estimate would come to 120 million values in the database.
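That estimate as a tiny reusable helper (Python; the 1330 figure is just the
4 x 240 + 370 row count from the default RRAs, as above):

    def rows(hosts, metrics, rows_per_metric=1330):
        # 1330 = 4 resolutions x ~240 rows + ~370 rows for the year summary
        return hosts * metrics * rows_per_metric

    print(rows(300, 36))     # 14364000  (~15 million)
    print(rows(1800, 50))    # 119700000 (~120 million, the SARA case)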
Now imagine querying/selecting from such a database....
The performance would seem hell to me, making it totally unusable from
the web environment where you want the values.
Also take into account that normally on the web frontend, graphs are
generated by RRDTool itself. But now you are using a database, so if you
want RRDTool to draw the graphs for you, you need to convert your
database values back to some sort of RRD format. This means a lot of
querying and converting of those values each time a
(host/metric/whatever) graph is requested, and it would require
additional hacks (or changes) to the web frontend code as well.
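To make the overhead concrete, each graph request would need something like
the following (a rough sketch only: it shells out to the rrdtool CLI, uses a
hypothetical samples(host, metric, ts, value) table, and assumes "sum" as the
DS name, which is what Ganglia uses in its rrds):

    import os, MySQLdb

    db = MySQLdb.connect(user="ganglia", passwd="secret", db="metrics")
    cur = db.cursor()
    cur.execute("SELECT ts, value FROM samples "
                "WHERE host = %s AND metric = %s ORDER BY ts",
                ("node001", "load_one"))   # hypothetical host/metric
    samples = cur.fetchall()

    # Rebuild a throwaway rrd and feed every sample back into it
    # (one rrdtool process per sample -- this is where the cost is) ...
    start = int(samples[0][0]) - 1
    os.system("rrdtool create /tmp/tmp.rrd --start %d --step 60 "
              "DS:sum:GAUGE:120:U:U RRA:AVERAGE:0.5:1:1500" % start)
    for ts, val in samples:
        os.system("rrdtool update /tmp/tmp.rrd %d:%s" % (int(ts), val))

    # ... just so rrdtool can draw one graph from it.
    os.system("rrdtool graph /tmp/graph.png --start %d "
              "DEF:v=/tmp/tmp.rrd:sum:AVERAGE LINE1:v#0000ff:load_one"
              % start)

Doing that on every page hit is exactly the kind of load I mean.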
I think you might need insane if not impossible hardware to support such
a (database) setup, but anyone correct me if I'm wrong.
Kind regards,
- Ramon.
Arjun wrote:
In my case the monitoring db will be on a separate machine along with
gmetad. I'm monitoring a cluster, so I can have a
separate (external) machine to store data on, so I guess this will not
be a performance bottleneck if I have a DB like
MySQL to store and retrieve data.
thanks
Arjun
--
ing. R. Bastiaans HPC - Systems Programmer
SARA - Computing and Networking Services
Kruislaan 415 PO Box 194613
1098 SJ Amsterdam 1090 GP Amsterdam
Tel. +31 (0) 20 592 3000 Fax. +31 (0) 20 668 3167
---
There are really only three types of people:
Those who make things happen, those who watch things happen
and those who say, "What happened?"
------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de