On Tue, Aug 24, 2010 at 20:22, Bernard Li <[email protected]> wrote:
> Hi all:
>
> Over the years there have been numerous discussions about putting
> Ganglia metrics into a relational database such as MySQL. I would
> like to bring that discussion up again, but focus just on the schema
> design.
>
> There are two ideas that we can work with:
>
> 1) 1 table per host/metric
>
> The table will most likely have around 4 columns: host, metric,
> timestamp, value. This approach allows you to pre-define the
> datatype of the metric (could be int, float, etc.). But this way, if
> you are monitoring 30 metrics from 1000 hosts, you will have 30,000
> tables. You will probably also have other tables holding relations
> between grids, clusters, hosts, etc.
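For illustration, here is a minimal sketch of schema (1) using Python's
sqlite3 (standing in for MySQL; host and metric names are hypothetical).
It shows how the table count multiplies: one table per host/metric pair.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema (1): one table per host/metric pair.  The metric's datatype is
# baked into each table definition, but the table count explodes:
# 1000 hosts x 30 metrics = 30,000 tables.
hosts = ["web01", "web02"]                              # stand-ins for 1000 hosts
metrics = [("load_one", "REAL"), ("proc_total", "INTEGER")]

for host in hosts:
    for metric, sqltype in metrics:
        conn.execute(
            f'CREATE TABLE "{host}_{metric}" (timestamp INTEGER, value {sqltype})'
        )

# Count the tables we just created.
n_tables = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table'"
).fetchone()[0]
print(n_tables)  # 2 hosts x 2 metrics = 4 tables
```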
Bad idea. I've dealt with schemas like this, and it's a disaster.

> 2) 1 flat table for all metrics
>
> This suggestion is from Spike, and the idea is that you store values as
> strings and also have another column for the datatype.

Better, but still not great. I like having a single large table, since
it keeps things simple. For comparison, I believe that the default RRD
settings in gmetad will store 1350 rows of data across all of the RRAs.
Most of these, of course, are consolidated data points, whereas I assume
a "real" database will only store the actual data. A quick check of my
largest Ganglia install shows about 10,500 files, and 14,093,810 rows
(summing up the data from 'rrdtool info'). Any halfway decent database
on halfway decent hardware should be able to handle that.

Dealing with strings is a mess as well, and having to cast thousands of
strings into ints and floats is expensive (if you have to do it all the
time). I don't really like that idea.

Unfortunately, we fundamentally have to deal with different types of
data. So here are three more ideas to kick around:

3) Use one table per host, with a column per metric type. Each table
would have columns like: timestamp, metric, int64, double. If the data
is integral in nature, it is cast to int64; if floating point, it
becomes a double. One of these columns in each row will be NULL. Metric
could be a CHAR (for speed), or an INT (referencing another table).

4) Use multiple tables for each host, but make a new one for each
datatype. Thus, for host "foo", you'd have at least two tables: foo_int
and foo_float. This assumes you are promoting everything to a canonical
integer and floating-point type. Each table would have three columns:
timestamp, metric (as above), and data (which would be of the
appropriate type). If you want to be more fine-grained about it, and
don't want to cast data from, say, int8 to int64, you could create a
different table for each specific datatype.
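A quick sqlite3 sketch of idea (3), the typed-columns approach (table
and metric names are made up; SQLite again stands in for a real server):
each row fills exactly one of the int64/double columns and leaves the
other NULL, so no string casting is needed on read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema (3): one table per host.  Integral metrics go in the int64
# column, floating-point metrics in the double column; the unused
# column in each row is NULL.
conn.execute("""
    CREATE TABLE host_foo (
        timestamp INTEGER NOT NULL,
        metric    TEXT    NOT NULL,   -- or an INT key into a metrics table
        int64     INTEGER,            -- NULL for floating-point metrics
        double    REAL                -- NULL for integral metrics
    )
""")

conn.execute("INSERT INTO host_foo VALUES (1282700000, 'proc_total', 423, NULL)")
conn.execute("INSERT INTO host_foo VALUES (1282700000, 'load_one', NULL, 0.72)")

# Reading a value back just means picking whichever column is non-NULL:
row = conn.execute(
    "SELECT COALESCE(int64, double) FROM host_foo WHERE metric = 'load_one'"
).fetchone()
print(row[0])  # 0.72
```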
5) Don't use SQL, but use some other sort of data storage mechanism:
NoSQL, Hadoop FS, etc.

> If you have any other ideas, please throw them into the mix here. The
> schema design would also largely be affected by the desired usage of
> these metrics data.

Yep. I'd also suggest that slavish adherence to normalizing the tables
will make things slow. :) Normalization is not a bad thing, but I'll
trade storage space for speed in this case.

--
Jesse Becker

_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general

