On Tue, Aug 24, 2010 at 20:22, Bernard Li <[email protected]> wrote:
> Hi all:
>
> Over the years there have been numerous discussions about putting
> Ganglia metrics into a relational database such as MySQL. I would
> like to bring that discussion up again, but focus just on the schema
> design.
>
> There are two ideas that we can work with:
>
> 1) 1 table per host/metric
>
> The table will most likely have around 4 columns: host, metric,
> timestamp, value. This approach allows you to pre-define the
> datatype of the metric (could be int, float, etc.). But this way, if
> you are monitoring 30 metrics from 1000 hosts, you will have 30,000
> tables. You will probably also have other tables holding relations
> between grids, clusters, hosts, etc.
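For illustration, here is a minimal sketch of schema (1) using Python's
sqlite3 (standing in for MySQL; host and metric names are hypothetical).
It shows how the table count multiplies: one table per host/metric pair.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema (1): one table per host/metric pair.  The metric's datatype is
# baked into each table definition, but the table count explodes:
# 1000 hosts x 30 metrics = 30,000 tables.
hosts = ["web01", "web02"]                              # stand-ins for 1000 hosts
metrics = [("load_one", "REAL"), ("proc_total", "INTEGER")]

for host in hosts:
    for metric, sqltype in metrics:
        conn.execute(
            f'CREATE TABLE "{host}_{metric}" (timestamp INTEGER, value {sqltype})'
        )

# Count the tables we just created.
n_tables = conn.execute(
    "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table'"
).fetchone()[0]
print(n_tables)  # 2 hosts x 2 metrics = 4 tables
```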
Bad idea. I've dealt with schemas like this, and it's a disaster.

> 2) 1 flat table for all metrics
>
> This suggestion is from Spike, and the idea is that you store values as
> strings and also have another column for the datatype.

Better, but still not great. I like having a single large table, since
it keeps things simple. For comparison, I believe that the default RRD
settings in gmetad will store 1350 rows of data across all of the RRAs.
Most of these, of course, are consolidated data points, whereas I assume
a "real" database will only store the actual data. A quick check of my
largest Ganglia install shows about 10,500 files, and 14,093,810 rows
(summing up the data from 'rrdtool info'). Any halfway decent database
on halfway decent hardware should be able to handle that.

Dealing with strings is a mess as well, and having to cast thousands of
strings into ints and floats is expensive (if you have to do it all the
time). I don't really like that idea.

Unfortunately, we fundamentally have to deal with different types of
data. So here are three more ideas to kick around:

3) Use one table per host, with a column per metric type. Each table
would have columns like: timestamp, metric, int64, double. If the data
is integral in nature, it is cast to int64; if floating point, it
becomes a double. One of these columns in each row will be NULL. Metric
could be a CHAR (for speed), or an INT (referencing another table).

4) Use multiple tables for each host, but make a new one for each
datatype. Thus, for host "foo", you'd have at least two tables: foo_int
and foo_float. This assumes you are promoting everything to a canonical
integer and floating-point type. Each table would have three columns:
timestamp, metric (as above), and data (which would be of the
appropriate type). If you want to be more fine-grained about it, and
don't want to cast data from, say, int8 to int64, you could create a
different table for each specific datatype.
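A quick sqlite3 sketch of idea (3), the typed-columns approach (table
and metric names are made up; SQLite again stands in for a real server):
each row fills exactly one of the int64/double columns and leaves the
other NULL, so no string casting is needed on read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema (3): one table per host.  Integral metrics go in the int64
# column, floating-point metrics in the double column; the unused
# column in each row is NULL.
conn.execute("""
    CREATE TABLE host_foo (
        timestamp INTEGER NOT NULL,
        metric    TEXT    NOT NULL,   -- or an INT key into a metrics table
        int64     INTEGER,            -- NULL for floating-point metrics
        double    REAL                -- NULL for integral metrics
    )
""")

conn.execute("INSERT INTO host_foo VALUES (1282700000, 'proc_total', 423, NULL)")
conn.execute("INSERT INTO host_foo VALUES (1282700000, 'load_one', NULL, 0.72)")

# Reading a value back just means picking whichever column is non-NULL:
row = conn.execute(
    "SELECT COALESCE(int64, double) FROM host_foo WHERE metric = 'load_one'"
).fetchone()
print(row[0])  # 0.72
```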
5) Don't use SQL, but use some other sort of data storage mechanism:
NoSQL, Hadoop FS, etc.

> If you have any other ideas, please throw them into the mix here. The
> schema design would also largely be affected by the desired usage of
> these metrics data.

Yep. I'd also suggest that slavish adherence to normalizing the tables
will make things slow. :) Normalization is not a bad thing, but I'll
trade storage space for speed in this case.

--
Jesse Becker

_______________________________________________
Ganglia-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/ganglia-general

