Hi, I'm pretty new to HBase, so bear with me. I'm porting a system that stores time-series data over to both HBase and Cassandra. The port itself is pretty straightforward, but it opens up a ton of questions about how the schema should be designed and how the data should be stored.
I'm hoping some of you might have advice on which route I should settle on. I'd like to go over some ideas and the questions they raise. (I've also tacked a few rough code sketches onto the bottom of this mail to make the options concrete.) Let's say we're collecting data about hosts: load average, disk free, etc.

Ideas I've come up with so far:

Option 1:

    row key:        host
    cf/qualifier:   attribute:name
    value:          whatever
    row timestamp:  time of sampling

Example:

    ROW      COLUMN+CELL
    butters  column=attribute:load_1, timestamp=1266208871, value=0.3
    butters  column=attribute:load_1, timestamp=1266208853, value=0.3
    butters  column=attribute:load_1, timestamp=1266208324, value=0.3
    butters  column=attribute:load_5, timestamp=1266208871, value=1.3
    butters  column=attribute:load_5, timestamp=1266208853, value=1.3
    butters  column=attribute:load_5, timestamp=1266208325, value=1.3
    kenny    column=attribute:load_1, timestamp=1266208871, value=0.6
    kenny    column=attribute:load_1, timestamp=1266208853, value=0.6
    kenny    column=attribute:load_1, timestamp=1266208324, value=0.6
    etc.

You can easily fetch multiple versions using setTimeRange(), and you can potentially add new data sampled for a host just by adding another qualifier to that row (e.g. load_5 or disk_free or whatever). The downside, or the part I don't really understand, is versioning: I can't figure out how many versions I actually have (count 'table' doesn't report version counts), or how to configure how many versions get stored; it appears to keep only 3 by default, and I don't know how to change that. I also don't know whether versions of a row are physically stored next to the previous versions of that row or not. Is this a suboptimal design if we expect the data to grow to several terabytes within the year (or, say, 3 months)?

Option 2:

    row key:        host-timestamp
    cf/qualifier:   attribute:name
    value:          whatever
    row timestamp:  time of row insert

Example:

    ROW                 COLUMN+CELL
    butters-1266208871  column=attribute:load_1, value=0.3
    butters-1266208853  column=attribute:load_1, value=0.3
    butters-1266208324  column=attribute:load_1, value=0.3
    butters-1266208871  column=attribute:load_5, value=1.3
    butters-1266208853  column=attribute:load_5, value=1.3
    butters-1266208325  column=attribute:load_5, value=1.3
    kenny-1266208871    column=attribute:load_1, value=0.6
    kenny-1266208853    column=attribute:load_1, value=0.6
    kenny-1266208324    column=attribute:load_1, value=0.6
    ...

This duplicates the timestamp in the key, which seems wasteful, but it gives proper row counts and lets you use key-range scans to fetch rows in a time range. Is this going to be slower, but spread the data out further? Is it a good idea at all? And is filtering on a column family's qualifier efficient? Say I want load_1 for the last 2 years for a host, but there are 1000 or so qualifiers per row: is that going to explode anywhere?

I've also thought about:

    row key:        host-attribute
    cf/qualifier:   value:
    value:          whatever
    row timestamp:  time of row insert

or:

    row key:        host-attribute-timestamp
    cf/qualifier:   value:
    value:          whatever
    row timestamp:  time of row insert

but these also seem like suboptimal methods. If anyone has experience or thoughts on any of this, I'd love to hear them!

Also, I incorporated the Thrift server changes from http://issues.apache.org/jira/browse/HBASE-1744 into the current Thrift server to get at the time-range and version functionality (I added a few features to the API and fixed it so it'll work with Python). If anyone is interested in having that patch for the current Thrift server, let me know; it's simple and I'll happily share it. I'll likely put it up on github soon.

Thanks!
Eric
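
P.S. Here are the code sketches I mentioned. A big caveat up front: these are untested outlines against the 0.20-style Java client, the "metrics" table name is just made up for illustration, and corrections are welcome. On the versions question, after writing the above I dug around a bit, and I believe the knob is per column family: HColumnDescriptor.setMaxVersions() when the table is created (the default being the 3 I was seeing):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateMetricsTable {
        public static void main(String[] args) throws Exception {
            HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
            HTableDescriptor desc = new HTableDescriptor("metrics");
            // Versions are configured per column family; the default is 3.
            HColumnDescriptor family = new HColumnDescriptor("attribute");
            family.setMaxVersions(Integer.MAX_VALUE); // keep every sample
            desc.addFamily(family);
            admin.createTable(desc);
        }
    }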
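
For option 1, reading a window of samples back would then be setTimeRange() plus setMaxVersions() on a Get, something like the sketch below. One wrinkle I should flag: HBase stamps cells with the current time in milliseconds unless you set a timestamp explicitly on write, so to match the epoch-seconds timestamps in my examples I'd have to set the cell timestamp myself when inserting:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GetLoadVersions {
        public static void main(String[] args) throws Exception {
            HTable table = new HTable(new HBaseConfiguration(), "metrics");
            // Fetch all stored versions of attribute:load_1 for host
            // "butters" within a timestamp window.
            Get get = new Get(Bytes.toBytes("butters"));
            get.addColumn(Bytes.toBytes("attribute"), Bytes.toBytes("load_1"));
            get.setTimeRange(1266208324L, 1266208872L); // [min, max)
            get.setMaxVersions(); // no argument = every version in the range
            Result result = table.get(get);
            for (KeyValue kv : result.raw()) {
                // Assumes values were written as strings, e.g. "0.3".
                System.out.println(kv.getTimestamp() + " = "
                    + Bytes.toString(kv.getValue()));
            }
        }
    }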
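
For option 2, the same fetch becomes a plain row-range scan, with addColumn() restricting it to the one qualifier I care about. One thing writing this made obvious: row keys sort lexicographically, so the timestamp part of the key needs a fixed-width encoding (zero-padded seconds, or a big-endian binary long) for range scans to behave:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanHostRange {
        public static void main(String[] args) throws Exception {
            HTable table = new HTable(new HBaseConfiguration(), "metrics");
            // Start row is inclusive, stop row is exclusive; this works
            // because the timestamps are all the same width.
            Scan scan = new Scan(Bytes.toBytes("butters-1266208324"),
                                 Bytes.toBytes("butters-1266208872"));
            scan.addColumn(Bytes.toBytes("attribute"), Bytes.toBytes("load_1"));
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()) + " = "
                        + Bytes.toString(row.getValue(
                              Bytes.toBytes("attribute"),
                              Bytes.toBytes("load_1"))));
                }
            } finally {
                scanner.close();
            }
        }
    }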