Hi, I'm pretty new to HBase, so bear with me. I'm porting a system that stores time-series data over to both HBase and Cassandra. The port itself is pretty straightforward, but it opens up a ton of questions about how the schema should be designed and how the data should be stored.
I'm hoping some of you might have advice on which route I should settle on. I'd like to go over some ideas and the questions they raise. (I've also tacked a few rough code sketches onto the bottom of this mail to make the options concrete.) Let's say we're collecting data about hosts: load average, disk free, etc.

Ideas I've come up with so far:

Option 1:

    row key:        host
    cf/qualifier:   attribute:name
    value:          whatever
    row timestamp:  time of sampling

Example:

    ROW      COLUMN+CELL
    butters  column=attribute:load_1, timestamp=1266208871, value=0.3
    butters  column=attribute:load_1, timestamp=1266208853, value=0.3
    butters  column=attribute:load_1, timestamp=1266208324, value=0.3
    butters  column=attribute:load_5, timestamp=1266208871, value=1.3
    butters  column=attribute:load_5, timestamp=1266208853, value=1.3
    butters  column=attribute:load_5, timestamp=1266208325, value=1.3
    kenny    column=attribute:load_1, timestamp=1266208871, value=0.6
    kenny    column=attribute:load_1, timestamp=1266208853, value=0.6
    kenny    column=attribute:load_1, timestamp=1266208324, value=0.6
    etc.

You can easily fetch multiple versions using setTimeRange(), and you can potentially add new data sampled for a host just by adding another qualifier to that row (e.g. load_5 or disk_free or whatever). The downside, or the part I don't really understand, is versioning: I can't figure out how many versions I actually have (count 'table' doesn't report version counts), or how to configure how many versions get stored; it appears to keep only 3 by default, and I don't know how to change that. I also don't know whether versions of a row are physically stored next to the previous versions of that row or not. Is this a suboptimal design if we expect the data to grow to several terabytes within the year (or, say, 3 months)?

Option 2:

    row key:        host-timestamp
    cf/qualifier:   attribute:name
    value:          whatever
    row timestamp:  time of row insert

Example:

    ROW                 COLUMN+CELL
    butters-1266208871  column=attribute:load_1, value=0.3
    butters-1266208853  column=attribute:load_1, value=0.3
    butters-1266208324  column=attribute:load_1, value=0.3
    butters-1266208871  column=attribute:load_5, value=1.3
    butters-1266208853  column=attribute:load_5, value=1.3
    butters-1266208325  column=attribute:load_5, value=1.3
    kenny-1266208871    column=attribute:load_1, value=0.6
    kenny-1266208853    column=attribute:load_1, value=0.6
    kenny-1266208324    column=attribute:load_1, value=0.6
    ...

This duplicates the timestamp in the key, which seems wasteful, but it gives proper row counts and lets you use key-range scans to fetch rows in a time range. Is this going to be slower, but spread the data out further? Is it a good idea at all? And is filtering on a column family's qualifier efficient? Say I want load_1 for the last 2 years for a host, but there are 1000 or so qualifiers per row: is that going to explode anywhere?

I've also thought about:

    row key:        host-attribute
    cf/qualifier:   value:
    value:          whatever
    row timestamp:  time of row insert

or:

    row key:        host-attribute-timestamp
    cf/qualifier:   value:
    value:          whatever
    row timestamp:  time of row insert

but these also seem like suboptimal methods. If anyone has experience or thoughts on any of this, I'd love to hear them!

Also, I incorporated the Thrift server changes from http://issues.apache.org/jira/browse/HBASE-1744 into the current Thrift server to get at the time-range and version functionality (I added a few features to the API and fixed it so it'll work with Python). If anyone is interested in having that patch for the current Thrift server, let me know; it's simple and I'll happily share it. I'll likely put it up on github soon.

Thanks!
Eric
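
P.S. Here are the code sketches I mentioned. A big caveat up front: these are untested outlines against the 0.20-style Java client, the "metrics" table name is just made up for illustration, and corrections are welcome. On the versions question, after writing the above I dug around a bit, and I believe the knob is per column family: HColumnDescriptor.setMaxVersions() when the table is created (the default being the 3 I was seeing):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CreateMetricsTable {
        public static void main(String[] args) throws Exception {
            HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
            HTableDescriptor desc = new HTableDescriptor("metrics");
            // Versions are configured per column family; the default is 3.
            HColumnDescriptor family = new HColumnDescriptor("attribute");
            family.setMaxVersions(Integer.MAX_VALUE); // keep every sample
            desc.addFamily(family);
            admin.createTable(desc);
        }
    }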
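
For option 1, reading a window of samples back would then be setTimeRange() plus setMaxVersions() on a Get, something like the sketch below. One wrinkle I should flag: HBase stamps cells with the current time in milliseconds unless you set a timestamp explicitly on write, so to match the epoch-seconds timestamps in my examples I'd have to set the cell timestamp myself when inserting:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GetLoadVersions {
        public static void main(String[] args) throws Exception {
            HTable table = new HTable(new HBaseConfiguration(), "metrics");
            // Fetch all stored versions of attribute:load_1 for host
            // "butters" within a timestamp window.
            Get get = new Get(Bytes.toBytes("butters"));
            get.addColumn(Bytes.toBytes("attribute"), Bytes.toBytes("load_1"));
            get.setTimeRange(1266208324L, 1266208872L); // [min, max)
            get.setMaxVersions(); // no argument = every version in the range
            Result result = table.get(get);
            for (KeyValue kv : result.raw()) {
                // Assumes values were written as strings, e.g. "0.3".
                System.out.println(kv.getTimestamp() + " = "
                    + Bytes.toString(kv.getValue()));
            }
        }
    }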
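
For option 2, the same fetch becomes a plain row-range scan, with addColumn() restricting it to the one qualifier I care about. One thing writing this made obvious: row keys sort lexicographically, so the timestamp part of the key needs a fixed-width encoding (zero-padded seconds, or a big-endian binary long) for range scans to behave:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanHostRange {
        public static void main(String[] args) throws Exception {
            HTable table = new HTable(new HBaseConfiguration(), "metrics");
            // Start row is inclusive, stop row is exclusive; this works
            // because the timestamps are all the same width.
            Scan scan = new Scan(Bytes.toBytes("butters-1266208324"),
                                 Bytes.toBytes("butters-1266208872"));
            scan.addColumn(Bytes.toBytes("attribute"), Bytes.toBytes("load_1"));
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()) + " = "
                        + Bytes.toString(row.getValue(
                              Bytes.toBytes("attribute"),
                              Bytes.toBytes("load_1"))));
                }
            } finally {
                scanner.close();
            }
        }
    }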