If the row key is a date/time and the data arrives in date/time order, then when you load/insert data into the table, only one region (on one node) is active to receive new data. Load performance will be poor.
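A common workaround is to prefix the row key with a hash-derived "salt" bucket so that time-ordered writes spread across several regions instead of one. Here is a minimal sketch of the idea (not HBase client code; the bucket count, separator, and key layout are my own assumptions):

```python
import hashlib

NUM_BUCKETS = 16  # assumed number of pre-split regions

def salted_key(instrument: str, timestamp: str) -> str:
    """Prefix the key with a stable hash bucket so sequential
    timestamps fan out across NUM_BUCKETS key ranges instead of
    all landing on the tail of one region."""
    digest = hashlib.md5(instrument.encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return f"{bucket:02d}|{instrument}|{timestamp}"

# Keys for the same instrument share a bucket, so a per-instrument
# time-range scan still touches one contiguous key prefix.
k1 = salted_key("IBM", "2009-01-17T11:23:04")
k2 = salted_key("IBM", "2009-01-17T11:23:05")
```

The trade-off is that a scan across all instruments for a date range must now issue one scan per bucket and merge the results.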
On Wed, Apr 1, 2009 at 10:25 AM, Bradford Cross <[email protected]> wrote:

> Greetings,
>
> I am prototyping a financial time series database on top of HBase and
> trying to get my head around what a good design would look like.
>
> As I understand it, I have rows, column families, columns and cells.
>
> Since the only thing that HBase really "indexes" is row keys, it seems
> natural in a way to represent the row keys as the date/time.
>
> As a simple example:
>
> Bar data:
>
> {
>   "2009/1/17" : {
>     "open":"100",
>     "high":"102",
>     "low":"99",
>     "close":"101",
>     "volume":"1000256"
>   }
> }
>
> Quote data:
>
> {
>   "2009/1/17:11:23:04" : {
>     "bid":"100.01",
>     "ask":"100.02",
>     "bidsize":"10000",
>     "asksize":"100200"
>   }
> }
>
> But there are many other issues to think about.
>
> In financial time series data we have small amounts of data within each
> "observation" and we can have lots of observations. We can have millions
> of observations per time series (f.ex. all historical trade and quote
> data for a particular stock since 1993) across hundreds of thousands of
> individual instruments (f.ex. all stocks that have traded since 1993).
>
> The write patterns fit HBase nicely, because it is a write-once-and-append
> pattern. This is followed by loads of offline processes for simulating
> trading models and such. These query patterns look like "all quotes for
> all stocks between the dates of 1/1/1996 and 12/31/2008." So the querying
> is typically across a date range, and we can further filter the query by
> instrument types.
>
> So I am not sure what makes sense for efficiency because I do not
> understand HBase well enough yet.
>
> What kinds of mixes of rows, column families, and columns should I be
> thinking about?
>
> Does my simplistic approach make any sense? That would mean each row is
> a key-value pair where the key is the date/time and the value is the
> "observation." I suppose this leads to a "table per time series" model.
> Does that make sense, or is there overhead to having lots of tables?
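On the table-per-series question: an alternative is one table with a composite row key such as instrument + date, so that HBase's sorted-by-key storage turns a per-instrument date-range query into a single contiguous scan. A sketch of how such keys behave (the key format and sample data are my own assumptions; the sorted list stands in for HBase's key ordering):

```python
import bisect

# Hypothetical composite keys: one table keyed "instrument#date"
# rather than a table per time series. HBase stores rows sorted
# by key, so a date range for one instrument is a contiguous scan.
rows = sorted([
    "AAPL#2008-12-31", "AAPL#2009-01-16", "AAPL#2009-01-17",
    "IBM#2008-12-31",  "IBM#2009-01-16",  "IBM#2009-01-17",
])

def range_scan(start_key: str, stop_key: str) -> list[str]:
    """Emulate a [start_key, stop_key) row scan over sorted keys."""
    lo = bisect.bisect_left(rows, start_key)
    hi = bisect.bisect_left(rows, stop_key)
    return rows[lo:hi]

hits = range_scan("IBM#2009-01-01", "IBM#2009-12-31")
# hits -> ["IBM#2009-01-16", "IBM#2009-01-17"]
```

Note that leading with the instrument trades away the date-first hotspotting problem, but a query across *all* instruments for one date range then needs a filtered full scan or one scan per instrument prefix.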
