If the rowkey is date/time and the data arrives in date/time order, then
when you load/insert data into the table, only one region (on one node) is
active receiving new data. Load performance will be poor.
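One common way around this hot-region problem is to salt the rowkey with a
deterministic bucket prefix so consecutive timestamps land in different
regions. A rough sketch in Python (the bucket count, key format, and hash
choice are assumptions to illustrate the idea, not HBase API calls):

```python
import hashlib

NUM_BUCKETS = 16  # assumed bucket count; in practice tune to your region count


def salted_key(timestamp_key: str) -> str:
    """Prefix a monotonically increasing key with a deterministic salt.

    Consecutive timestamps hash to different buckets, so sequential
    writes spread across regions instead of piling onto one.
    """
    digest = hashlib.md5(timestamp_key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return "%02d-%s" % (bucket, timestamp_key)


# Two consecutive timestamps now sort far apart in the keyspace:
print(salted_key("2009-01-17T11:23:04"))
print(salted_key("2009-01-17T11:23:05"))
```

The trade-off is that a time-range scan must now fan out across all
NUM_BUCKETS prefixes and merge the results, so salting helps write
throughput at some cost to read simplicity.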

On Wed, Apr 1, 2009 at 10:25 AM, Bradford Cross
<[email protected]> wrote:

> Greetings,
>
> I am prototyping a financial time series database on top of HBase and
> trying
> to wrap my head around what a good design would look like.
>
> As I understand it, I have rows, column families, columns and cells.
>
> Since the only thing that HBase really "indexes" is row keys, it seems
> natural in a way to represent the rowkeys as the date/time.
>
> As a simple example:
>
> Bar data:
>
> {
>   "2009/1/17" : {
>     "open":"100",
>     "high":"102",
>     "low":"99",
>     "close":"101",
>     "volume":"1000256"
>   }
> }
>
>
> Quote data:
>
> {
>   "2009/1/17:11:23:04" : {
>     "bid":"100.01",
>     "ask":"100.02",
>     "bidsize":"10000",
>     "asksize":"100200"
>   }
> }
>
> But there are many other issues to think about.
>
> In financial time series data we have small amounts of data within each
> "observation" and we can have lots of observations.  We can have millions
> of
> observations per time series (f.ex. all historical trade and quote data for
> a particular stock since 1993) across hundreds of thousands of individual
> instruments (f.ex. across all stocks that have traded since 1993.)
>
> The write patterns fit HBase nicely, because it is a write once and append
> pattern.  This is followed by loads of offline processes for simulating
> trading models and such.  These query patterns look like "all quotes for
> all
> stocks between the dates of 1/1/1996 and 12/31/2008."  So the querying is
> typically across a date range, and we can further filter the query by
> instrument types.
>
> So I am not sure what makes sense for efficiency because I do not
> understand
> HBase well enough yet.
>
>  What kinds of mixes of rows, column families, and columns should I be
> thinking about?
>
> Does my simplistic approach make any sense?  That would mean each row is a
> key-value pair where the key is the date/time and the value is the
> "observation."  I suppose this leads to a "table per time series" model.
> Does that make sense or is there overhead to having lots of tables?
>
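On the table-per-series question: rather than one table per time series, a
common pattern is a single table with a composite rowkey of instrument plus
timestamp. Since HBase stores rows sorted by key, a zero-padded ISO-8601
timestamp sorts lexicographically in chronological order, so each
instrument's observations are contiguous and time-ordered. A sketch (the
`symbol#timestamp` format is just an illustrative convention):

```python
def make_rowkey(symbol: str, ts: str) -> str:
    """Composite rowkey: instrument symbol, then an ISO-8601 timestamp.

    Zero-padded timestamps compare lexicographically in chronological
    order, so one table can hold all instruments while keeping each
    symbol's data stored contiguously and sorted by time.
    """
    return "%s#%s" % (symbol, ts)


# Earlier timestamp sorts before later one under plain string comparison:
earlier = make_rowkey("IBM", "1996-01-01T00:00:00")
later = make_rowkey("IBM", "2008-12-31T23:59:59")
print(earlier < later)
```

This avoids the overhead of managing hundreds of thousands of tables, and
the "all quotes for one stock in a date range" query becomes a single
contiguous scan.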
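For the date-range query pattern, an HBase Scan bounded by a start row and
a stop row (stop exclusive) is essentially a slice over the sorted
keyspace. A rough simulation of that behavior over an in-memory sorted key
list, assuming the composite `symbol#timestamp` key convention (the sample
symbols and dates are made up for illustration):

```python
import bisect


def scan_range(sorted_keys, symbol, start_date, stop_date):
    """Simulate an HBase Scan with startRow/stopRow over sorted rowkeys.

    Builds the bounding keys for one symbol and returns every key in
    [start, stop), exactly as a bounded scan would over the sorted table.
    """
    start = "%s#%s" % (symbol, start_date)
    stop = "%s#%s" % (symbol, stop_date)
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


keys = sorted([
    "IBM#1995-06-01T10:00:00",
    "IBM#1996-03-15T10:00:00",
    "IBM#2008-11-01T10:00:00",
    "IBM#2009-02-01T10:00:00",
    "MSFT#2000-01-01T10:00:00",
])
print(scan_range(keys, "IBM", "1996-01-01", "2009-01-01"))
```

Note that this only stays a single contiguous scan if the instrument leads
the rowkey; if keys were salted for write throughput, the scan would have
to be repeated once per salt bucket.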
