After reading the HBase architecture wiki, I have changed my thinking
because of its description of "Column Family Centric Storage":

HBase stores column families physically close on disk, so the items in a
given column family should have roughly the same read/write characteristics
and contain similar data.  Although at a conceptual level, tables may be
viewed as a sparse set of rows, physically they are stored on a per-column
family basis. This is an important consideration for schema and application
designers to keep in mind.
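
To make the quoted point concrete for myself, here is a toy sketch in plain
Python (no HBase client involved). It only illustrates the idea that a
logically sparse table ends up grouped per column family; the tickers, the
"bar" and "quote" family names, and the split_by_family helper are all made
up for illustration and say nothing about HBase's actual file format.

```python
from collections import defaultdict

# Logical view: row key -> {"family:qualifier": value}. Rows are sparse,
# so MSFT simply has no cells in the hypothetical "bar" family.
logical_rows = {
    "IBM": {"bar:open": "100", "bar:close": "101", "quote:bid": "100.01"},
    "MSFT": {"quote:bid": "20.05", "quote:ask": "20.06"},
}

def split_by_family(rows):
    """Group cells by column family, mimicking one store per family."""
    stores = defaultdict(dict)
    for row_key, cells in rows.items():
        for column, value in cells.items():
            family, qualifier = column.split(":", 1)
            stores[family][(row_key, qualifier)] = value
    return dict(stores)

stores = split_by_family(logical_rows)
```

Reading only the "bar" data then touches only stores["bar"], which is why
data with similar access patterns seems to belong in the same family.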

This leads me to the thought of keeping an entire time series inside a
single column family.

Options:

Row key is a ticker symbol:
- Hijack the timestamp to be the time of each observation.  Use a column
family to hold all the data, and a column for each property of each
observation.
- Don't hijack the timestamp, just ignore it.  Use a column family for all
the data, and use an individual column for the date/time of the observation,
and individual columns for each property of each observation.
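
A rough sketch of how the two options might look, modelled with plain
Python dicts rather than the real HBase API. Everything here is an
assumption for illustration: the "series" family name, the integer
yyyymmdd timestamps, the helper names, and in particular the per-observation
qualifier prefix (obs_id) that option 2 seems to need so that observations
in the same row do not overwrite each other.

```python
# Option 1: the cell timestamp carries the observation time.
# Shape: row key -> "family:qualifier" -> {timestamp: value}
def put_versioned(table, row, column, ts, value):
    table.setdefault(row, {}).setdefault(column, {})[ts] = value

def scan_versioned(table, row, start_ts, end_ts):
    """Return {ts: {column: value}} for start_ts <= ts < end_ts."""
    out = {}
    for column, versions in table.get(row, {}).items():
        for ts, value in versions.items():
            if start_ts <= ts < end_ts:
                out.setdefault(ts, {})[column] = value
    return dict(sorted(out.items()))

# Option 2: ignore the cell timestamp and store the time in its own column.
# Shape: row key -> "family:obs_id:property" -> value (obs_id is hypothetical)
def put_explicit(table, row, obs_id, obs_time, props):
    cells = table.setdefault(row, {})
    cells[f"series:{obs_id}:time"] = obs_time
    for name, value in props.items():
        cells[f"series:{obs_id}:{name}"] = value

def scan_explicit(table, row, start_ts, end_ts):
    """Regroup columns into observations, then filter on the time column."""
    grouped = {}
    for column, value in table.get(row, {}).items():
        _family, obs_id, name = column.split(":", 2)
        grouped.setdefault(obs_id, {})[name] = value
    hits = [obs for obs in grouped.values() if start_ts <= obs["time"] < end_ts]
    return sorted(hits, key=lambda obs: obs["time"])

option1, option2 = {}, {}
bars = [
    (20090116, {"open": "99", "close": "100"}),
    (20090117, {"open": "100", "close": "101"}),
]
for i, (ts, props) in enumerate(bars):
    for name, value in props.items():
        put_versioned(option1, "IBM", f"series:{name}", ts, value)
    put_explicit(option2, "IBM", f"{i:06d}", ts, props)
```

In this toy model option 1 can filter on the timestamp directly, while
option 2 has to read every column in the row and filter afterwards.  As far
as I understand it, option 1 also depends on the column family being
configured to keep enough versions per cell.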

Thoughts?

On Tue, Mar 31, 2009 at 7:25 PM, Bradford Cross
<[email protected]> wrote:

> Greetings,
>
> I am prototyping a financial time series database on top of HBase and
> trying to wrap my head around what a good design would look like.
>
> As I understand it, I have rows, column families, columns and cells.
>
> Since the only thing that HBase really "indexes" is row keys, it seems
> natural in a way to represent the row keys as the date/time.
>
> As a simple example:
>
> Bar data:
>
> {
>    "2009/1/17" : {
>      "open":"100",
>      "high":"102",
>      "low":"99",
>      "close":"101",
>      "volume":"1000256"
>    }
> }
>
>
> Quote data:
>
> {
>    "2009/1/17:11:23:04" : {
>      "bid":"100.01",
>      "ask":"100.02",
>      "bidsize":"10000",
>      "asksize":"100200"
>    }
> }
>
> But there are many other issues to think about.
>
> In financial time series data we have small amounts of data within each
> "observation" and we can have lots of observations.  We can have millions of
> observations per time series (e.g. all historical trade and quote data for
> a particular stock since 1993) across hundreds of thousands of individual
> instruments (e.g. across all stocks that have traded since 1993).
>
> The write patterns fit HBase nicely, because it is a write once and append
> pattern.  This is followed by loads of offline processes for simulating
> trading models and such.  These query patterns look like "all quotes for all
> stocks between the dates of 1/1/1996 and 12/31/2008."  So the querying is
> typically across a date range, and we can further filter the query by
> instrument types.
>
> So I am not sure what makes sense for efficiency because I do not
> understand HBase well enough yet.
>
> What kinds of mixes of rows, column families, and columns should I be
> thinking about?
>
> Does my simplistic approach make any sense?  That would mean each row is a
> key-value pair where the key is the date/time and the value is the
> "observation."  I suppose this leads to a "table per time series" model.
> Does that make sense or is there overhead to having lots of tables?
>
