My apologies for resending. I just noticed that the formatting of the
data in my previous mail isn't too easy on the eye, so I fixed that:

Hi everyone,

I've just recently become aware of FastBit and have been toying around
with it for a while. I am very impressed with its performance on large
time series data sets.

I do have a question, though, about how to structure my data in order
to achieve the best performance. My data is to a very large extent
homogeneous: roughly 90% of the entire set spans the same number of
columns and represents high-frequency market order book data/messages.
These messages come in the form of either depth or trade snapshots,
with the former comprising 90% of the dataset.

For instance, a typical depth record would look like this:

ColumnNames:    Values:
TimeStamp       20100930063000100
TradingCode     ESH0
BidLevel1QTY    10
BidLevel1Price  1400
...     
AskLevel1QTY    13
AskLevel1Price  1402
...


A typical trade record would look like this:

ColumnNames:    Values:
TimeStamp       20100930063000100
TradingCode     ESH0
PriceQTY        10
Price           1402


I believe that gives me several possibilities to structure my set, e.g.:
1. Use a flat format similar to this:

ColumnNames:    Values(Depth):          Values(Trade):
TimeStamp       20100930063000100       20100930063000100
TradingCode     ESH0                    ESH0
TradeOrDepth    D                       T
PriceQty        0                       10
Price           0                       1402
BidLevel1QTY    10                      0
BidLevel1Price  1400                    0
...
AskLevel1QTY    13                      0
AskLevel1Price  1402                    0
...


2. Another option would be to use a relational format with IDs linking
the tables, e.g.:

Table 1:
ColumnNames:    Values:                 Values:
UniqueID        100                     101
TimeStamp       20100930063000100       20100930063000100
TradeOrDepth    D                       T

Table 2 (DepthsTrades table):
ColumnNames:    Values:                 Values:
ForeignID       100                     101
TradeQty        0                       10
TradePrice      0                       1402
BidPriceQty     10                      0
BidPrice        1400                    0
AskPriceQty     13                      0
AskPrice        1402                    0
Level           1                       0
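To make sure I'm thinking about the two options correctly, here is a toy
Python sketch of both layouts (all table and column names are just my
illustration, not FastBit API). The point is the query pattern each
implies: a trade query on option 1 is a single filtered scan, while the
same query on option 2 needs a join back through the ID column.

```python
# Toy model of the two layouts; this is an illustration only,
# not FastBit's actual API or storage format.

# Option 1: one flat table; columns unused by a record type are zero-filled.
flat = [
    {"TimeStamp": 20100930063000100, "TradingCode": "ESH0",
     "TradeOrDepth": "D", "PriceQty": 0, "Price": 0,
     "BidLevel1QTY": 10, "BidLevel1Price": 1400,
     "AskLevel1QTY": 13, "AskLevel1Price": 1402},
    {"TimeStamp": 20100930063000100, "TradingCode": "ESH0",
     "TradeOrDepth": "T", "PriceQty": 10, "Price": 1402,
     "BidLevel1QTY": 0, "BidLevel1Price": 0,
     "AskLevel1QTY": 0, "AskLevel1Price": 0},
]

# A trade query on the flat layout is a single filtered scan:
trades_flat = [r for r in flat
               if r["TradeOrDepth"] == "T" and r["Price"] > 1400]

# Option 2: two tables linked by an ID; the same query becomes a join.
table1 = {100: {"TimeStamp": 20100930063000100, "TradeOrDepth": "D"},
          101: {"TimeStamp": 20100930063000100, "TradeOrDepth": "T"}}
table2 = [{"ForeignID": 100, "TradeQty": 0, "TradePrice": 0,
           "BidPriceQty": 10, "BidPrice": 1400,
           "AskPriceQty": 13, "AskPrice": 1402, "Level": 1},
          {"ForeignID": 101, "TradeQty": 10, "TradePrice": 1402,
           "BidPriceQty": 0, "BidPrice": 0,
           "AskPriceQty": 0, "AskPrice": 0, "Level": 0}]

# Filter table 2, then look each hit up in table 1 (the join step):
trades_rel = [(table1[r["ForeignID"]], r) for r in table2
              if r["TradePrice"] > 1400]

print(len(trades_flat), len(trades_rel))  # -> 1 1
```

Both layouts answer the same question; the difference is that option 2
always pays the extra lookup per matching row.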


With option 1 I'd have quite a few redundancies, especially when the
data comprises more than one level for a single timestamp; up to 10
levels on both sides would be normal. Since it would affect only 10% of
the data, it wouldn't be too bad, though.

With option 2 I wouldn't have as many redundancies, but I would have to
join tables when querying for anything other than date-time ranges.
Would that be a more (time-)costly operation relative to option 1? I am
mostly interested in getting the best retrieval speeds; also, from what
I understand, FastBit's compression should handle a bunch of zeros
rather well in terms of storage size.
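To illustrate why I expect the zero-filled columns to compress well:
long runs of identical values collapse into a handful of run descriptors.
The sketch below is a toy run-length encoder, only an analogy for
FastBit's word-aligned bitmap compression, not its actual on-disk format.

```python
# Toy run-length encoder: long runs of identical values (e.g. the
# zero-filled trade columns in the flat layout) collapse to a few runs.
# This is an analogy for compressed bitmap indexes, not FastBit's format.

def rle(values):
    """Encode a sequence as [value, run_length] pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return out

# If ~90% of rows are depth records, a trade-only column is mostly zeros:
column = [0] * 9000 + [1402] + [0] * 999
encoded = rle(column)
print(len(column), len(encoded))  # -> 10000 3
```

So even though option 1 stores many zeros, the incremental storage cost
of those columns should stay small.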

I'd appreciate your thoughts.

Regards
Tobias


_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
