Hi everyone,

I've just recently become aware of FastBit and have been toying around 
with it for a while. I am very impressed with its performance on large 
time series data sets.

I have a question, though, about how to structure my data in order to achieve the best performance. Basically, my data is to a very large extent homogeneous: I'd say ~90% of the entire set spans the same number of columns and represents high-frequency market order book data/messages. These messages come either as depth snapshots or trade snapshots, with the former comprising 90% of the dataset.

For instance, a typical depth record would look like this:

ColumnNames:  TimeStamp,          TradingCode,  BidLevel1QTY,  BidLevel1Price, ..., AskLevel1QTY,  AskLevel1Price, ...
Values:       20100930063000123,  ESH0,         10,            1400,           ..., 13,            1402,           ...

A typical trade record would look like this:

ColumnNames:  TimeStamp,          TradingCode,  PriceQty,  Price
Values:       20100930063000123,  ESH0,         10,        1402
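
(For context, this is roughly how I've been loading such records so far, using the ibis::tablex interface as I understand it from the headers. The directory name, table name, and description string below are just made up for the example, and I haven't verified that this compiles.)

    #include <ibis.h>   // FastBit's main C++ header

    int main() {
        // Sketch only: build an in-memory table of trade records and
        // write it out as a FastBit data partition. Column names match
        // the trade record above; "trades-dir" is a made-up directory.
        ibis::tablex* tx = ibis::tablex::create();
        tx->addColumn("TimeStamp",   ibis::ULONG);    // e.g. 20100930063000123
        tx->addColumn("TradingCode", ibis::CATEGORY); // low-cardinality string
        tx->addColumn("PriceQty",    ibis::UINT);
        tx->addColumn("Price",       ibis::DOUBLE);

        // appendRow parses one delimited line in column order
        tx->appendRow("20100930063000123, ESH0, 10, 1402");

        tx->write("trades-dir", "trades", "HF trade snapshots");
        delete tx;
        return 0;
    }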

I believe that gives me several possibilities for structuring the set:
1. Use a flat format similar to this:

ColumnNames:    TimeStamp,          TradingCode,  TradeOrDepth,  PriceQty,  Price,  BidLevel1QTY,  BidLevel1Price, ..., AskLevel1QTY,  AskLevel1Price, ...
Values(Depth):  20100930063000123,  ESH0,         D,             0,         0,      10,            1400,           ..., 13,            1402,           ...
Values(Trade):  20100930063000123,  ESH0,         T,             10,        1402,   0,             0,              ..., 0,             0,              ...

2. Another option would be to use a relational format, with an ID column linking two tables:

Table 1:
ColumnNames:  UniqueID,  TimeStamp,          TradeOrDepth
Values:       100,       20100930063000123,  D
Values:       101,       20100930063000123,  T

Table 2 (DepthsTrades table):
ColumnNames:  ForeignID,  TradeQty,  TradePrice,  BidPriceQty,  BidPrice,  AskPriceQty,  AskPrice,  Level
Values:       100,        0,         0,           10,           1400,      13,           1402,      1
Values:       101,        10,        1402,        0,            0,         0,            0,         0

With option 1 I'd have quite a few redundancies, especially when the data comprises more than one level for a single timestamp; up to 10 levels on both sides would be normal. Since it would affect only 10% of the data, though, it wouldn't be too bad.
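
To make "best retrieval speed" concrete, the kind of query I'd run against the flat layout would be something along these lines (a rough, untested sketch against the C++ API as I read table.h; "flat-dir" is a made-up data directory):

    #include <ibis.h>
    #include <iostream>

    int main() {
        // Open the flat (option 1) data partition.
        ibis::table* tbl = ibis::table::create("flat-dir");

        // Depth snapshots for one instrument within a timestamp range.
        ibis::table* res =
            tbl->select("TimeStamp, BidLevel1QTY, BidLevel1Price",
                        "TradeOrDepth == 'D' and TradingCode == 'ESH0' and "
                        "TimeStamp between 20100930063000000 and 20100930070000000");
        if (res != 0) {
            std::cout << res->nRows() << " rows\n";
            res->dump(std::cout, ", "); // print the selected rows
            delete res;
        }
        delete tbl;
        return 0;
    }

I assume the same where clause would also work with the ibis command-line example program via its -q option.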

With option 2 I wouldn't have as many redundancies, but I would have to join tables when querying for anything other than timestamp ranges. Would that be a more (time-)costly operation than option 1? I am mostly interested in getting the best retrieval speeds; besides, from what I understand, FastBit's compression should handle long runs of zeros rather well in terms of storage size.
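
To clarify what I mean by joining in option 2: as far as I can tell from the documentation, FastBit's join support is limited, so I imagine I'd do it in two steps in my own code, roughly like this (again untested; the directory names are made up, and I'm assuming the getColumnAsULongs accessor behaves the way I think it does):

    #include <ibis.h>
    #include <cstdint>
    #include <iostream>
    #include <sstream>
    #include <vector>

    int main() {
        // Step 1: pick the UniqueIDs of trade records in a time range
        // from Table 1 ("table1-dir" and "table2-dir" are made up).
        ibis::table* t1 = ibis::table::create("table1-dir");
        ibis::table* ids =
            t1->select("UniqueID",
                       "TradeOrDepth == 'T' and "
                       "TimeStamp between 20100930063000000 and 20100930070000000");
        if (ids == 0 || ids->nRows() == 0) return 0; // (cleanup omitted here)

        // Copy the IDs out and build an explicit IN list for the second query.
        std::vector<uint64_t> id(ids->nRows());
        ids->getColumnAsULongs("UniqueID", &id[0]);
        std::ostringstream cond;
        cond << "ForeignID in (";
        for (size_t i = 0; i < id.size(); ++i)
            cond << (i ? ", " : "") << id[i];
        cond << ")";

        // Step 2: fetch the matching detail rows from Table 2.
        ibis::table* t2 = ibis::table::create("table2-dir");
        ibis::table* rows = t2->select("TradeQty, TradePrice", cond.str().c_str());
        if (rows != 0) {
            rows->dump(std::cout, ", ");
            delete rows;
        }
        delete ids;
        delete t1;
        delete t2;
        return 0;
    }

With many matching IDs that IN list could get long, which is partly why I suspect option 1 might win on retrieval speed despite the redundancy.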

I'd appreciate your thoughts.

Regards,
Tobias
