Hi everyone, I've just recently become aware of FastBit and have been toying around with it for a while. I am very impressed with its performance on large time series data sets.
I have a question, though, about how to structure my data in order to achieve the best performance. My data is to a very large extent homogeneous: roughly 90% of the set spans the same number of columns and represents high-frequency market order book data/messages. These messages come as either depth or trade snapshots, with the former making up 90% of the dataset.

A typical depth record would look like:

ColumnNames: TimeStamp, TradingCode, BidLevel1QTY, BidLevel1Price, ..., AskLevel1QTY, AskLevel1Price, ...
Values:      20100930063000123, ESH0, 10, 1400, 13, 1402

A typical trade record would look like:

ColumnNames: TimeStamp, TradingCode, PriceQTY, Price
Values:      20100930063000123, ESH0, 10, 1402

I believe that gives me several possibilities for structuring the set:

1. Use a flat format similar to this:

ColumnNames: TimeStamp, TradingCode, TradeOrDepth, PriceQty, Price, BidLevel1QTY, BidLevel1Price, ..., AskLevel1QTY, AskLevel1Price, ...
Values (Depth): 20100930063000123, ESH0, D, 0, 0, 10, 1400, 13, 1402
Values (Trade): 20100930063000123, ESH0, T, 10, 1402, 0, 0, 0, 0

2. Use a relational format with indexes:

Table 1:
ColumnNames: UniqueID, TimeStamp, TradeOrDepth
Values: 100, 20100930063000123, D
Values: 101, 20100930063000123, T

Table 2 (DepthsTrades table):
ColumnNames: ForeignID, TradeQty, TradePrice, BidPriceQty, BidPrice, AskPriceQty, AskPrice, Level
Values: 100, 0, 0, 10, 1400, 13, 1402, 1
Values: 101, 10, 1402, 0, 0, 0, 0, 0

With option 1 I'd have quite a few redundancies, especially when the data comprises more than one level for a single timestamp; up to 10 levels on both sides would be normal. Since it would affect only 10% of the data, it wouldn't be too bad, though. With option 2 I wouldn't have as many redundancies, but I would have to join tables when querying for anything other than date-time ranges. Would that be a more (time-)costly operation relative to option 1?
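To make the join question concrete, here is a tiny Python sketch of what I mean (purely illustrative, not FastBit API; the column layouts and sample values are the ones from my examples above). The same price query is answered from a single row in the flat layout, while the relational layout has to look each detail row back up in Table 1 via ForeignID to recover the timestamp:

```python
# Option 1: flat rows; a trade-price query reads the timestamp directly.
flat_rows = [
    # (TimeStamp, TradingCode, TradeOrDepth, PriceQty, Price, Bid/Ask fields...)
    ("20100930063000123", "ESH0", "D", 0, 0, 10, 1400, 13, 1402),
    ("20100930063000123", "ESH0", "T", 10, 1402, 0, 0, 0, 0),
]
flat_hits = [r[0] for r in flat_rows if r[2] == "T" and r[4] > 1400]

# Option 2: the same query must join Table 2 back to Table 1.
table1 = {  # UniqueID -> (TimeStamp, TradeOrDepth)
    100: ("20100930063000123", "D"),
    101: ("20100930063000123", "T"),
}
table2 = [
    # (ForeignID, TradeQty, TradePrice, BidPriceQty, BidPrice,
    #  AskPriceQty, AskPrice, Level)
    (100, 0, 0, 10, 1400, 13, 1402, 1),
    (101, 10, 1402, 0, 0, 0, 0, 0),
]
join_hits = [table1[r[0]][0] for r in table2
             if table1[r[0]][1] == "T" and r[2] > 1400]

# Both layouts give the same answer; option 2 pays one extra
# lookup per candidate row to recover the timestamp.
print(flat_hits == join_hits)
```

Of course this toy join is in memory; my actual question is how costly the equivalent operation would be in FastBit.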
I am mostly interested in getting the best retrieval speeds, and from what I understand, FastBit's compression would handle a bunch of zeros rather well in terms of storage size. I'd appreciate your thoughts.

Regards,
Tobias
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users
