My apologies for resending. I just noticed that the formatting of the data in my previous mail isn't too easy on the eye, so I fixed that:
Hi everyone,

I've just recently become aware of FastBit and have been toying around with it for a while. I am very impressed with its performance on large time series data sets. I have a question, though, about how to structure my data in order to achieve the best performance.

Basically my data is to a very large extent homogeneous. I'd say ~90% of the entire set spans the same number of columns and represents high-frequency market order book data/messages. These messages come either as depth or as trade snapshots, with the former comprising 90% of the dataset. For instance, a typical depth record would look like:

ColumnNames:      Values:
TimeStamp         20100930063000100
TradingCode       ESH0
BidLevel1QTY      10
BidLevel1Price    1400
...
AskLevel1QTY      13
AskLevel1Price    1402
...

A typical trade record would look like:

ColumnNames:      Values:
TimeStamp         20100930063000100
TradingCode       ESH0
PriceQTY          10
Price             1402

I believe that gives me several possibilities to structure my set, i.e.:

1. Use a flat format similar to this:

ColumnNames:      Values(Depth):       Values(Trade):
TimeStamp         20100930063000100    20100930063000100
TradingCode       ESH0                 ESH0
TradeOrDepth      D                    T
PriceQty          0                    10
Price             0                    1402
BidLevel1QTY      10                   0
BidLevel1Price    1400                 0
...
AskLevel1QTY      13                   0
AskLevel1Price    1402                 0
...

2. Another option would be to use a relational format with indexes, i.e.:

Table 1:

ColumnNames:      Values:              Values:
UniqueID          100                  101
TimeStamp         20100930063000100    20100930063000100
TradeOrDepth      D                    T

Table 2 (DepthsTrades table):

ColumnNames:      Values:              Values:
ForeignID         100                  101
TradeQty          0                    10
TradePrice        0                    1402
BidPriceQty       10                   0
BidPrice          1400                 0
AskPriceQty       13                   0
AskPrice          1402                 0
Level             1                    0

With option 1 I'd have quite a few redundancies, especially when the data comprises more than one level for a single timestamp; up to 10 levels on both sides would be normal. Since it'd only affect 10% of the data, it wouldn't be too bad though. With option 2 I wouldn't have as many redundancies, but I would have to join tables when querying for anything other than date/time ranges. Would that be a more (time-)costly operation relative to option 1? (I've sketched below how I imagine the two query paths would look.)

I am mostly interested in getting the best retrieval speeds; besides, from what I understand, FastBit's compression should handle a bunch of zeros rather well in terms of storage size.

I'd appreciate your thoughts.

Regards
Tobias
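P.S. To make the comparison concrete, here is roughly how I picture querying the flat layout of option 1 through the C++ API. This is only a sketch based on my reading of the examples that ship with FastBit; the directory name ("flatdir"), the column names and the exact WHERE syntax are my own assumptions, so please correct me if I've misread the API:

    #include <ibis.h>      // FastBit's top-level header
    #include <iostream>

    int main() {
        ibis::init();   // I assume this picks up the default RC file/parameters

        // "flatdir" is a made-up name for the directory holding the
        // option-1 flat table (one set of columns for depth and trades)
        ibis::table* tbl = ibis::table::create("flatdir");
        if (tbl == 0) return 1;

        // a single self-contained query: depth snapshots for one contract
        // over a timestamp range -- no join needed
        ibis::table* res =
            tbl->select("TimeStamp, BidLevel1Price, AskLevel1Price",
                        "TradingCode == 'ESH0' and TradeOrDepth == 'D' and "
                        "TimeStamp >= 20100930063000000 and "
                        "TimeStamp < 20100930070000000");
        if (res != 0) {
            std::cout << res->nRows() << " rows\n";
            res->dump(std::cout);   // print the selected columns
            delete res;
        }
        delete tbl;
        return 0;
    }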
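P.P.S. And this is how I imagine option 2 would have to work, since as far as I can tell FastBit wouldn't do the join for me: first pull the matching UniqueIDs out of Table 1, then turn them into an explicit IN-list condition against Table 2's ForeignID. Again just a sketch; the directory names ("table1dir", "table2dir") and the getColumnAsUInts call are my assumptions from skimming the table.h header:

    #include <ibis.h>
    #include <iostream>
    #include <sstream>
    #include <vector>

    int main() {
        ibis::init();

        // step 1: Table 1 answers the timestamp-range part of the query
        ibis::table* t1 = ibis::table::create("table1dir");
        if (t1 == 0) return 1;
        ibis::table* hits =
            t1->select("UniqueID",
                       "TradeOrDepth == 'D' and "
                       "TimeStamp >= 20100930063000000 and "
                       "TimeStamp < 20100930070000000");
        if (hits == 0 || hits->nRows() == 0) { delete hits; delete t1; return 0; }

        // step 2: copy the matching IDs out and build an IN-list condition
        std::vector<uint32_t> ids(hits->nRows());
        hits->getColumnAsUInts("UniqueID", &ids[0]);
        std::ostringstream cond;
        cond << "ForeignID in (";
        for (size_t i = 0; i < ids.size(); ++i)
            cond << (i ? ", " : "") << ids[i];
        cond << ")";

        // step 3: run the assembled condition against the detail table
        ibis::table* t2 = ibis::table::create("table2dir");
        if (t2 == 0) { delete hits; delete t1; return 1; }
        ibis::table* res = t2->select("BidPrice, AskPrice, Level",
                                      cond.str().c_str());
        if (res != 0) {
            res->dump(std::cout);
            delete res;
        }
        delete t2;
        delete hits;
        delete t1;
        return 0;
    }

If that's roughly right, option 2 means two separate lookups plus assembling a potentially very long IN-list in between, which is exactly the overhead I'm worried about compared to the single select in option 1.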
_______________________________________________
FastBit-users mailing list
[email protected]
https://hpcrdm.lbl.gov/cgi-bin/mailman/listinfo/fastbit-users