Hi,

Have you tried hashing all strings into numeric codes, which you can later
use for indexing (e.g. symbols), and naming the data blocks by date, for
example 2014-01-01.mat?
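
A minimal sketch of the string-to-code idea in Python/pandas (the symbol
values are placeholders):

import pandas as pd

# Sketch: turn symbol strings into integer codes once; store only the
# codes in HDF5 and persist the lookup table so the codes stay stable.
symbols = pd.Series(["AAPL", "MSFT", "AAPL", "GOOG"])   # placeholder data
codes, uniques = pd.factorize(symbols)

symbol_to_code = {s: i for i, s in enumerate(uniques)}  # lookup table
print(codes)           # -> [0 1 0 2]
print(symbol_to_code)  # -> {'AAPL': 0, 'MSFT': 1, 'GOOG': 2}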

Enabling chunking and setting the compression filter to a high level lets
you trade performance against space.
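
For example, with h5py (file name, shapes and chunk sizes are placeholders):

import h5py
import numpy as np

# Sketch: a chunked dataset with a high gzip level trades CPU time for
# space; the chunk shape should match the dominant access pattern.
with h5py.File("options.h5", "w") as f:
    dset = f.create_dataset(
        "prices",
        shape=(700000, 12),       # rows x columns, placeholder sizes
        dtype="f4",
        chunks=(10000, 12),       # the read/write unit on disk
        compression="gzip",
        compression_opts=9,       # 0 (fastest) .. 9 (smallest)
        shuffle=True,             # byte shuffle usually helps compression
    )
    dset[:10000] = np.random.rand(10000, 12).astype("f4")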

The data blocks should be laid out carefully: irregular series should be
treated differently from regular time series, as indexing the latter is a
breeze.
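
What makes the regular case a breeze: the row number of a timestamp is pure
arithmetic, no search needed. A tiny sketch (start time and interval are
placeholders):

from datetime import datetime, timedelta

# Sketch: for a series sampled every `step`, the row index of a
# timestamp is computed directly instead of looked up.
t0 = datetime(2014, 1, 1, 9, 30)   # series start, placeholder
step = timedelta(minutes=1)        # sampling interval, placeholder

def row_of(ts):
    return int((ts - t0).total_seconds() // step.total_seconds())

print(row_of(datetime(2014, 1, 1, 10, 0)))  # -> 30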

For irregular time series you may use an HDF5 custom data type with packet
tables, which allows high-speed serial access. For regular time series,
n-dimensional slabs of some data type fit better. As an example, for candle
data an instrument:time:OHLC layout can do well with a float type.
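
A sketch of that regular-series layout with h5py and a compound float type
(grid sizes and names are placeholders; the packet table API is a C-level
feature, so this shows only the slab idea):

import h5py
import numpy as np

# Sketch: one candle per (instrument, time) cell; slicing either axis
# serves "both directions". Chunking by row favors per-instrument reads.
candle = np.dtype([("open", "f4"), ("high", "f4"),
                   ("low", "f4"), ("close", "f4")])

n_instruments, n_times = 4000, 390   # placeholder grid size
with h5py.File("candles.h5", "w") as f:
    grid = f.create_dataset("ohlc", shape=(n_instruments, n_times),
                            dtype=candle, chunks=(1, n_times),
                            compression="gzip")
    day = np.zeros(n_times, dtype=candle)   # placeholder values
    day["close"] = 1.0
    grid[0, :] = day

    series = grid[0, :]["close"]   # time series of one instrument
    cross = grid[:, 100]["close"]  # cross-section at one time index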

My setup is somewhat different from yours, as it includes a parallel
filesystem + cluster + MPI + the parallel HDF5 C API. The data size is much
the same. Throughput and space usage are far superior to an SQL-based
approach.

Steve
On Oct 27, 2014 8:46 AM, "Dan E" <[email protected]> wrote:

> I have equity options historical data, currently in csv files (one file
> per day), that I'd like to store in one or multiple h5 files for efficient
> resource usage and fast access. One row of data is about a dozen columns
> (date, symbol name, options characteristics (maturity, exercise price,
> etc.), price (open/high/low/close), volume, plus some other info). One day
> is about 700,000 rows across 4,000 different symbols.
>
> Now here's the thing: I'd like to be able to quickly retrieve the data in
> "both directions", meaning:
>
> - for a specific day, all the options on a particular symbol or list of
> symbols;
> - for a specific option (or list of options), the time series of historical
> prices between two dates.
>
> What would be the ideal structure of the data? One file vs multiple files?
> Group hierarchy? Which Python modules would you recommend for accessing
> the data (ideally pandas)?
>
> I tried a first implementation: for each day I create a pandas DataFrame
> indexed by symbol name and store it in a separate h5 file under one root
> group, 'OPTIONS'. Then, to access the data for a specific symbol list on a
> specific day, I would do:
>
> import pandas as pd
>
> h5 = pd.HDFStore(filename)
> opt_df = h5['OPTIONS'].loc[symbol_list]
>
> I would then slice opt_df using masks to finally get the options I really
> want.
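
Those masks would look something like this (the column names are
hypothetical), and note they run only after the whole frame is already in
memory:

# Sketch: in-memory boolean masks over the already-loaded DataFrame;
# column names are hypothetical.
calls = opt_df[
    (opt_df["maturity"] == "2014-06-21")
    & (opt_df["exercise_price"] == 100.0)
    & (opt_df["option_type"] == "C")
]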
>
> The problem though is:
>
> - I usually need to access only a small fraction of the 4,000 symbols
> (typically a couple at a time), and with this method the whole daily file
> is loaded into a DataFrame, which causes significant overhead.
> - Although it is reasonably fast when I just need data for a specific day,
> it becomes extremely painful when I want to access a specific contract
> over a period of time, as I then have to load the file for each of the
> requested dates.
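
A table-format store queried with a where clause avoids loading the whole
frame. A minimal sketch, assuming the table was written with format='table'
and data_columns=['symbol', 'date'] as in the append sketch further down
(file name and tickers are placeholders):

import pandas as pd

# Sketch: select() pushes the filter down to PyTables, so only the
# matching rows are read from disk.
store = pd.HDFStore("options.h5")

# one day, a couple of symbols -- without loading all 700,000 rows:
day_df = store.select(
    "OPTIONS",
    where="symbol=['AAPL', 'MSFT'] & date=pd.Timestamp('2014-01-01')",
)

# the other direction: one symbol's history between two dates
hist_df = store.select(
    "OPTIONS",
    where="symbol='AAPL' & date>=pd.Timestamp('2014-01-01')"
          " & date<=pd.Timestamp('2014-06-30')",
)
store.close()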
>
> How would you implement this? Also keep in mind that I am constantly
> adding more data (every day).
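
For the daily adds, appending each day's csv to one table-format store
keeps a single queryable table. A minimal sketch (csv path, column names
and string width are placeholders):

import pandas as pd

# Sketch: ingest one day's csv and append it to a single growing table.
day = pd.read_csv("2014-01-01.csv", parse_dates=["date"])

store = pd.HDFStore("options.h5")
store.append(
    "OPTIONS",
    day,
    format="table",
    data_columns=["symbol", "date"],  # queryable on disk via 'where'
    min_itemsize={"symbol": 12},      # fixed string width across appends
)
store.close()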
>
> Thanks a lot for your help
>