“The best,” of course, depends very strongly on your use cases.

We’ve been using HDF5 for many years to store and access intraday market data; 
it is very effective when the number of ticks (i.e. trades, quotes, or book 
updates) per dataset is sufficiently high; otherwise you will drown in the 
overhead. This is easily achieved with the usual equity and futures data, but 
requires care with options data, as the majority of individual options series 
have very little data (something like an average of 10 trades/series/day for 
consolidated OPRA for those series with at least one trade on a given day).

In our experience (and for our particular usage), it is best to aggregate all 
the series for a given date and base ticker (or, almost equivalently, the 
underlying symbol) into a single HDF5 file, but to store the data ordered by 
series and then time. You can then store (in attributes) a map from the 
individual series ticker (OSI or whatever), or equivalently (expiration, 
putcall, strike), to an offset and extent within the single set of columns. 
Note you wouldn’t want to store the metadata about each series (symbol, 
expiration, putcall, strike) as data columns, but rather as attributes indexed 
the same way as the map.
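
For concreteness, here is a minimal sketch of that layout in Python/h5py, 
purely as illustration (the file name, tickers, and column names are made up, 
and a real writer would build the sorted columns from the day's feed rather 
than from literals):

    import h5py
    import numpy as np

    # One day's ticks for one base ticker, already sorted by (series, time).
    # Tickers, file name, and columns here are hypothetical.
    series  = np.array([b'SPX141122C02000000'] * 3 + [b'SPX141122P02000000'] * 2)
    time_ns = np.array([1, 2, 3, 4, 5], dtype='i8')
    price   = np.array([10.5, 10.6, 10.4, 9.9, 9.8])
    size    = np.array([1, 2, 1, 5, 3], dtype='i4')

    with h5py.File('SPX_20141024.h5', 'w') as f:
        g = f.create_group('ticks')
        g.create_dataset('time', data=time_ns)
        g.create_dataset('price', data=price)
        g.create_dataset('size', data=size)

        # Per-series index kept in attributes rather than data columns:
        # series ticker -> (offset, count) into the shared columns. Because
        # the rows are sorted by series, the first-occurrence index of each
        # ticker is its offset and the occurrence count is its extent. The
        # per-series metadata (expiration, putcall, strike) would be stored
        # the same way, indexed like this map.
        uniq, offsets, counts = np.unique(series, return_index=True,
                                          return_counts=True)
        g.attrs['series'] = uniq
        g.attrs['offset'] = offsets.astype('i8')
        g.attrs['count']  = counts.astype('i8')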

This sort of layout, combined with the reference data you keep outside HDF5 to 
find the right file for a given date and base symbol (or series symbol), lets 
you retrieve the data for all series for a base symbol, along with the 
indexing/attribute metadata, very efficiently. With a little bit of API work 
it also lets you pull the data out on a series-by-series basis, by 
constructing the appropriate hyperslabs from the offset/extent attributes 
mentioned above, when you are only after a small number of individual series.
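
Under the same made-up layout as the sketch above, pulling out a single series 
is just an attribute lookup plus a slice, which h5py executes as a hyperslab 
read:

    import h5py

    def read_series(path, osi):
        # Look up the series' (offset, count) in the attribute map and slice
        # each column; only that region of the file is read.
        with h5py.File(path, 'r') as f:
            g = f['ticks']
            i = list(g.attrs['series']).index(osi)
            lo = int(g.attrs['offset'][i])
            hi = lo + int(g.attrs['count'][i])
            return {name: g[name][lo:hi] for name in ('time', 'price', 'size')}

    puts = read_series('SPX_20141024.h5', b'SPX141122P02000000')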

In our benchmarking of this approach (accessing and processing the data from 
Java), we got to within 2-4x of KDB while performing the typical binning and 
joining-with-underlying operations, and achieved essentially the same level of 
storage efficiency.

I cannot speak to the Python side of your question. But this approach works 
well for us from C/C++/Java/Julia, and spares those who cannot stomach Q from 
having to use KDB!

stephen

From: Hdf-forum [mailto:[email protected]] On Behalf Of Dan E
Sent: Sunday, October 26, 2014 12:54 PM
To: [email protected]
Subject: [Hdf-forum] Best structure for an Hdf5 file to store historical data 
and how to access quickly from python


I have equity options historical data, currently in CSV files (one file per 
day), that I'd like to store in one or multiple h5 files for efficient 
resource usage and fast access. One row of data has about a dozen columns 
(date, symbol name, option characteristics (maturity, exercise price, etc.), 
price (open/high/low/close), volume, plus some other info). One day is about 
700,000 rows across 4,000 different symbols.

Now here's the thing: I'd like to be able to quickly retrieve the data in 
"both directions", meaning:
- request, for a specific day, all options on a particular symbol or list of symbols
- request, for a specific option (or option list), the time series of historical prices between two dates.

What would be the ideal structure of the data? One vs multiple files? Group 
hierarchy? Python modules you recommend for accessing the data? (ideally pandas)

I tried a first implementation: for each day I create a Pandas DataFrame 
indexed by symbol name and store it in a separate h5 file under one root 
group, 'OPTIONS'. Then, to access the data on a specific day for a specific 
symbol list, I would do:

import pandas as pd

h5 = pd.HDFStore(filename)
opt_df = h5['OPTIONS'].ix[symbol_list]  # reads the whole 'OPTIONS' frame, then selects the symbols

I would then slice opt_df using masks to finally get the options I really want.

The problem, though, is:
- I usually only need to access a small fraction of the 4,000 symbols 
(typically a couple at a time), and with this method the whole daily file is 
loaded into a DataFrame, which causes significant overhead.
- Although it is reasonably fast if I just need data for a specific day, it 
becomes extremely painful if I want to access a specific contract over a 
period of time, as I would then have to load the file for each of the 
requested dates.

How would you implement this? Also keep in mind that I am constantly adding 
more data (every day).

Thanks a lot for your help
