Hi Peter,

On Saturday 16 April 2011 08:08:33, Peter Vessenes wrote:
> Hi all,
>
> I am working on an energy trading system right now, and am looking at
> pytables as a way to store some large multidimensional arrays for use
> with numpy.
>
> The main data is stored as roughly 4,000 points * 24 hours * 365 days
> * 6 items * 10 years or so, and can be fit into an int16 if I'm
> willing to lose a little resolution.
>
> The algorithms are mostly vectorizable, but occasionally I need to
> iterate through a few million rows and do some math that I can't
> vectorize. The vectorized algorithms will hit pretty much every
> datapoint during backtesting.
>
> So, here's my question, which I can't seem to find help for on the
> pytables site -- what's the best way to store this data in pytables?
> Should I be creating a custom pytables-style data structure, or
> should I create a location in the pytables HDF5 file which stores a
> compressed numpy array, maybe one per year or so, maybe everything
> in one ginormous array?
IMO, it all boils down to your retrieval needs.  If you are always going
to retrieve data by *regular* slices, then a multidimensional array
should be enough.  For example, you can assign one dimension to items,
another to days, another to hours and the last one to points (for an
array of shape 6x365x24x4000), and then use a different leaf for each
year.  Retrieving the slice you are interested in is then just a matter
of using the advanced slicing capabilities.  For example, in order to
access a full day of points, say the 243rd of the year, you only have
to do:

    my_slice = year_ds[:,243]

and `my_slice` will be a 3-dimensional array with the desired data.

However, in case you want more flexibility for accessing data than
regular slices, many people find a Table object easier for selecting
data.  For example, instead of making use of dimensions to access data,
you can put a 'time' mark on every data point, as well as complement
the time-series data points with other data (for example, each of the
6 items in your data can become a column).  Then PyTables lets you do
things like:

    lim1, lim2 = ..., ...
    for row in table.where('(sin(col3) > lim1) & (time_start > lim2)'):
        my_var = row['col1']**2 + row['col2']

As you see, this is intrinsically more flexible than relying on a
multidimensional structure for getting slices, and it lets you access
column data very easily to operate on it further, should you want to.
However, there are drawbacks to this approach too: you need more space
for keeping the 'time' mark, and you need to walk the whole table just
to get the interesting slice.  The good news is that you can alleviate
these issues significantly.  For example, the need for additional space
is reduced by using compression (zlib, blosc or whatever supported
library), which is very effective for regular time-steps like the
above.  And PyTables Pro removes the need to walk the entire table by
using the OPSI indexing engine.  Add to this equation HDF5's compact
format, the ability to use multiple cores via numexpr (a complex
expression evaluator, which can optionally use Intel's MKL) and blosc
(a compressor), and you will get one of the best query performers for
your tabular data.

> My main focus right now is making sure that as much of the vector /
> matrix math as possible can hit numpy quickly and ideally use the
> multicore support there, or fall back to weave or inlined C if
> necessary; secondary is ease of importing more data into the system.

I would recommend using numexpr instead of weave because the former:

1) is multithreaded
2) does not need to be compiled (it comes with a JIT)
3) can make use of MKL for accelerating transcendental functions
   (sin, exp...)

In case you need more flexibility than numexpr provides, I'd go with
Cython, which is basically a language with Python syntax that can be
compiled for getting maximum speed.

> Right now, though, I'd just love some initial thoughts on best
> practices / trade-offs. Pitches on Pytables Pro also appreciated.

Well, I'll just say that PyTables Pro is being used in many different
fields, from Aeronautics to Drug Discovery to large Travel Agencies.
But I must say that it excels at handling time-series data, as you can
see in the UBS Financial Market Data Overview (go to
http://www.ubs.com/ and click on "Financial Market Data").  Needless to
say, it is PyTables Pro that is delivering the data to the chart
rendering process :)
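In case it helps you get started, here is a minimal sketch of both
layouts discussed above.  The file name 'energy.h5', the node names and
the col*/time_start columns are just placeholders for illustration, and
the spelling assumes the PyTables 2.x API:

    import numpy as np
    import tables
    import numexpr as ne

    # compression, as discussed above (blosc needs PyTables >= 2.2)
    filters = tables.Filters(complevel=5, complib='blosc')

    # PyTables 2.x names (openFile, createCArray, createTable);
    # later versions rename these to open_file, create_carray, etc.
    f = tables.openFile('energy.h5', mode='w')

    # Layout 1: one compressed 4-d array per year (items x days x hours x points)
    year_ds = f.createCArray(f.root, 'year_2010', tables.Int16Atom(),
                             shape=(6, 365, 24, 4000), filters=filters)
    one_day = np.ones((6, 24, 4000), dtype=np.int16)  # stand-in for real data
    year_ds[:, 243] = one_day                          # write day 243...
    my_slice = year_ds[:, 243]                         # ...and get it back as a 3-d array

    # multithreaded elementwise math on the slice with numexpr
    result = ne.evaluate('my_slice * 2.5 + 1')

    # Layout 2: a Table with an explicit time mark plus one column per item
    class Point(tables.IsDescription):
        time_start = tables.Float64Col()   # the 'time' mark for each data point
        col1 = tables.Int16Col()
        col2 = tables.Int16Col()
        col3 = tables.Int16Col()

    table = f.createTable(f.root, 'points', Point, filters=filters)
    row = table.row
    for i in range(1000):                  # toy rows, just to have something to query
        row['time_start'] = float(i)
        row['col1'] = i % 7
        row['col2'] = i % 5
        row['col3'] = i % 3
        row.append()
    table.flush()

    lim1, lim2 = 0.0, 500.0                # arbitrary limits for the query
    for r in table.where('(sin(col3) > lim1) & (time_start > lim2)'):
        my_var = r['col1']**2 + r['col2']

    f.close()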
Hope it helps,

--
Francesc Alted