Hi Peter,

On Saturday 16 April 2011 08:08:33, Peter Vessenes wrote:
> Hi all,
>
> I am working on an energy trading system right now, and am looking at
> pytables as a way to store some large multidimensional arrays for use
> with numpy.
>
> The main data is stored as roughly 4,000 points * 24 hours * 365 days
> * 6 items * 10 years or so, and can be fit into an int16 if I'm
> willing to lose a little resolution.
>
> The algorithms are mostly vectorizable, but occasionally I need to
> iterate through a few million rows and do some math that I can't
> vectorize. The vectorized algorithms will hit pretty much every
> datapoint during backtesting.
>
> So, here's my question, which I can't seem to find help for on the
> pytables site -- what's the best way to store this data in pytables?
> Should I be creating a custom pytables-style data structure, or
> should I create a location in the pytables HDF5 file which stores a
> compressed numpy array, maybe one per year or so, maybe everything
> in one ginormous array?
IMO, it all boils down to your retrieval needs.  If you are always going
to retrieve data by *regular* slices, then a multidimensional array
should be enough.  For example, you can assign one dimension to items,
another to days, another to hours and the last one to points (for an
array of shape 6x365x24x4000), and then use a different leaf for each
year.  Retrieving the slice you are interested in is then just a matter
of using the advanced slicing capabilities.  For example, in order to
access a full day of points, say the 243rd of the year, you only have
to do:

    my_slice = year_ds[:,243]

and `my_slice` will be a 3-dimensional array with the desired data.

However, in case you want more flexibility for accessing data than
regular slices, many people find a Table object easier for selecting
data.  For example, instead of making use of dimensions to access data,
you can put a 'time' mark on every data point, as well as complement
the time-series data points with other data (for example, each of the
6 items in your data can become a column).  Then PyTables lets you do
things like:

    lim1, lim2 = ..., ...
    for row in table.where('(sin(col3) > lim1) & (time_start > lim2)'):
        my_var = row['col1']**2 + row['col2']

As you see, this is intrinsically more flexible than relying on a
multidimensional structure for getting slices, and it lets you access
column data very easily to operate on it further, should you want to.
However, there are drawbacks to this approach too: you need more space
for keeping the 'time' mark, and you need to walk the whole table just
to get the interesting slice.  The good news is that you can alleviate
these issues significantly.  For example, the need for additional space
is reduced by using compression (zlib, blosc or whatever supported
library), which is very effective for regular time-steps like the
above.  And PyTables Pro removes the need to walk the entire table by
using the OPSI indexing engine.  Add to this equation HDF5's compact
format, the ability to use multiple cores via numexpr (a complex
expression evaluator, which can optionally use Intel's MKL) and blosc
(a compressor), and you will get one of the best query performers for
your tabular data.

> My main focus right now is making sure that as much of the vector /
> matrix math as possible can hit numpy quickly and ideally use the
> multicore support there, or fall back to weave or inlined C if
> necessary; secondary is ease of importing more data into the system.

I would recommend using numexpr instead of weave because the former:

1) is multithreaded
2) does not need to be compiled (it comes with a JIT)
3) can make use of MKL for accelerating transcendental functions
   (sin, exp...)

In case you need more flexibility than numexpr provides, I'd go with
Cython, which is basically a language with Python syntax that can be
compiled for getting maximum speed.

> Right now, though, I'd just love some initial thoughts on best
> practices / trade-offs. Pitches on Pytables Pro also appreciated.

Well, I'll just say that PyTables Pro is being used in many different
fields, from Aeronautics to Drug Discovery to large Travel Agencies.
But I must say that it excels at handling time-series data, as you can
see in the UBS Financial Market Data Overview (go to
http://www.ubs.com/ and click on "Financial Market Data").  Needless to
say, it is PyTables Pro that is delivering the data to the chart
rendering process :)
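In case it helps you get started, here is a minimal sketch of both
layouts discussed above.  The file name 'energy.h5', the node names and
the col*/time_start columns are just placeholders for illustration, and
the spelling assumes the PyTables 2.x API:

    import numpy as np
    import tables
    import numexpr as ne

    # compression, as discussed above (blosc needs PyTables >= 2.2)
    filters = tables.Filters(complevel=5, complib='blosc')

    # PyTables 2.x names (openFile, createCArray, createTable);
    # later versions rename these to open_file, create_carray, etc.
    f = tables.openFile('energy.h5', mode='w')

    # Layout 1: one compressed 4-d array per year (items x days x hours x points)
    year_ds = f.createCArray(f.root, 'year_2010', tables.Int16Atom(),
                             shape=(6, 365, 24, 4000), filters=filters)
    one_day = np.ones((6, 24, 4000), dtype=np.int16)  # stand-in for real data
    year_ds[:, 243] = one_day                          # write day 243...
    my_slice = year_ds[:, 243]                         # ...and get it back as a 3-d array

    # multithreaded elementwise math on the slice with numexpr
    result = ne.evaluate('my_slice * 2.5 + 1')

    # Layout 2: a Table with an explicit time mark plus one column per item
    class Point(tables.IsDescription):
        time_start = tables.Float64Col()   # the 'time' mark for each data point
        col1 = tables.Int16Col()
        col2 = tables.Int16Col()
        col3 = tables.Int16Col()

    table = f.createTable(f.root, 'points', Point, filters=filters)
    row = table.row
    for i in range(1000):                  # toy rows, just to have something to query
        row['time_start'] = float(i)
        row['col1'] = i % 7
        row['col2'] = i % 5
        row['col3'] = i % 3
        row.append()
    table.flush()

    lim1, lim2 = 0.0, 500.0                # arbitrary limits for the query
    for r in table.where('(sin(col3) > lim1) & (time_start > lim2)'):
        my_var = r['col1']**2 + r['col2']

    f.close()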
Hope it helps,

--
Francesc Alted