Hopefully an easy one for someone out there!

Am trying to move some of my existing data and calculations from a relational DB
into PyTables to speed things up. One of the problems is that the data has
irregular dates and I'm not sure where to do the processing.

For example:


Date         CompanyID   Measure1   Measure2   ...   Measure20
'20110912'   535         5.0        15
'20110911'   535         5.1        12
'20110910'   535         4.9        11
'20110909'   535         3.6        20
'20110912'   12          6.0        34
'20110911'   12          6.4        34
'20110910'   12          4.6        36
'20110909'   12          2.6        35

Date         CompanyID   Measure21   ...   Measure30
'20110901'   535         6.6
'20110812'   535         6.1
'20110901'   12          6.0
'20110814'   12          6.3

Date         CompanyID   Measure31   ...   Measure100
'20110911'   535         7.6
'20110818'   535         7.1
'20110903'   12          7.0
'20110818'   12          7.3

Usually I will be using the latest value for each measure until a newer one
presents itself (using scikits.timeseries to do the joins). While it's not
strictly true, for the purposes of this exercise assume the first table has
daily data with 10-20 measures, the second has monthly data with 5-10
measures, and the third has annual data with 50-70 measures.
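To be concrete about the join, this is the behaviour I mean - a minimal
sketch of the forward-fill idea in plain numpy (the helper name and the
dates-as-ints convention are just for illustration, it's not my actual
scikits.timeseries code):

    import numpy as np

    def forward_fill_to(daily_dates, sparse_dates, sparse_values):
        """Carry each sparse observation forward onto a denser date axis.

        All date arrays are sorted ints like 20110912; daily dates that
        fall before the first sparse observation come back as NaN.
        """
        # index of the latest sparse date <= each daily date
        idx = np.searchsorted(sparse_dates, daily_dates, side='right') - 1
        return np.where(idx >= 0, sparse_values[np.maximum(idx, 0)], np.nan)

    # e.g. monthly Measure21 for company 535 spread onto the daily axis
    daily = np.array([20110909, 20110910, 20110911, 20110912])
    m21_dates = np.array([20110812, 20110901])
    m21_vals = np.array([6.1, 6.6])
    print(forward_fill_to(daily, m21_dates, m21_vals))   # [6.6 6.6 6.6 6.6]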

The calculations include a lot of aggregations over different slices of
companies, plus a lot of statistical measures on the timeseries.



I am thinking my options are:


1. Store everything in one big table with minimal columns, i.e. CompanyID as
the first column and then an array per group of measures.

E.g. column 2 might hold Measures 1 and 2, so for company 535 column 2 would
be [['20110912',5.0,15], ['20110911',5.1,12], ['20110910',4.9,11],
['20110909',3.6,20]], filled with daily data. Column 3 might then be a much
smaller array filled with (approximately) monthly data, and column 4 smaller
again, filled with irregular (say annual) data.
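Roughly what I'm picturing for option 1, as a sketch using PyTables
VLArrays (the file name, node name and packing the dates as floats are
just assumptions to illustrate the shape of it):

    import numpy as np
    import tables

    h5 = tables.open_file('measures.h5', mode='w')   # hypothetical file
    # one variable-length array per measure group; row i belongs to
    # company_ids[i], so nothing ever needs padding
    atom = tables.Float64Atom(shape=(3,))            # [date, Measure1, Measure2]
    daily = h5.create_vlarray(h5.root, 'daily_m1_m2', atom,
                              "irregular daily rows per company")
    company_ids = [535, 12]
    daily.append(np.array([[20110912., 5.0, 15.],    # company 535's block
                           [20110911., 5.1, 12.],
                           [20110910., 4.9, 11.],
                           [20110909., 3.6, 20.]]))
    daily.append(np.array([[20110912., 6.0, 34.],    # company 12's block
                           [20110911., 6.4, 34.],
                           [20110910., 4.6, 36.],
                           [20110909., 2.6, 35.]]))
    block_535 = daily[0]                             # (4, 3) array back out
    h5.close()

The monthly and annual groups would just be two more VLArrays with their
own widths.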

2. Store everything in one huge table, but process the data as it goes into
the PyTables file so that rows are unique on (company, date), filling all the
empty spaces on the way in. My rough estimate is that this makes the database
about 50-100 times larger than the first option (maybe 1 TB vs 10-20 GB).
My final thought, though, is that since the annual data is the same for ~250
weekdays a year it should compress pretty well... haven't looked at
compression yet...
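I was going to test that hunch with something like this (a sketch assuming
the current PyTables API - the column names, row count and compressor
choice are placeholders):

    import tables

    class DenseRow(tables.IsDescription):
        date     = tables.Int32Col()    # yyyymmdd
        company  = tables.Int32Col()
        measure1 = tables.Float64Col()
        # ... measure2 through measure100 in the real table

    # blosc + shuffle ought to squeeze values that repeat for ~250
    # weekdays a year; benchmark zlib too before trusting the estimate
    filters = tables.Filters(complevel=5, complib='blosc', shuffle=True)
    h5 = tables.open_file('dense.h5', mode='w')
    table = h5.create_table(h5.root, 'measures', DenseRow,
                            filters=filters, expectedrows=50000000)
    row = table.row
    row['date'], row['company'], row['measure1'] = 20110912, 535, 5.0
    row.append()
    table.flush()
    h5.close()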

3. Store in multiple small tables and then join them at query time (I am
assuming this one is wrong because it's how I would do it in SQL!).
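i.e. per company, something like this sketch (the node and column names are
made up, and it reuses the forward_fill_to helper from the numpy sketch
above):

    import numpy as np

    def company_series(h5, cid):
        # pull one company's rows out of each table, then join in memory
        d = h5.root.daily.read_where('company == cid')    # cid is picked up
        m = h5.root.monthly.read_where('company == cid')  # from local scope
        dates = np.sort(d['date'])
        order = np.argsort(m['date'])
        # monthly Measure21 carried forward onto the daily date axis
        m21 = forward_fill_to(dates, m['date'][order], m['measure21'][order])
        return dates, d['measure1'][np.argsort(d['date'])], m21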

4. ?????

Thanks for any help - I'm hoping to get the data design right!

Damien