Re: [Pytables-users] Best Practice data design

Anthony Scopatz Tue, 13 Sep 2011 09:20:58 -0700

On Tue, Sep 13, 2011 at 7:56 AM, Damien Klassen <damien.klas...@gmail.com>wrote:


> Hopefully an easy one for someone out there!
>
> Am trying to transfer some of my existing calculations from a relational DB
> into pytables to speed up a range of calculations. One of the problems is
> that the data has irregular dates and I'm not sure where to do the
> processing.
>
> For example:
>
>
> Date            CompanyID     Measure1     Measure2 ..... Measure 20
> '20110912'         535              5.0                 15
> '20110911'         535              5.1                 12
> '20110910'         535              4.9                 11
> '20110909'         535              3.6                 20
> '20110912'         12                6.0                 34
> '20110911'         12                6.4                 34
> '20110910'         12                4.6                 36
> '20110909'         12                2.6                 35
>
> Date            CompanyID   Measure21 ........ Measure 30
> '20110901'         535                6.6
> '20110812'         535                6.1
> '20110901'         12                  6.0
> '20110814'         12                  6.3
>
> Date            CompanyID   Measure31 ...... Measure 100
> '20110911'         535                7.6
> '20110818'         535                7.1
> '20110903'         12                  7.0
> '20110818'         12                  7.3
>
> Usually I will be using the latest value for each measure until a newer one
> presents itself (using scikits.timeseries to do the joins). While its not
> strictly true, for the purpose of the exercise I think we could assume the
> first table has daily data with 10-20 measures, the second has monthly with
> 5-10 measures and the third has annual data with 50-70 measures.
>
> The calculations include a lot of aggregations with different splices of
> companies plus a lot of statistical measures on timeseries.
>
>
>
> I am thinking my options are:
>
>
> 1. Store in one big table with the minimum data i.e. CompanyID for my first
> column and then an array for each column:
>
> eg column 2 might be Measure1 and 2, so for the example with company 535
> column 2 would be  [['20110912',5.0,15],['20110911',5.1,12],[
> '20110910',4.9,11],['20110909',3.6,20]]  say filled with daily data. Then
> column 3 might be a much smaller array filled with (approximately) monthly
> data, column 4 would be smaller again filled with irregular data (say annual
> data)
>
> 2. Store in one huge table but process the data as it goes into the
> pytable so that the data is unique on company and the date and I fill all
> the empty spaces as it goes in. My rough estimate is that this will make my
> database about 50-100 times larger than the first option (maybe 1Tb vs
> 10-20Gb). Final thought though is that I am guessing that as the annual data
> is the same for 250 weekdays a year that it would compress pretty well...
> haven't looked at compression yet...
>
> 3. Store in multiple small tables and then try to join (I am assuming this
> one is wrong because its how I would do it in SQL!)
>
> 4 ?????
>
>
Hi Damien,

Are these measures always distinct between tables?  (ie you will never have
'Measure42' in more than one table.)  You are right in that option 2 will
have a lot of wasted space, which you probably want to avoid.

You could also store a table for each company and then have a MeasureID...

Company 5 Table
Time      MeasureID  Value
1              16               64.9
2               17              3.6

How you store the data with this kind of data set (and where you do the
joins) is really a function of what your most important use case is.  You
get the greatest advantage out of HDF5 when your data is structured and well
formed.  However, this is not always the case, and I have been known to use
some ideas like in option 3 ;).

Be Well
Anthony


> Thanks for any help - am hoping to get the data design right!
>
> Damien
>
>
>
>
>
> ------------------------------------------------------------------------------
> BlackBerry&reg; DevCon Americas, Oct. 18-20, San Francisco, CA
> Learn about the latest advances in developing for the
> BlackBerry&reg; mobile platform with sessions, labs & more.
> See new tools and technologies. Register for BlackBerry&reg; DevCon today!
> http://p.sf.net/sfu/rim-devcon-copy1
> _______________________________________________
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>
>

------------------------------------------------------------------------------
BlackBerry&reg; DevCon Americas, Oct. 18-20, San Francisco, CA
Learn about the latest advances in developing for the 
BlackBerry&reg; mobile platform with sessions, labs & more.
See new tools and technologies. Register for BlackBerry&reg; DevCon today!
http://p.sf.net/sfu/rim-devcon-copy1

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users

Re: [Pytables-users] Best Practice data design

Reply via email to