Thanks Anthony - yes the measures are distinct.

The two most common problems are: a) calculating moving averages, medians,
standard deviations and correlations across a time series for a single
company, and b) joining companies and calculating (say) the sum of
(Measure1 * Measure2) grouped by date, or the median, average and standard
deviation of Measure1 for a group of companies on a particular date.
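
For concreteness, the kind of thing I mean, as a rough NumPy/PyTables sketch
(the file, table and column names are made up, and dates are assumed to be
stored as integer yyyymmdd):

import numpy as np
import tables as tb

with tb.open_file("measures.h5", mode="r") as h5:
    daily = h5.root.daily  # hypothetical table: Date, CompanyID, Measure1..Measure20

    # a) time-series stats for a single company
    rows = np.sort(daily.read_where("CompanyID == 535"), order="Date")
    m1 = rows["Measure1"]
    window = 20
    moving_avg = np.convolve(m1, np.ones(window) / window, mode="valid")
    moving_std = np.array([m1[i:i + window].std()
                           for i in range(len(m1) - window + 1)])

    # b) cross-sectional aggregates for all companies on one date
    day = daily.read_where("Date == 20110912")
    sum_m1_x_m2 = (day["Measure1"] * day["Measure2"]).sum()
    m1_median = np.median(day["Measure1"])
    m1_std = day["Measure1"].std()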

I suppose the question comes down to whether pytables is more efficient
with:

   - more data by a factor of, say, 50-100x (plus a small overhead to create
   it), but fewer/simpler joins and largely vectorised calculations, or
   - less data, but more joins and calculations at read time (maybe a few
   more iterative calculations needed - sketched below)?
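
The "less data" route basically means applying the "latest value until a
newer one appears" rule at calculation time. In plain NumPy that fill is
roughly this (toy data only, dates as sorted integer yyyymmdd):

import numpy as np

daily_dates   = np.array([20110909, 20110910, 20110911, 20110912])
monthly_dates = np.array([20110812, 20110901])   # sparse series, e.g. Measure21
monthly_vals  = np.array([6.1, 6.6])

# index of the most recent sparse observation at or before each daily date
idx = np.searchsorted(monthly_dates, daily_dates, side="right") - 1
filled = np.where(idx >= 0, monthly_vals[np.clip(idx, 0, None)], np.nan)
# filled -> [6.6, 6.6, 6.6, 6.6]

So the question is really whether it is cheaper to do that lookup on every
read or to bake the filled values into the file once.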

Related to that is your suggestion of a table for every company. That would
sometimes require a join across, say, 1,000 tables, which (trapped in my SQL
mindset) I would never have considered - is that possible / efficient in
pytables?
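
Something like this is what I imagine the 1,000-table version looking like
(group, table and column names invented) - no SQL-style join, just a loop
over the nodes in a group:

import numpy as np
import tables as tb

with tb.open_file("companies.h5", mode="r") as h5:
    values = []
    # one table per company under a hypothetical /companies group
    for tbl in h5.iter_nodes(h5.root.companies, classname="Table"):
        rows = tbl.read_where("Date == 20110912")
        if len(rows):
            values.append(rows["Measure1"][0])
    print("median of Measure1 across companies:", np.median(values))

Is that roughly what you had in mind, and does it stay fast at that number
of tables?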

Thanks again for all your help

Damien


---------- Forwarded message ----------
From: Anthony Scopatz <scop...@gmail.com>
Date: Tue, 13 Sep 2011 11:20:09 -0500
Subject: Re: [Pytables-users] Best Practice data design
To: Discussion list for PyTables <pytables-users@lists.sourceforge.net>

On Tue, Sep 13, 2011 at 7:56 AM, Damien Klassen
<damien.klas...@gmail.com> wrote:

> Hopefully an easy one for someone out there!
>
> Am trying to transfer some of my existing calculations from a relational DB
> into pytables to speed up a range of calculations. One of the problems is
> that the data has irregular dates and I'm not sure where to do the
> processing.
>
> For example:
>
>
> Date          CompanyID   Measure1   Measure2 ..... Measure20
> '20110912'    535         5.0        15
> '20110911'    535         5.1        12
> '20110910'    535         4.9        11
> '20110909'    535         3.6        20
> '20110912'    12          6.0        34
> '20110911'    12          6.4        34
> '20110910'    12          4.6        36
> '20110909'    12          2.6        35
>
> Date          CompanyID   Measure21 ........ Measure30
> '20110901'    535         6.6
> '20110812'    535         6.1
> '20110901'    12          6.0
> '20110814'    12          6.3
>
> Date          CompanyID   Measure31 ...... Measure100
> '20110911'    535         7.6
> '20110818'    535         7.1
> '20110903'    12          7.0
> '20110818'    12          7.3
>
> Usually I will be using the latest value for each measure until a newer one
> presents itself (using scikits.timeseries to do the joins). While it's not
> strictly true, for the purpose of the exercise I think we could assume the
> first table has daily data with 10-20 measures, the second has monthly data
> with 5-10 measures and the third has annual data with 50-70 measures.
>
> The calculations include a lot of aggregations over different slices of
> companies, plus a lot of statistical measures on time series.
>
>
>
> I am thinking my options are:
>
>
> 1. Store in one big table with the minimum data, i.e. CompanyID for my
> first column and then an array for each column:
>
> e.g. column 2 might be Measure1 and 2, so for the example with company 535
> column 2 would be [['20110912',5.0,15],['20110911',5.1,12],
> ['20110910',4.9,11],['20110909',3.6,20]], say filled with daily data. Then
> column 3 might be a much smaller array filled with (approximately) monthly
> data, and column 4 would be smaller again, filled with irregular (say
> annual) data.
>
> 2. Store in one huge table but process the data as it goes into the
> pytable, so that the data is unique on company and date and I fill in all
> the empty spaces as it goes in. My rough estimate is that this will make
> my database about 50-100 times larger than the first option (maybe 1Tb vs
> 10-20Gb). A final thought, though: since the annual data is the same for
> 250 weekdays a year, I am guessing it would compress pretty well... I
> haven't looked at compression yet...
>
> 3. Store in multiple small tables and then try to join (I am assuming this
> one is wrong because it's how I would do it in SQL!)
>
> 4 ?????
>
>
Hi Damien,

Are these measures always distinct between tables? (i.e. you will never have
'Measure42' in more than one table.) You are right that option 2 will have a
lot of wasted space, which you probably want to avoid.

You could also store a table for each company and then have a MeasureID...

Company 5 Table
Time    MeasureID    Value
1       16           64.9
2       17           3.6
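
A minimal sketch of creating that kind of per-company table (all the names
here are invented, and the compression settings are only an example - worth
benchmarking, especially given how repetitive the annual data is):

import tables as tb

class Observation(tb.IsDescription):
    Date      = tb.Int64Col()   # or a plain integer time index
    MeasureID = tb.Int32Col()
    Value     = tb.Float64Col()

filters = tb.Filters(complevel=5, complib="zlib", shuffle=True)
with tb.open_file("per_company.h5", mode="w") as h5:
    grp = h5.create_group(h5.root, "companies")
    tbl = h5.create_table(grp, "c535", Observation, filters=filters)
    tbl.append([(20110912, 1, 5.0), (20110912, 2, 15.0)])
    tbl.flush()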

How you store the data with this kind of data set (and where you do the
joins) is really a function of what your most important use case is. You get
the greatest advantage out of HDF5 when your data is structured and
well-formed. However, this is not always the case, and I have been known to
use ideas along the lines of option 3 ;).

Be Well
Anthony


> Thanks for any help - am hoping to get the data design right!
>
> Damien