Hi Damien,

Unfortunately, it seems to me that you are in a situation where it is not
*really* possible to store and access your data with optimal efficiency.
This is because your data is ragged, so your job is to decide how to make
the trade-offs. Decide which parts of your data set are the most regular
or the most often accessed, and group those together. The rest can be
reached less frequently via joins and treated more or less as meta-data.
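
For example, the split might look something like this (just a sketch; the
table and measure names are made up, so adapt the columns to your real
data):

    import tables as tb

    class Daily(tb.IsDescription):
        # Regular, frequently accessed core: one row per company per day.
        date      = tb.StringCol(8)     # 'YYYYMMDD'
        companyid = tb.Int32Col()
        measure1  = tb.Float64Col()
        measure2  = tb.Float64Col()

    class Annual(tb.IsDescription):
        # Ragged, rarely touched data, treated more or less as meta-data
        # and joined in only when a calculation actually needs it.
        date      = tb.StringCol(8)
        companyid = tb.Int32Col()
        measure31 = tb.Float64Col()

    h5f = tb.openFile("measures.h5", "w")
    h5f.createTable("/", "daily", Daily)
    h5f.createTable("/", "annual", Annual)
    h5f.close()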

On Tue, Sep 13, 2011 at 8:24 PM, Damien Klassen <damien.klas...@gmail.com> wrote:

>
> Thanks Anthony - yes the measures are distinct.
>
> The two most common problems are: a) calculating moving averages,
> medians, standard deviations, and correlations across a time series for a
> single company, and b) joining companies and calculating (say) the sum of
> (Measure1 * Measure2) grouped by date, or (say) the median, average, and
> standard deviation of Measure1 for a group of companies on a particular
> date.
>
> I suppose the question comes down to whether pytables is more efficient
> with:
>
>    - more data by a factor of say 50-100x (plus assume a small overhead
>      to create the extra data), but then fewer/simpler joins and largely
>      vectorised calculations, or
>    - less data, but more joins and calculations (maybe a few more
>      iterative calculations needed)?
>
> Related to that is your suggestion of a table for every company. That
> would require a join (sometimes) across say 1,000 tables, which (trapped
> in my SQL mindset) I would never have considered - is that possible /
> efficient in pytables?
>

No, that isn't going to be efficient anywhere ;). But it might be faster
in PyTables / HDF5 than in SQL. I would go ahead and create test cases
using both methodologies and compare how they do on the operations you
care about (space used, timing of mean, std, etc.).
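
A crude harness along these lines (the file, table, and column names are
made up) is usually enough for that kind of comparison:

    import os
    import time
    import tables as tb

    def timed(label, func):
        # Crude wall-clock timing; good enough for comparing layouts.
        t0 = time.time()
        func()
        print("%s: %.3f s" % (label, time.time() - t0))

    h5f = tb.openFile("option1.h5", "r")   # made-up test file
    daily = h5f.root.daily

    # Time the operations you actually care about on each layout.
    timed("mean of measure1", lambda: daily.read()["measure1"].mean())
    timed("std of measure1", lambda: daily.read()["measure1"].std())

    # Space used is just the file size on disk.
    print("file size: %d bytes" % os.path.getsize("option1.h5"))
    h5f.close()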

One hint for option 1, the single giant table: if you are only accessing
a couple of columns at a time, use the table.cols.XXX interface. This
reads in only those columns of data, rather than reading all columns and
selecting out the ones you need.
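
For example (made-up table and column names again):

    import tables as tb

    h5f = tb.openFile("measures.h5", "r")
    table = h5f.root.daily

    # Read just the 'measure1' column into memory as a NumPy array,
    # instead of pulling in every column of every row.
    m1 = table.cols.measure1[:]
    print("mean = %f, std = %f" % (m1.mean(), m1.std()))

    h5f.close()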

If you do this comparison, please report back!  I'd love to know how it
ended up...

Be Well
Anthony


>
> Thanks again for all your help
>
> Damien
>
>
> ---------- Forwarded message ----------
> From: Anthony Scopatz <scop...@gmail.com>
> Date: Tue, 13 Sep 2011 11:20:09 -0500
> Subject: Re: [Pytables-users] Best Practice data design
> To: Discussion list for PyTables <pytables-users@lists.sourceforge.net>
>
>
> On Tue, Sep 13, 2011 at 7:56 AM, Damien Klassen <damien.klas...@gmail.com> wrote:
>
> > Hopefully an easy one for someone out there!
> >
> > I am trying to transfer some of my existing calculations from a
> > relational DB into pytables to speed up a range of calculations. One of
> > the problems is that the data has irregular dates, and I'm not sure
> > where to do the processing.
> >
> > For example:
> >
> >
> > Date          CompanyID   Measure1   Measure2   ...   Measure20
> > '20110912'    535         5.0        15
> > '20110911'    535         5.1        12
> > '20110910'    535         4.9        11
> > '20110909'    535         3.6        20
> > '20110912'    12          6.0        34
> > '20110911'    12          6.4        34
> > '20110910'    12          4.6        36
> > '20110909'    12          2.6        35
> >
> > Date          CompanyID   Measure21  ...   Measure30
> > '20110901'    535         6.6
> > '20110812'    535         6.1
> > '20110901'    12          6.0
> > '20110814'    12          6.3
> >
> > Date          CompanyID   Measure31  ...   Measure100
> > '20110911'    535         7.6
> > '20110818'    535         7.1
> > '20110903'    12          7.0
> > '20110818'    12          7.3
> >
> > Usually I will be using the latest value for each measure until a newer
> > one presents itself (using scikits.timeseries to do the joins). While
> > it's not strictly true, for the purposes of this exercise I think we
> > could assume the first table has daily data with 10-20 measures, the
> > second has monthly data with 5-10 measures, and the third has annual
> > data with 50-70 measures.
> >
> > The calculations include a lot of aggregations over different slices of
> > companies, plus a lot of statistical measures on time series.
> >
> >
> >
> > I am thinking my options are:
> >
> >
> > 1. Store in one big table with the minimum data, i.e. CompanyID for my
> > first column and then an array for each column:
> >
> > e.g. column 2 might be Measures 1 and 2, so for the example with
> > company 535, column 2 would be [['20110912',5.0,15],
> > ['20110911',5.1,12], ['20110910',4.9,11], ['20110909',3.6,20]], filled
> > with daily data. Then column 3 might be a much smaller array filled
> > with (approximately) monthly data, and column 4 would be smaller again,
> > filled with irregular (say annual) data.
> >
> > 2. Store in one huge table but process the data as it goes into the
> > pytable, so that the data is unique on company and date and all the
> > empty spaces are filled as it goes in. My rough estimate is that this
> > will make my database about 50-100 times larger than the first option
> > (maybe 1TB vs 10-20GB). A final thought, though: as the annual data is
> > the same for 250 weekdays a year, I am guessing it would compress
> > pretty well... haven't looked at compression yet...
> >
> > 3. Store in multiple small tables and then try to join them (I am
> > assuming this one is wrong because it's how I would do it in SQL!)
> >
> > 4 ?????
> >
> >
> Hi Damien,
>
> Are these measures always distinct between tables? (I.e., you will never
> have 'Measure42' in more than one table.) You are right that option 2
> will have a lot of wasted space, which you probably want to avoid.
>
> You could also store a table for each company and then have a MeasureID...
>
> Company 5 Table
> Time   MeasureID   Value
> 1      16          64.9
> 2      17          3.6
>
> How you store the data with this kind of data set (and where you do the
> joins) is really a function of what your most important use case is. You
> get the greatest advantage out of HDF5 when your data is structured and
> well formed. However, this is not always the case, and I have been known
> to use ideas like option 3 ;).
>
> Be Well
> Anthony
>
>
> > Thanks for any help - I am hoping to get the data design right!
> >
> > Damien