Hello Alvaro,
Thanks for your excitement!
On Thu, Mar 15, 2012 at 7:52 AM, Alvaro Tejero Cantero <alv...@minin.es> wrote:
> Hi everybody!
>
> I plan to start using PyTables for an application at the University of
> Oxford where data is collected in 2 GB sessions of Int16 data, organized
> as 64 parallel time series (64 detectors), each holding 15 million
> points (15M).
>
> I could handle these sessions separately, but ideally I would
> concatenate all of the sessions in a recording day, of which there are
> up to about 35. Thus I would have 70 GB of data to handle, to start
> with (i.e. before storing derivative data such as a band-pass filter
> over the median of detectors 1 to 4).
>
> The way I interact with the data is to select some of these 64
> channels and apply masks on them from logical conditions on the others
> (and potentially other generated columns of length 780x20,000). I
> sometimes average across channels, so I thought it would be better to
> store these 'sessions' as one large matrix of shape 64x15M instead of
> 64 independent columns. What do you think?
>
With these data sizes the overhead of columns is actually pretty minuscule.
Especially since you are going to be computing masks (probably from only
some of the columns), I would go ahead and just keep them in column form.
Storing the data as an NxM array means that you yourself need to keep
careful track of the column indices. It is easier not to worry about it.
> The next question is whether it makes a difference to operate on the
> 35 2 GB sessions separately, looping over them, as compared to merging
> them and having one long array of 64x(35x15M). What I like about the
> long array is getting rid of the arbitrary session boundaries and
> applying logical masks over one of the 64 channels for the full
> duration, i.e. over all 35 sessions concatenated.
>
This is really up to you and what makes sense in your mind. There are
advantages both ways. With the long array you don't have to do any array
copies (merging) in memory. With the split-up version, if there is an
error in one part, it doesn't take the whole calculation down with it. So
I would do what feels most natural.
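If you do go for the merged layout, a rough sketch with an extendable
EArray (again the 2.x-style API; file names and sizes are only
illustrative) would look like this:

    import numpy as np
    import tables

    h5file = tables.openFile('day.h5', mode='w')

    # 64 fixed channels; the time axis (the 0 in the shape) grows as
    # sessions are appended.
    allday = h5file.createEArray('/', 'allday', tables.Int16Atom(),
                                 shape=(64, 0),
                                 expectedrows=35 * 15 * 10**6)

    # Loop over the sessions of one day and append each one; 'session' is
    # a stand-in for the (64, 15M) block read from one 2 GB session file.
    for n in range(35):
        session = np.zeros((64, 15 * 10**6), dtype=np.int16)
        allday.append(session)
    allday.flush()

    # A mask over one channel for the whole day, ignoring session
    # boundaries (this pulls that single channel, ~1 GB, into memory).
    mask = allday[3, :] > 100

    h5file.close()

You may also want to experiment with the chunkshape argument of
createEArray so that slicing one channel over the full day does not have
to read chunks belonging to the other 63.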
I hope this helps!
Be Well
Anthony
> I'd be very grateful for any advice on data layout for this amount of data.
>
> Thank you for PyTables, Francesc and the new governance team,
>
> Álvaro.
>
>