Hi everybody!

I plan to start using PyTables for an application at the University of
Oxford where data is collected in sessions of 2 GB of Int16 data,
organized as 64 parallel time series (64 detectors), each holding 15
million points (15M).

I could handle these sessions separately, but ideally I would
concatenate all of the sessions in a recording day, of which there are
up to about 35. I would then have roughly 70 GB of data to handle, to
start with (i.e. before storing derivative data such as a band-pass
filter of the median of detectors 1 to 4).
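
To give a concrete (and entirely hypothetical) example of the kind of
derivative data I mean, this is roughly what I have in mind; the
sampling rate and the band edges below are made-up numbers:

    import numpy as np
    from scipy import signal

    def bandpass_of_median(session, fs=30000.0, lo=300.0, hi=3000.0):
        # session: Int16 array of shape (64, n_samples)
        med = np.median(session[0:4, :], axis=0)   # median of detectors 1 to 4
        nyq = fs / 2.0
        b, a = signal.butter(4, [lo / nyq, hi / nyq], btype='band')
        return signal.filtfilt(b, a, med)          # zero-phase band-pass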

The way I interact with the data is to select some of these 64
channels and apply masks on them derived from logical conditions on
the others (and potentially on other generated columns of length
780 x 20,000). I sometimes average across channels, so I thought it
would be better to store these 'sessions' as one large matrix of shape
64 x 15M instead of 64 independent columns. What do you think?
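
For reference, this is the sort of per-session layout I am considering
(the file name, chunkshape, compression settings and the condition are
just placeholders; I have not benchmarked anything yet):

    import tables as tb

    N = 15000000                                   # samples per channel
    with tb.open_file('session01.h5', 'w') as f:
        # one 64 x 15M Int16 matrix per session; row-wise chunks so that
        # reading a whole channel stays cheap
        data = f.create_carray(f.root, 'data', tb.Int16Atom(),
                               shape=(64, N), chunkshape=(1, 2**18),
                               filters=tb.Filters(complevel=5, complib='blosc'))
        # ... fill 'data' channel by channel or block by block ...

    with tb.open_file('session01.h5', 'r') as f:
        data = f.root.data
        ch3 = data[3, :]                           # one full channel
        mask = ch3 > 1000                          # logical condition on it
        selected = data[10, :][mask]               # masked values of channel 10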

The next question is whether it makes a difference to operate on the
35 sessions of 2 GB separately, looping over them, as compared to
merging them into one long array of 64 x (35 x 15M). What I like about
the long array is getting rid of the arbitrary session boundaries and
being able to apply logical masks over one of the 64 channels for the
full duration, i.e. over all 35 sessions concatenated.
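
Concretely, I imagine building the long array by appending each
session to an extendable array, along these lines (again only a
sketch; 'day.h5', the glob pattern and the threshold are placeholders):

    import glob
    import numpy as np
    import tables as tb

    session_paths = sorted(glob.glob('session*.h5'))   # the ~35 session files

    with tb.open_file('day.h5', 'w') as f:
        # extendable along the time axis: shape (64, 0), grown session by session
        day = f.create_earray(f.root, 'day', tb.Int16Atom(), shape=(64, 0),
                              expectedrows=35 * 15000000,
                              filters=tb.Filters(complevel=5, complib='blosc'))
        for path in session_paths:
            with tb.open_file(path, 'r') as s:
                day.append(s.root.data[:])          # or append in smaller blocks

    with tb.open_file('day.h5', 'r') as f:
        ch7 = f.root.day[7, :]                      # one channel, full duration
        mask = np.abs(ch7) > 500                    # mask over all 35 sessions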

I'd be very grateful for any advice on data layout for this amount of data.

Thank you for PyTables, Francesc and the new governance team,

Álvaro.
