Hi everybody! I plan to start using PyTables for an application at the University of Oxford where data is collected in sessions of 2 GB of Int16 data, organized as 64 parallel time series (64 detectors), each holding 15 million points (15M).
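For concreteness, this is a minimal sketch of how I imagine storing one session (just a sketch on my part: the file name, block size and compression settings are arbitrary choices, not something I have settled on):

    import numpy as np
    import tables

    N_CHANNELS = 64
    SESSION_LEN = 15_000_000   # ~15M samples per channel

    # One HDF5 file per session; compression settings are just a guess.
    with tables.open_file("session_01.h5", mode="w") as h5:
        filters = tables.Filters(complib="zlib", complevel=1, shuffle=True)
        # EArray extendable along the time axis, so data can be appended
        # block by block as it comes out of the acquisition files.
        data = h5.create_earray(
            h5.root, "raw",
            atom=tables.Int16Atom(),
            shape=(N_CHANNELS, 0),
            expectedrows=SESSION_LEN,
            filters=filters,
        )
        # Append in blocks of 1M samples per channel (random data here,
        # just to show the shape of each append).
        block = np.random.randint(-2**15, 2**15,
                                  size=(N_CHANNELS, 1_000_000),
                                  dtype=np.int16)
        data.append(block)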
I could handle these sessions separately, but ideally I would concatenate all of the sessions in a recording day, of which there are up to about 35. That gives me roughly 70 GB of data to handle, to start with (i.e. before storing derivative data such as a band-pass filter of the median of detectors 1 to 4).

The way I interact with the data is to select some of these 64 channels and apply masks on them derived from logical conditions on the others (and potentially from other generated columns of length 780 x 20,000). I sometimes average across channels, so I thought it would be better to store these 'sessions' as one large matrix of shape 64x15M instead of 64 independent columns. What do you think?

The next question is whether it makes a difference to operate on the 35 2 GB sessions separately, looping over them, as compared to merging them into one long array of 64x(35x15M). What I like about the long array is getting rid of the arbitrary session boundaries and being able to apply logical masks over one of the 64 channels for the full duration, i.e. over all 35 sessions concatenated (see the sketch after my signature).

I'd be very grateful for any advice on data layout for this amount of data.

Thank you for PyTables, Francesc and the new governance team,

Álvaro
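P.S. In case it helps to make the second question concrete, this is roughly what I have in mind for the concatenated layout and the masking. Again only a sketch: the file names, the copy block size and the threshold are made up, and I have not benchmarked any of it.

    import glob
    import numpy as np
    import tables

    N_CHANNELS = 64

    # Merge all sessions of one day into a single extendable array.
    with tables.open_file("day_2012_03_15.h5", mode="w") as h5:
        day = h5.create_earray(h5.root, "raw",
                               atom=tables.Int16Atom(),
                               shape=(N_CHANNELS, 0),
                               expectedrows=35 * 15_000_000)
        for session_file in sorted(glob.glob("session_*.h5")):
            with tables.open_file(session_file, mode="r") as s:
                src = s.root.raw
                # Copy in blocks to keep memory use bounded.
                step = 1_000_000
                for start in range(0, src.shape[1], step):
                    day.append(src[:, start:start + step])

    # Later: mask channel 3 wherever channel 0 exceeds a (made-up)
    # threshold, reading only the two channels involved rather than
    # the whole 70 GB.
    with tables.open_file("day_2012_03_15.h5", mode="r") as h5:
        raw = h5.root.raw
        ref = raw[0, :]        # one full channel, ~1 GB of int16 in memory
        mask = ref > 1000      # logical condition on the reference channel
        masked_mean = raw[3, :][mask].mean()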