Hi! Thanks for the prompt answer. Actually, I am not clear about switching from an NxM array to N columns (64 in my case). How do I make a rectangular selection with columns? With an NxM array I just do arr[10000:20000, 1:4] to select columns 1, 2, 3 and time samples 10000 to 20000. While it is easy to manipulate integer indices, from what I've read, columns have to have string identifiers, so I would be doing a lot of int<->str conversion?
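(For the archives, a minimal sketch of the kind of selection being discussed, assuming the data lives in a chunked PyTables CArray — the file and node names here are made up. Plain integer slicing works exactly as with a NumPy array, no string identifiers needed, and a column read this way comes back as an ordinary NumPy array:)

```python
import numpy as np
import tables as tb

# Small stand-in for the 64-channel recording: 1000 samples x 64 channels.
data = (np.arange(64 * 1000) % 1000).astype(np.int16).reshape(1000, 64)

with tb.open_file("demo.h5", "w") as f:
    # A chunked (and here compressed) on-disk array.
    arr = f.create_carray("/", "recording", obj=data,
                          filters=tb.Filters(complevel=5, complib="blosc"))

    # Rectangular selection with plain integer indices, as in NumPy:
    block = arr[100:200, 1:4]      # samples 100..199, channels 1, 2, 3

    # A column read this way is a plain NumPy array, so it can serve
    # directly as a boolean mask over another channel:
    mask = arr[:, 3] > 500
    ch1_hits = arr[:, 1][mask]
```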
My recorded data is so homogeneous (a huge 70 GB matrix of integers) that I am a bit lost among all the mixed typing that seems to be the primary use case behind columns. If I were to stay with the NxM array, would reads still be chunked?

On the other hand, if I want to use arr[:,3] as a mask for another part of the array, is it more reasonable, in PyTables terms, to have that be col3?

I also get shivers when I see the looping constructs in the tutorial, mainly because I have learned to do only vectorized operations in NumPy and never ever to write a Python loop / list comprehension.

Sorry for the newbie questions; feel free to point me to the FM (but see next email),

-á.

On Thu, Mar 15, 2012 at 17:57, Anthony Scopatz <scop...@gmail.com> wrote:
> Hello Alvaro,
>
> Thanks for your excitement!
>
> On Thu, Mar 15, 2012 at 7:52 AM, Alvaro Tejero Cantero <alv...@minin.es>
> wrote:
>>
>> Hi everybody!
>>
>> I plan to start using PyTables for an application at the University of
>> Oxford where data is collected in sessions of 2 GB of Int16 data, organized
>> as 64 parallel time series (64 detectors), each holding 15 million
>> points (15M).
>>
>> I could handle these sessions separately, but ideally I would
>> concatenate all of the sessions in a recording day, of which there are up
>> to about 35. Thus I would have 70 GB of data to handle, to start with (i.e.
>> before storing derivative data such as a band-pass filter over the
>> median of detectors 1 to 4).
>>
>> The way I interact with the data is to select some of these 64
>> channels and apply masks on them from logical conditions on the others
>> (and potentially other generated columns of length 780x20.000). I
>> sometimes average across channels, so I thought it is better to store
>> these 'sessions' as one large matrix of shape 64x15M instead of 64
>> independent columns. What do you think?
>
> With these data sizes the overhead of columns is actually pretty minuscule.
> Especially since you are going to be computing masks (probably from only
> some of the columns), I would go ahead and just keep them in column form.
> Storing the data as an NxM array means that you yourself need to keep
> careful track of the column indexes. It is easier not to have to worry
> about it.
>
>> The next question is whether it has an impact to operate on the 35 2 GB
>> sessions separately, looping over them, as compared to merging them and
>> having one long array of 64x(35x15M). What I like about the long array is
>> getting rid of the arbitrary session boundaries and applying logical
>> masks over one of the 64 channels for the full duration, i.e. over all 35
>> sessions concatenated.
>
> This is really up to you and what makes sense in your mind. There are
> advantages both ways. With the long array you don't have to do any array
> copies (merging) in memory. With the split-up version, if there is an
> error in one part, it doesn't take the whole calculation down with it. So
> I would do what feels most natural.
>
> I hope this helps!
>
> Be Well
> Anthony
>
>> I'd be very grateful for any advice on data layout for this amount of
>> data.
>>
>> Thank you for PyTables, Francesc and the new governance team,
>>
>> Álvaro.
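(Again for the archives, a sketch of the column-form alternative Anthony suggests, on a hypothetical table with named Int16 channel columns — only 3 channels here instead of the real 64, and all names are made up. The point is that a Table query via read_where is evaluated in-kernel by numexpr, so no Python-level loop is needed even though the data is in columns:)

```python
import numpy as np
import tables as tb

# Hypothetical row description: one Int16 column per channel.
class Frame(tb.IsDescription):
    ch0 = tb.Int16Col()
    ch1 = tb.Int16Col()
    ch2 = tb.Int16Col()

with tb.open_file("table_demo.h5", "w") as f:
    t = f.create_table("/", "recording", Frame)

    # Bulk-append a structured NumPy array (no per-row Python loop).
    rows = np.zeros(1000, dtype=t.dtype)
    rows["ch0"] = np.arange(1000)
    rows["ch1"] = np.arange(1000) * 2
    t.append(rows)
    t.flush()

    # In-kernel query: the condition is evaluated in compiled code,
    # and the result is a NumPy record array of the matching rows.
    hits = t.read_where("ch0 > 990")
    masked = hits["ch1"]       # channel 1 wherever channel 0 > 990
```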
>>
>> ------------------------------------------------------------------------------
>> This SF email is sponsored by:
>> Try Windows Azure free for 90 days Click Here
>> http://p.sf.net/sfu/sfd2d-msazure
>> _______________________________________________
>> Pytables-users mailing list
>> Pytables-users@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/pytables-users