Hi!

Thanks for the prompt answer. Actually, I am not clear about switching
from an NxM array to N columns (64 in my case). How do I make a
rectangular selection with columns? With an NxM array I just do
arr[10000:20000, 1:4] to select columns 1, 2, 3 and time samples 10000
to 20000. While integer indices are easy to manipulate, from what I've
read columns have to have string identifiers, so wouldn't I end up
doing a lot of int<->str conversion?

My recorded data is so homogeneous (one huge 70 GB matrix of integers)
that I am a bit lost among all the mixed typing that seems to be the
primary use case behind columns. If I were to stay with the NxM array,
would reads still be chunked?

On the other hand, if I want to use arr[:, 3] as a mask for another
part of the array, is it more reasonable, in PyTables terms, to have
that be col3?
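To make the question concrete, this is the masking pattern I mean (toy
NumPy data, channel numbers picked arbitrarily); as far as I can tell
the two layouts give identical results, so I am really asking which is
more natural for PyTables:

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(-100, 100, size=(1000, 64), dtype=np.int16)

# NxM style: column 3 masks the rows, then read column 7 at those rows.
selected = arr[arr[:, 3] > 0, 7]

# Column style: the same operation on the 1-D arrays a Table would hand back.
col3, col7 = arr[:, 3].copy(), arr[:, 7].copy()
selected_cols = col7[col3 > 0]

print((selected == selected_cols).all())  # True
```

(I gather a Table could even evaluate such a condition in-kernel, e.g.
with Table.where('ch03 > 0'), without me loading the column at all --
but I may be misreading that part of the manual.)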

I also get shivers when I see the looping constructs in the tutorial,
mainly because I have learned to do only vectorized operations in
NumPy and never to write a Python loop or list comprehension.
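For instance, I would hope the chunked style still means a Python loop
per block rather than per sample, with ordinary vectorized NumPy inside
each block. A toy in-memory sketch of what I have in mind (standing in
for an on-disk array; block size 100 is arbitrary):

```python
import numpy as np

# Toy (channels x samples) array; values kept small enough for Int16.
arr = np.arange(64 * 500, dtype=np.int16).reshape(64, 500)

# Pure vectorized reduction over all samples at once:
full = arr[1:5].mean(axis=0)

# Chunked version: the Python loop runs once per 100-sample block,
# and each block is reduced with the same vectorized expression.
chunks = [arr[1:5, i:i + 100].mean(axis=0) for i in range(0, 500, 100)]
chunked = np.concatenate(chunks)

print(np.allclose(full, chunked))  # True
```

Is that roughly what the tutorial loops amount to? (I have also seen
tables.Expr mentioned for evaluating whole-array expressions out of
core, which sounds like what I want.)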

Sorry for the newbie questions; feel free to point me to the FM (but
see next email),

-á.



On Thu, Mar 15, 2012 at 17:57, Anthony Scopatz <scop...@gmail.com> wrote:
> Hello Alvaro,
>
> Thanks for your excitement!
>
> On Thu, Mar 15, 2012 at 7:52 AM, Alvaro Tejero Cantero <alv...@minin.es>
> wrote:
>>
>> Hi everybody!
>>
>> I plan to start using PyTables for an application at the University of
>> Oxford where data is collected in 2 GB sessions of Int16 data organized
>> as 64 parallel time series (64 detectors), each holding 15 million
>> points (15M).
>>
>> I could handle these sessions separately, but ideally I would
>> concatenate all of the sessions in a recording day, of which there
>> are up to about 35. Thus I would have 70 GB of data to handle, to
>> start with (i.e. before storing derivative data such as a band-pass
>> filter over the median of detectors 1 to 4).
>>
>> The way I interact with the data is to select some of these 64
>> channels and apply masks on them from logical conditions on the
>> others (and potentially other generated columns of length
>> 780x20.000). I sometimes average across channels, so I thought it
>> would be better to store these 'sessions' as one large matrix of
>> shape 64x15M instead of 64 independent columns. What do you think?
>
>
> With these data sizes the overhead of columns is actually pretty minuscule.
> Especially since you are going to be computing masks (probably from only
> some of the columns), I would go ahead and just keep them in column form.
> Storing the data as an NxM array means that you yourself need to keep
> careful track of the column indices.  It is easier not to worry about it.
>
>>
>> The next question is whether it has an impact to operate on the 35
>> 2 GB sessions separately, looping over them, as compared to merging
>> them into one long array of 64x(35x15M). What I like about the long
>> array is getting rid of the arbitrary session boundaries and being
>> able to apply logical masks over one of the 64 channels for the full
>> duration, i.e. over all 35 sessions concatenated.
>
>
> This is really up to you and what makes sense in your mind.  There are
> advantages both ways.  With the long array you don't have to do any
> array copies (merging) in memory.  With the split-up version, if there
> is an error in one part, it doesn't take the whole calculation down
> with it.  So I would do what feels most natural.
>
> I hope this helps!
>
> Be Well
> Anthony
>
>> I'd be very grateful for any advice of data layout for this amount of
>> data.
>>
>> Thank you for Pytables, Francesc and the new governance team,
>>
>> Álvaro.
>>
>>
>> ------------------------------------------------------------------------------
>> This SF email is sponsored by:
>> Try Windows Azure free for 90 days Click Here
>> http://p.sf.net/sfu/sfd2d-msazure
>> _______________________________________________
>> Pytables-users mailing list
>> Pytables-users@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/pytables-users
>
>
>
>

