Hello Alvaro,
On Thu, Mar 15, 2012 at 1:20 PM, Alvaro Tejero Cantero <alv...@minin.es> wrote:
> Hi!
>
> Thanks for the prompt answer. Actually I am not clear about switching
> from NxM array to N columns (64 in my case). How do I make a
> rectangular selection with columns? With an NxM array I just have to
> do arr[10000:20000,1:4] to select columns 1,2,3 and time samples 10000
> to 20000.
Tables are really a 1D array of C structs. They are basically equivalent in
many ways to numpy structured arrays:
http://docs.scipy.org/doc/numpy/user/basics.rec.html
So there is no analogue of the 2D slice that you mention above.
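For example, here is a minimal numpy sketch of the structured-array view of
things; the field names (ch0..ch3) are just placeholders:

```python
import numpy as np

# A structured array is a 1D array of records, which is essentially the
# layout a PyTables Table uses on disk.
dt = np.dtype([('ch0', np.int16), ('ch1', np.int16),
               ('ch2', np.int16), ('ch3', np.int16)])
rec = np.zeros(30000, dtype=dt)

# There is no rec[10000:20000, 1:4]; instead you select fields by name
# and then slice along the single (row) dimension:
sub = rec[['ch1', 'ch2', 'ch3']][10000:20000]
print(sub.shape)        # (10000,) -- still 1D, three fields per record
```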
> While it is easy to manipulate integer indices, from what
> I've read columns would have to have string identifiers so I would be
> doing a lot of int<>str conversion?
>
No, you don't do a lot of str <-> int conversions. The strings represent
field names and only incidentally act as indexes.
> My recorded data is so homogeneous (a huge 70Gb matrix of integers)
> that I am a bit lost among all the mixed-typing that seems to be the
> primary use-case behind columns. If I were to stay with the NxM array,
> would reads still be chunked?
>
You would need to use the CArray (chunked array) or the EArray (extensible
array) for the underlying array on disk to be chunked. Reading can always
be chunked by accessing a slice. This is true for all arrays and tables.
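For instance, here is a sketch of an extensible chunked array; the file and
node names are made up for illustration:

```python
import numpy as np
import tables

# EArray: chunked on disk and extensible along the first (time) axis.
with tables.open_file('session.h5', mode='w') as f:
    earr = f.create_earray(f.root, 'data',
                           atom=tables.Int16Atom(),
                           shape=(0, 64))   # 0 marks the extensible axis
    earr.append(np.zeros((15000, 64), dtype=np.int16))

    # A slice read only touches the chunks that overlap it:
    block = earr[10000:12000, 1:4]

print(block.shape)      # (2000, 3)
```

CArray works the same way when the total shape is known up front.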
> On the other hand, if I want to use arr[:,3] as a mask for another
> part of the array, is it more reasonable to have that be col3, in
> terms of pytables?
>
Reasonable is probably the wrong word here. It is more that tables
do it one way and arrays do it another. If you are doing a lot of
single-column-at-a-time access, then you should think about using
Tables for this.
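A quick sketch of single-column access (table and column names hypothetical):

```python
import numpy as np
import tables

dt = np.dtype([('ch3', np.int16), ('ch4', np.int16)])

with tables.open_file('mask.h5', mode='w') as f:
    t = f.create_table(f.root, 'recording', description=dt)
    t.append(np.zeros(1000, dtype=dt))

    # One column comes back as a plain 1D numpy array, ready to use
    # as a boolean mask elsewhere:
    col3 = t.cols.ch3[:]
    mask = col3 > 0

print(mask.shape)       # (1000,)
```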
> I also get shivers when I see the looping constructs in the tutorial,
> mainly because I have learn to do only vectorized operations in numpy
> and never ever to write a python loop / list comprehension.
>
Ah, so you have to understand which operations happen on
the file and which happen on data already in memory. With numpy you don't
want to use Python loops because everything is already in memory.
However, with PyTables most of what you are doing is pulling data from
disk *into* memory, so the Python loop overhead is small relative to the
communication time of RAM <-> disk.
Most of the loops in PyTables are actually evaluated using numexpr
iterators. Numexpr is a highly optimized way of evaluating numerical
expressions. In short, you probably don't need to worry too much
about Python loops (when you are new to the library) when operating
on PyTables objects. You do need to worry about such loops on the
numpy arrays that the PyTables objects return.
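For example, Table.where compiles the condition string with numexpr and
evaluates it chunk by chunk on disk, so only matching rows ever reach the
Python loop (the names here are illustrative):

```python
import numpy as np
import tables

dt = np.dtype([('ch1', np.int16), ('ch2', np.int16)])

with tables.open_file('query.h5', mode='w') as f:
    t = f.create_table(f.root, 'rec', description=dt)
    data = np.zeros(1000, dtype=dt)
    data['ch1'][::10] = 5           # every 10th sample crosses the threshold
    t.append(data)

    # In-kernel query: the loop body only sees rows where ch1 > 3.
    hits = [row['ch2'] for row in t.where('ch1 > 3')]

print(len(hits))    # 100
```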
Be Well
Anthony
>
> Sorry for the newbie questions, feel free to point me to the FM (but
> see next email),
>
> -á.
>
>
>
> On Thu, Mar 15, 2012 at 17:57, Anthony Scopatz <scop...@gmail.com> wrote:
> > Hello Alvaro,
> >
> > Thanks for your excitement!
> >
> > On Thu, Mar 15, 2012 at 7:52 AM, Alvaro Tejero Cantero <alv...@minin.es>
> > wrote:
> >>
> >> Hi everybody!
> >>
> >> I plan to start using PyTables for an application at the University of
> >> Oxford where data is collected in sessions of 2Gb Int16 data organized
> >> as 64 parallel time series (64 detectors), each holding 15 million
> >> points (15M).
> >>
> >> I could handle these sessions separately, but ideally I would
> >> concatenate all of the sessions in a recording day, which are about up
> >> to 35. Thus I would have 70Gb of data to handle, to start with (i.e.
> >> before storing derivative data such as a band-pass filter over of the
> >> median of detectors 1 to 4).
> >>
> >> The way I interact with the data is to select some of these 64
> >> channels and apply masks on them from logical conditions on the others
> >> (and potentially other generated columns of length 780x20.000) I
> >> sometime average across channels, so I thought it is better to store
> >> these 'sessions' as one large matrix of shape 64x15M instead of 64
> >> independent columns. What do you think?
> >
> >
> > With these data sizes the overhead of columns is actually
> > pretty minuscule.
> > Especially since you are going to be computing masks (probably from only
> > some of the columns), I would go ahead and just keep them in column form.
> > Storing the data as an NxM array means that you yourself need to keep
> > careful
> > track of the column indexes. It is easier to not worry about it.
> >
> >>
> >> The next question is whether it has an impact to operate on 35 2Gb
> >> sessions separately looping over them as compared to merging them and
> >> having one long array of 64x(35x15M). What I like of the long array is
> >> getting rid of the arbitrary session boundaries and apply logical
> >> masks over one of the 64 channels in full duration, i.e. over all 35
> >> sessions concatenated.
> >
> >
> > This is really up to you and what makes sense in your mind. There are
> > advantages both ways. With long array you don't have to do any array
> > copies (merging) in memory. With the split up version, if there is an
> > error in one part, it doesn't take the whole calculation down with it.
> > So I would do what feels most natural.
> >
> > I hope this helps!
> >
> > Be Well
> > Anthony
> >
> >> I'd be very grateful for any advice of data layout for this amount of
> >> data.
> >>
> >> Thank you for Pytables, Francesc and the new governance team,
> >>
> >> Álvaro.
> >>
> >>
> >>
> >> ------------------------------------------------------------------------------
> >> This SF email is sponsored by:
> >> Try Windows Azure free for 90 days Click Here
> >> http://p.sf.net/sfu/sfd2d-msazure
> >> _______________________________________________
> >> Pytables-users mailing list
> >> Pytables-users@lists.sourceforge.net
> >> https://lists.sourceforge.net/lists/listinfo/pytables-users
> >
> >
> >
> >
>
>
>