Hi Anthony and Francesc, please bear with me for one more.
I was picturing this huge array as if it were in memory, being able to write nice indexing expressions against it, the kind one writes all the time with numpy; e.g. arr[fast & novel & checked, 1:4], where fast, novel and checked are boolean arrays (or collections of indexes) that I have precomputed. Would using this syntax in PyTables trigger an in-memory copy of all 70 GB (or at least of 3 of the 64 'detector channels', i.e. about 3.3 GB)? (Incidentally: is it a good idea to ask PyTables to index boolean arrays that are going to be used for these kinds of queries?) I had the (maybe unreasonable) expectation that PyTables could run this on chunks transparently and get away with it without ever loading that much data into memory (with CArrays?). (See the untested sketches further down in this message for exactly the kind of expressions I mean.)

This kind of notational convenience is very dear to me because I want to convert a C-based lab to using Python, and this is a clearly visible benefit for them.

So here is a question just to put the benefits of PyTables into perspective (again, please bear with me): what do I gain by keeping one big array in PyTables versus loading each of the 35 2 GB arrays in turn into memory (I have 32 GB of RAM) from binary files and operating on them with numpy constructs? What I am not getting is how PyTables can be faster than me chunking the data by hand into pieces that fit my memory and operating on them with lightning-fast numpy ufuncs.

If this sounds dumb to you, then let me offer to write an explanatory note for users in a similar situation, once I have sorted it out.

Best, and thanks again,

-á.

On Thu, Mar 15, 2012 at 18:51, Francesc Alted <fal...@gmail.com> wrote:
> On Mar 15, 2012, at 1:43 PM, Anthony Scopatz wrote:
>
>> Hello Alvaro
>>
>> On Thu, Mar 15, 2012 at 1:20 PM, Alvaro Tejero Cantero <alv...@minin.es> wrote:
>>
>> Hi!
>>
>> Thanks for the prompt answer. Actually I am not clear about switching from an NxM array to N columns (64 in my case). How do I make a rectangular selection with columns? With an NxM array I just do arr[10000:20000, 1:4] to select columns 1, 2, 3 and time samples 10000 to 20000.
>>
>> Tables are really 1D arrays of C-structs. They are basically equivalent in many ways to numpy structured arrays:
>> http://docs.scipy.org/doc/numpy/user/basics.rec.html
>> So there is no analogy to the 2D slice that you mention above.
>>
>> While it is easy to manipulate integer indices, from what I've read columns would have to have string identifiers, so I would be doing a lot of int<->str conversion?
>>
>> No, you don't do a lot of str <-> int conversions. The strs represent field names and only incidentally indexes.
>>
>> My recorded data is so homogeneous (a huge 70 GB matrix of integers) that I am a bit lost among all the mixed typing that seems to be the primary use case behind columns. If I were to stay with the NxM array, would reads still be chunked?
>>
>> You would need to use the CArray (chunked array) or the EArray (extensible array) for the underlying array on disk to be chunked. Reading can always be chunked by accessing a slice. This is true for all arrays and tables.
>>
>> On the other hand, if I want to use arr[:, 3] as a mask for another part of the array, is it more reasonable to have that be col3, in terms of PyTables?
>>
>> Reasonable is probably the wrong word here. It is more that tables do it one way and arrays do it another. If you are doing a lot of single-column-at-a-time access, then you should think about using Tables for this.
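Replying to this point inline, to make my question concrete: if I went the Table route, this is roughly what I would hope to write. It is completely untested, and the file name and field names (ch00..ch63 for the channels, plus fast, novel, checked for my precomputed flags) are just placeholders for my data:

    import numpy as np
    import tables

    h5 = tables.openFile('recording.h5', mode='r')
    samples = h5.root.samples        # a Table: one row per time sample

    # In-kernel query: the condition is evaluated block by block with
    # numexpr, so (if I understand correctly) only the matching rows
    # are ever materialized in RAM.
    hits = samples.readWhere('fast & novel & checked')

    # The old arr[..., 1:4] column slice becomes explicit field names.
    chans = np.column_stack([hits['ch01'], hits['ch02'], hits['ch03']])

    # If repeated queries on the flags are common, their columns could
    # be indexed, e.g. samples.cols.fast.createIndex() -- this is the
    # indexing question I asked above.

    h5.close()

If I read the docs correctly, where()/readWhere() should never pull the whole 70 GB into memory, only the selected rows -- please correct me if that is wrong.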
>>
>> I also get shivers when I see the looping constructs in the tutorial, mainly because I have learned to do only vectorized operations in numpy and never ever to write a Python loop / list comprehension.
>>
>> Ahh, so you have to understand which operations happen on the file and which happen on data that is already in memory. With numpy you don't want to use Python loops because everything is already in memory. However, with PyTables most of what you are doing is pulling data from disk into memory, so the Python loop overhead is small relative to the RAM <-> disk communication time.
>>
>> Most of the loops in PyTables are actually evaluated using numexpr iterators. Numexpr is a highly optimized way of collapsing numerical expressions. In short, you probably don't need to worry too much about Python loops (when you are new to the library) when operating on PyTables objects. You do need to worry about such loops on the numpy arrays that the PyTables objects return.
>
> Anthony is very right here. If you have very large amounts of data, you absolutely need to get used to the iterator concept, as it lets you run over your whole dataset without ever loading it into memory. Iterators in PyTables are one of its most powerful and effective constructions, so be sure that you master them if you want to get the most out of PyTables.
>
> -- Francesc Alted
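P.S. To check that I understand the iterator/chunking point above, this is how I picture sweeping the whole recording without ever holding 70 GB (or even one of my 2 GB files) in memory. Again untested; the node name, block size and the per-block reduction are placeholders for whatever I actually need to compute:

    import numpy as np
    import tables

    h5 = tables.openFile('recording.h5', mode='r')
    raw = h5.root.raw                   # CArray, shape (nsamples, 64), Int16

    nsamples = raw.shape[0]
    step = 2 * 1024 * 1024              # ~2M samples per block; tune freely
    acc = np.zeros(64, dtype=np.float64)

    # Each slice pulls only the HDF5 chunks it overlaps; at any moment
    # only `step` rows live in RAM, and the per-block work is plain
    # numpy ufuncs.
    for start in range(0, nsamples, step):
        stop = min(start + step, nsamples)
        block = raw[start:stop, :]      # an ordinary numpy array
        acc += block.sum(axis=0)

    mean_per_channel = acc / nsamples
    h5.close()

If that is roughly right, then the difference with my load-2-GB-files-by-hand scheme is mainly that PyTables picks the chunk boundaries and does the I/O (plus optional compression) for me, not that the ufuncs themselves get any faster -- is that a fair summary?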
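P.P.S. In case the Table route is the recommended one, this is how I imagine laying out the file in the first place: one Int16 field per detector channel plus the precomputed boolean flags. All names, the compression settings and the expectedrows guess are made up, and whether indexing the boolean flag columns is worthwhile is precisely what I was asking above:

    import tables

    # One Int16 field per detector channel, plus the precomputed flags.
    desc = dict(('ch%02d' % i, tables.Int16Col(pos=i)) for i in range(64))
    desc['fast'] = tables.BoolCol(pos=64)
    desc['novel'] = tables.BoolCol(pos=65)
    desc['checked'] = tables.BoolCol(pos=66)

    h5 = tables.openFile('recording.h5', mode='w')
    samples = h5.createTable('/', 'samples', desc,
                             filters=tables.Filters(complevel=1,
                                                    complib='blosc'),
                             expectedrows=500 * 1000 * 1000)

    # ... append rows in blocks with samples.append(block) ...

    # Optionally index the flag columns to speed up repeated where()
    # queries, e.g. samples.cols.fast.createIndex() -- assuming that
    # indexing boolean columns is supported and pays off here.

    h5.flush()
    h5.close()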