Hi Anthony and Francesc, please bear with me for one more.
I was picturing this huge array as if it were in memory, being able to write nice indexing expressions against it, the kind one writes all the time with numpy; e.g. arr[fast & novel & checked, 1:4], where fast, novel and checked are boolean arrays (or collections of indexes) that I have precomputed. Would using this syntax in PyTables trigger an in-memory copy of all 70 GB (or at least of 3 of the 64 'detector channels', i.e. about 3.3 GB)? (Incidentally: is it a good idea to ask PyTables to index boolean arrays that are going to be used for these kinds of queries?) I had the (maybe unreasonable) expectation that PyTables could run this on chunks transparently and get away with it without ever loading that much data into memory (with CArrays?). (See the untested sketches further down in this message for exactly the kind of expressions I mean.)

This kind of notational convenience is very dear to me because I want to convert a C-based lab to using Python, and this is a clearly visible benefit for them.

So here is a question just to put the benefits of PyTables into perspective (again, please bear with me): what do I gain by keeping one big array in PyTables versus loading each of the 35 2 GB arrays in turn into memory (I have 32 GB of RAM) from binary files and operating on them with numpy constructs? What I am not getting is how PyTables can be faster than me chunking the data by hand into pieces that fit my memory and operating on them with lightning-fast numpy ufuncs.

If this sounds dumb to you, then let me offer to write an explanatory note for users in a similar situation, once I have sorted it out.

Best, and thanks again,

-á.

On Thu, Mar 15, 2012 at 18:51, Francesc Alted <fal...@gmail.com> wrote:
> On Mar 15, 2012, at 1:43 PM, Anthony Scopatz wrote:
>
>> Hello Alvaro
>>
>> On Thu, Mar 15, 2012 at 1:20 PM, Alvaro Tejero Cantero <alv...@minin.es> wrote:
>>
>> Hi!
>>
>> Thanks for the prompt answer. Actually I am not clear about switching from an NxM array to N columns (64 in my case). How do I make a rectangular selection with columns? With an NxM array I just do arr[10000:20000, 1:4] to select columns 1, 2, 3 and time samples 10000 to 20000.
>>
>> Tables are really 1D arrays of C-structs. They are basically equivalent in many ways to numpy structured arrays:
>> http://docs.scipy.org/doc/numpy/user/basics.rec.html
>> So there is no analogy to the 2D slice that you mention above.
>>
>> While it is easy to manipulate integer indices, from what I've read columns would have to have string identifiers, so I would be doing a lot of int<->str conversion?
>>
>> No, you don't do a lot of str <-> int conversions. The strs represent field names and only incidentally indexes.
>>
>> My recorded data is so homogeneous (a huge 70 GB matrix of integers) that I am a bit lost among all the mixed typing that seems to be the primary use case behind columns. If I were to stay with the NxM array, would reads still be chunked?
>>
>> You would need to use the CArray (chunked array) or the EArray (extensible array) for the underlying array on disk to be chunked. Reading can always be chunked by accessing a slice. This is true for all arrays and tables.
>>
>> On the other hand, if I want to use arr[:, 3] as a mask for another part of the array, is it more reasonable to have that be col3, in terms of PyTables?
>>
>> Reasonable is probably the wrong word here. It is more that tables do it one way and arrays do it another. If you are doing a lot of single-column-at-a-time access, then you should think about using Tables for this.
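Replying to this point inline, to make my question concrete: if I went the Table route, this is roughly what I would hope to write. It is completely untested, and the file name and field names (ch00..ch63 for the channels, plus fast, novel, checked for my precomputed flags) are just placeholders for my data:

    import numpy as np
    import tables

    h5 = tables.openFile('recording.h5', mode='r')
    samples = h5.root.samples        # a Table: one row per time sample

    # In-kernel query: the condition is evaluated block by block with
    # numexpr, so (if I understand correctly) only the matching rows
    # are ever materialized in RAM.
    hits = samples.readWhere('fast & novel & checked')

    # The old arr[..., 1:4] column slice becomes explicit field names.
    chans = np.column_stack([hits['ch01'], hits['ch02'], hits['ch03']])

    # If repeated queries on the flags are common, their columns could
    # be indexed, e.g. samples.cols.fast.createIndex() -- this is the
    # indexing question I asked above.

    h5.close()

If I read the docs correctly, where()/readWhere() should never pull the whole 70 GB into memory, only the selected rows -- please correct me if that is wrong.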
>>
>> I also get shivers when I see the looping constructs in the tutorial, mainly because I have learned to do only vectorized operations in numpy and never ever to write a Python loop / list comprehension.
>>
>> Ahh, so you have to understand which operations happen on the file and which happen on data that is already in memory. With numpy you don't want to use Python loops because everything is already in memory. However, with PyTables most of what you are doing is pulling data from disk into memory, so the Python loop overhead is small relative to the RAM <-> disk communication time.
>>
>> Most of the loops in PyTables are actually evaluated using numexpr iterators. Numexpr is a highly optimized way of collapsing numerical expressions. In short, you probably don't need to worry too much about Python loops (when you are new to the library) when operating on PyTables objects. You do need to worry about such loops on the numpy arrays that the PyTables objects return.
>
> Anthony is very right here. If you have very large amounts of data, you absolutely need to get used to the iterator concept, as it lets you run over your whole dataset without ever loading it into memory. Iterators in PyTables are one of its most powerful and effective constructions, so be sure that you master them if you want to get the most out of PyTables.
>
> -- Francesc Alted
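P.S. To check that I understand the iterator/chunking point above, this is how I picture sweeping the whole recording without ever holding 70 GB (or even one of my 2 GB files) in memory. Again untested; the node name, block size and the per-block reduction are placeholders for whatever I actually need to compute:

    import numpy as np
    import tables

    h5 = tables.openFile('recording.h5', mode='r')
    raw = h5.root.raw                   # CArray, shape (nsamples, 64), Int16

    nsamples = raw.shape[0]
    step = 2 * 1024 * 1024              # ~2M samples per block; tune freely
    acc = np.zeros(64, dtype=np.float64)

    # Each slice pulls only the HDF5 chunks it overlaps; at any moment
    # only `step` rows live in RAM, and the per-block work is plain
    # numpy ufuncs.
    for start in range(0, nsamples, step):
        stop = min(start + step, nsamples)
        block = raw[start:stop, :]      # an ordinary numpy array
        acc += block.sum(axis=0)

    mean_per_channel = acc / nsamples
    h5.close()

If that is roughly right, then the difference with my load-2-GB-files-by-hand scheme is mainly that PyTables picks the chunk boundaries and does the I/O (plus optional compression) for me, not that the ufuncs themselves get any faster -- is that a fair summary?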
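P.P.S. In case the Table route is the recommended one, this is how I imagine laying out the file in the first place: one Int16 field per detector channel plus the precomputed boolean flags. All names, the compression settings and the expectedrows guess are made up, and whether indexing the boolean flag columns is worthwhile is precisely what I was asking above:

    import tables

    # One Int16 field per detector channel, plus the precomputed flags.
    desc = dict(('ch%02d' % i, tables.Int16Col(pos=i)) for i in range(64))
    desc['fast'] = tables.BoolCol(pos=64)
    desc['novel'] = tables.BoolCol(pos=65)
    desc['checked'] = tables.BoolCol(pos=66)

    h5 = tables.openFile('recording.h5', mode='w')
    samples = h5.createTable('/', 'samples', desc,
                             filters=tables.Filters(complevel=1,
                                                    complib='blosc'),
                             expectedrows=500 * 1000 * 1000)

    # ... append rows in blocks with samples.append(block) ...

    # Optionally index the flag columns to speed up repeated where()
    # queries, e.g. samples.cols.fast.createIndex() -- assuming that
    # indexing boolean columns is supported and pays off here.

    h5.flush()
    h5.close()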