Hola Alvaro,

On Mar 15, 2012, at 4:58 PM, Alvaro Tejero Cantero wrote:
> Hi Anthony and Francesc,
>
> please bear with me for one more.
>
> I was thinking of having this huge array in memory and being able to
> write nice indexing expressions, the kind that one writes all the time
> with numpy; e.g.
>
> arr[ fast && novel && checked, 1:4]
>
> where fast, novel and checked are boolean arrays (or collections of
> indexes) that I have precomputed.

What PyTables allows you to do is, once you have expressed your problem in
terms of tables (that is, collections of records, a la NumPy structured
arrays), query your data by using potentially complex boolean expressions.
For example, let's suppose that you have a table with 5 columns and you
want to extract the values in col4 that fulfill some condition, say
'(col1 > 0.5) & (col2 < 24) & (col3 == "novel")'.  Note that condition
strings use the numexpr operators '&', '|' and '~' rather than '&&'.
Here is how you can do that in PyTables:

my_condition = '(col1 > 0.5) & (col2 < 24) & (col3 == "novel")'
mycol4_values = [ r['col4'] for r in tbl.where(my_condition) ]

Of course, you won't need to precompute the conditions and put them in big
boolean arrays, which saves a lot of memory.

> Would using this syntax in PyTables trigger a copy in memory of all
> 70Gb (or at least 3 out of 64 'detector channels'), i.e. about 3.3 Gb?

With your original syntax, yes.  With the syntax that I proposed to you, no.

> (Incidentally: is it a good idea to ask PyTables to index boolean
> arrays that are going to be used for these kinds of queries?)

No.  It is much better to index the columns on which you are going to
perform your queries, and then evaluate the boolean expressions on top of
the indexes.

> I had the (maybe unreasonable) expectation that PyTables could run
> this on chunks transparently and get away with it without ever loading
> such an amount of data into memory (with Carrays?).

Yes, that is reasonable if you can express your problem in terms of tables
(please note that the Table object of PyTables does have support for
multidimensional columns).  Another possibility is to put all your
potential indexes in a table, index them and then use them to access the
data in the big array object.  Example (see also the fuller sketch at the
end of this message):

indexes = tbl.getWhereList(my_condition)
my_values = arr[indexes, 1:4]

> This kind of
> notational convenience is very dear to me because I want to convert a
> C-based lab to using Python, and this is a clearly visible benefit for
> them.

You may love the NumPy notation, but if you are going to use PyTables,
then I think you will need to start loving the `Table.where` iterator and
friends too :)

> So here's a question just to put into perspective the benefits of
> PyTables (again, please bear with me). What are the gains of using a
> big array in PyTables vs. having each of the 35 2Gb arrays loaded in
> turn into memory (I have 32Gb of RAM) from binary files and operated
> upon with numpy constructs?

Basically two:

1) Much less memory consumption.

2) If you use the indexing capability, your queries will be faster (i.e.
you don't need to run through all your values, as PyTables uses the
indexes on disk).

> What I am not getting is how PyTables can be faster than me chunking
> the data by hand into reasonable pieces for my memory and operating on
> it through lightning-fast numpy ufuncs…

It's easy, just by using a very traditional technology called indexing:

http://en.wikipedia.org/wiki/Database_index

> If it sounds dumb to you, then let me offer to write an
> explanatory note for users in a similar case to mine, once I have
> sorted it out.

Hope things are clearer now.
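Just to put the pieces together, here is a small end-to-end sketch of the
workflow above.  It uses the PyTables 2.x API (openFile, createTable,
createCArray, createIndex); all the concrete names in it (the 'detector.h5'
file, the col1..col4 layout, the 1000x64 array) are invented for
illustration, so adapt them to your own data:

import numpy as np
import tables

# Description of one record: the five columns of the example above.
class Record(tables.IsDescription):
    col1 = tables.Float64Col()
    col2 = tables.Int32Col()
    col3 = tables.StringCol(16)
    col4 = tables.Float64Col()

fileh = tables.openFile('detector.h5', mode='w')

# 1) A table holding the per-record values you want to query on.
tbl = fileh.createTable('/', 'records', Record)
row = tbl.row
for i in range(1000):
    row['col1'] = np.random.rand()
    row['col2'] = np.random.randint(0, 100)
    row['col3'] = 'novel' if i % 3 == 0 else 'known'
    row['col4'] = np.random.rand()
    row.append()
tbl.flush()

# 2) Index the columns that appear in your conditions (done once, kept on disk).
tbl.cols.col1.createIndex()
tbl.cols.col2.createIndex()
tbl.cols.col3.createIndex()

# 3) Query with a condition string; only the matching rows are read.
my_condition = '(col1 > 0.5) & (col2 < 24) & (col3 == "novel")'
mycol4_values = [r['col4'] for r in tbl.where(my_condition)]

# 4) Or get the matching row numbers and use them against a big on-disk array.
arr = fileh.createCArray('/', 'arr', tables.Float64Atom(), shape=(1000, 64))
arr[:] = np.random.rand(1000, 64)
indexes = tbl.getWhereList(my_condition)
# This reads only the selected rows; with a recent PyTables you may be able
# to write it directly as arr[indexes, 1:4], as suggested above.
my_values = np.array([arr[int(i), 1:4] for i in indexes])

fileh.close()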
Hasta luego,

-- 
Francesc Alted