Thank you for these e-mails with so many useful tips! This is definitely a start. I will report what I find!
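For the record, here is roughly what I intend to try first, based on
your suggestions below. This is just a minimal, untested sketch with
made-up file, table and column names (PyTables 2.x API):

    import tables

    h5 = tables.openFile('recordings.h5', mode='a')
    tbl = h5.root.features   # table holding the per-event summary columns
    arr = h5.root.raw        # big array holding the raw channels

    # Index only the columns that will appear in queries.
    tbl.cols.col1.createIndex()
    tbl.cols.col2.createIndex()

    # Query on the indexed columns, then use the resulting row
    # coordinates to pull the matching rows out of the big array.
    indexes = tbl.getWhereList('(col1 > 0.5) & (col2 < 24)')
    my_values = arr[indexes, 1:4]

    h5.close()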
Cheers,

-á.

On Fri, Mar 16, 2012 at 15:00, Francesc Alted <fal...@gmail.com> wrote:
> On Mar 16, 2012, at 1:55 AM, Alvaro Tejero Cantero wrote:
>
>> Thanks Francesc, we're getting there :).
>>
>> Some more precise questions below.
>>
>>> Here is how you can do that in PyTables:
>>>
>>> my_condition = '(col1 > 0.5) & (col2 < 24) & (col3 == "novel")'
>>> mycol4_values = [r['col4'] for r in tbl.where(my_condition)]
>>
>> Ok, but having data upon which I want to operate across the columns
>> of a Table means that I cannot use numpy operations along that
>> dimension. I can either do things in a loop (such as taking the max
>> of two numbers) or resort to the subset of operations supported by
>> numexpr. To give an example: if I wanted to do the FFT of all [i]
>> values of col01...col64, I would do
>> numpy.fft(izip(col01, ..., col64))? (potentially .appending the
>> result to another column in the process) vs.
>> newcol = numpy.fft(largearr, axis=whatever). Is this correct?
>
> Yes. PyTables does not let you perform the broad range of NumPy
> operations, only a restricted subset (this is a consequence of the
> fact that PyTables functionality tries to be out-of-core, and not all
> problems can easily be solved this way). So, for cases where PyTables
> does not implement this support, you are going to need to read the
> whole column/array into memory. My point is that you don't need to do
> that for *queries* or for the restricted subset of out-of-core
> operations that PyTables implements.
>
>>>> (Incidentally: is it a good idea to ask PyTables to index boolean
>>>> arrays that are going to be used for these kinds of queries?)
>>>
>>> No. It is much better to index the columns on which you are going
>>> to perform your queries, and then evaluate boolean expressions on
>>> top of the indexes.
>>
>> Ok, I am just skeptical about how much indexing can do with the
>> random-looking data that I have (see this for an example:
>> http://media.wiley.com/wires/WCS/WCS1164/nfig008.gif).
>
> Indexing is a process that precomputes a classification of your data,
> and it can certainly work well on this sort of data. I'm not saying
> that it will work in your scenario, but rather that you should give
> it a try if you want to know.
>
>> I also wonder how much compression will help in this scenario.
>
> It looks to me like a perfect time series, and not as random as you
> suggest. You will want to use the shuffle filter here, as it is going
> to be beneficial for sure. But my suggestion is: do your experiments
> first and then worry about the results.
>
>>> Yes, it is reasonable if you can express your problem in terms of
>>> tables (please note that the Table object of PyTables does have
>>> support for multidimensional columns).
>>
>> But that is just my question! From the point of view of the
>> abstractions in PyTables (when querying, when compressing, when
>> indexing), is it better to create many columns, even if they are
>> completely homogeneous and tedious to manage separately, or is it
>> better to have one huge leaf consisting of all the columns put
>> together in an array?
>
> This largely depends on your problem. PyTables has abstractions for
> tabular data (Table), arrays (Array, CArray, EArray) and variable
> length arrays (VLArray). It is up to you to come up with a good way
> to express your problem on top of these.
>
>>> Another possibility is to put all your potential indexes in a
>>> table, index them and then use them to access data in the big array
>>> object.
>>> Example:
>>>
>>> indexes = tbl.getWhereList(my_condition)
>>> my_values = arr[indexes, 1:4]
>>
>> Ok, this is really useful (numpy.where on steroids?), because I
>> should be able to reduce my original 64 columns to a few processed
>> ones that will then be used for queries.
>
> Ok, so hopefully we have found a start for providing a structure to
> your problem :)
>
> -- Francesc Alted
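P.S. Regarding compression: if I understood you correctly, something
along these lines should give the shuffle filter a chance on the raw
64-channel time series. Again just a sketch with invented names; the
compression level and library are only a first guess:

    import numpy as np
    import tables

    h5 = tables.openFile('recordings.h5', mode='w')
    filters = tables.Filters(complevel=5, complib='blosc', shuffle=True)

    # Extendable array for the 64-channel recording, compressed with
    # the shuffle filter enabled as you recommend.
    raw = h5.createEArray(h5.root, 'raw', tables.Float32Atom(),
                          shape=(0, 64), filters=filters)
    raw.append(np.random.randn(10000, 64).astype('float32'))  # dummy data

    # For whole-array operations like the FFT I would read into memory:
    spectra = np.fft.fft(raw[:], axis=0)

    h5.close()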