Thank you for these e-mails with so many useful tips! This is
definitely a start. I will report what I find!

Cheers,

-á.



On Fri, Mar 16, 2012 at 15:00, Francesc Alted <fal...@gmail.com> wrote:
> On Mar 16, 2012, at 1:55 AM, Alvaro Tejero Cantero wrote:
>
>> Thanks Francesc, we're getting there :).
>>
>> Some more precise questions below.
>>
>>> Here it is how you can do that in PyTables:
>>>
>>> my_condition = '(col1 > 0.5) & (col2 < 24) & (col3 == "novel")'
>>> # Iterate only over the rows matching the condition
>>> mycol4_values = [ r['col4'] for r in tbl.where(my_condition) ]
>>
>> Ok, but storing data in Table columns when I also want to operate
>> across those columns means that I cannot use NumPy operations along
>> that dimension. I can either do things in a loop (such as taking the
>> max of two numbers) or resort to the subset of operations supported by
>> numexpr. To give an example: if I wanted to take the FFT across the
>> [i]-th values of col01...col64, I would have to do something like
>> numpy.fft.fft(izip(col01, ..., col64)) (potentially .append()ing the
>> result to another column in the process), as opposed to
>> newcol = numpy.fft.fft(largearr, axis=whatever) on a plain array.
>> Is this correct?
>
> Yes.  PyTables does not let you perform the broad range of NumPy 
> operations, but only a restricted subset (this is a consequence of the 
> fact that PyTables tries to work out-of-core, and not all problems can 
> easily be solved this way).  So, for cases where PyTables does not 
> implement this support, you are going to need to read the whole 
> column/array into memory.  My point is that you don't need to do that 
> for *queries* or for the restricted subset of out-of-core operations 
> that PyTables implements.
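>
> To illustrate the read-into-memory route, here is a minimal sketch (the 
> file and table names are made up; it assumes 64 Float64 columns named 
> col01..col64, as in your example):
>
> import numpy as np
> import tables
>
> h5 = tables.openFile('data.h5')   # hypothetical file; PyTables 2.x API
> tbl = h5.root.mytable             # hypothetical table node
> # Gather the 64 channels into one (nrows, 64) in-memory NumPy array
> data = np.column_stack([tbl.col('col%02d' % i) for i in range(1, 65)])
> # From here on any NumPy operation works, e.g. an FFT along the time axis
> spectra = np.fft.fft(data, axis=0)
> h5.close()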
>
>>>> (Incidentally: is it a good idea to ask PyTables to index boolean
>>>> arrays that are going to be used for these kinds of queries?)
>>>
>>> No.  It is much better to index the columns on which you are going to 
>>> perform your queries, and then evaluate boolean expressions on top of 
>>> the indexes.
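>>>
>>> For instance, a minimal sketch reusing the column names from the
>>> example above (which columns to index depends on your queries):
>>>
>>> # Index the columns used in query conditions, not the boolean results
>>> tbl.cols.col1.createIndex()
>>> tbl.cols.col2.createIndex()
>>> tbl.cols.col3.createIndex()
>>> # where() will now use the indexes when evaluating this expression
>>> hits = [ r['col4'] for r in tbl.where('(col1 > 0.5) & (col2 < 24)') ]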
>>
>>
>> Ok, I am just skeptical about how much indexing can do with the
>> random-looking data that I have (see this for an example:
>> http://media.wiley.com/wires/WCS/WCS1164/nfig008.gif).
>
> Indexing is a process that precomputes a classification on your data, and 
> certainly can work well on this sort of data.  I'm not saying that it will 
> work in your scenario, but rather that you should give it a try if you want 
> to know.
>
>> I also wonder how much compression will help in this scenario.
>
> Looks to me like a perfect time series, and not as random as you suggest.  
> You will want to use the shuffle filter here, as it is going to be beneficial 
> for sure.  But my suggestion is: do your experiments first and then worry 
> about the results.
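>
> For concreteness, a minimal sketch of enabling compression plus the
> shuffle filter at creation time (the names, complib and complevel here
> are just placeholders):
>
> import tables
>
> class Channels(tables.IsDescription):
>     timestamp = tables.Float64Col()
>     value     = tables.Float64Col()
>
> # shuffle=True turns on the shuffle filter; zlib level 5 is an example
> filters = tables.Filters(complevel=5, complib='zlib', shuffle=True)
> h5 = tables.openFile('data.h5', 'w')
> tbl = h5.createTable(h5.root, 'channels', Channels, filters=filters)
> h5.close()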
>
>>> Yes, it is reasonable if you can express your problem in terms of tables 
>>> (please note that the Table object of PyTables does have support for 
>>> multidimensional columns).
>>
>> But that is just my question! From the point of view of the
>> abstractions in PyTables (when querying, when compressing, when
>> indexing), is it better to create many columns, even if they are
>> completely homogeneous and tedious to manage separately, or is it
>> better to use one huge leaf consisting of all the columns put together
>> in an array?
>
> This largely depends on your problem.  PyTables has abstractions for tabular 
> data (Table), arrays (Array, CArray, EArray) and variable length arrays 
> (VLArray).  It is up to you to come up with a good way to express your 
> problem on top of these.
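>
> As an illustration of the multidimensional-column option mentioned
> earlier, a sketch (the (64,) shape just mirrors your 64 channels; names
> are made up):
>
> import tables
>
> class Recording(tables.IsDescription):
>     time     = tables.Float64Col()
>     channels = tables.Float64Col(shape=(64,))  # one 64-wide cell per row
>
> h5 = tables.openFile('recording.h5', 'w')
> tbl = h5.createTable(h5.root, 'recording', Recording)
> h5.close()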
>
>>>  Another possibility is to put all your potential indexes on a table, index 
>>> them and then use them to access data in the big array object.  Example:
>>>
>>> # Row coordinates of the rows satisfying the condition...
>>> indexes = tbl.getWhereList(my_condition)
>>> # ...used to fancy-index the big array object
>>> my_values = arr[indexes, 1:4]
>>
>> Ok, this is really useful (numpy.where on steroids?), because I should
>> be able to reduce my original 64 columns to a few processed ones that
>> will then be used for queries.
>
> Ok, so hopefully we have found a start for providing a structure to your 
> problem :)
>
> -- Francesc Alted

_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
