On Mar 16, 2012, at 1:55 AM, Alvaro Tejero Cantero wrote:

> Thanks Francesc, we're getting there :).
> 
> Some more precise questions below.
> 
>> Here it is how you can do that in PyTables:
>> 
>> my_condition = '(col1>0.5) & (col2<24) & (col3 == "novel")'
>> mycol4_values = [ r['col4'] for r in tbl.where(my_condition) ]
> 
> ok, but having data that I also want to operate on *across* Table
> columns means that I cannot use numpy operations along that dimension.
> I can either do things in a loop (such as taking the max of two
> numbers) or resort to the subset of operations supported by numexpr.
> To give an example: if I wanted to do the FFT over the [i]-th values
> of col01...col64, I would do numpy.fft(izip(col01, ..., col64))
> (potentially appending the result to another column in the process),
> vs. newcol = numpy.fft(largearr, axis=whatever). Is this correct?

Yes.  PyTables does not let you perform the broad range of NumPy operations, 
only a restricted subset (this is a consequence of the fact that PyTables 
tries to work out-of-core, and not all problems can easily be solved this 
way).  So, for cases where PyTables does not implement the operation you need, 
you are going to have to read the whole column/array into memory.  My point is 
that you don't need to do that for *queries* or for the restricted subset of 
out-of-core operations that PyTables does implement.
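
For completeness, a minimal sketch of the "read into memory, then use NumPy"
path (untested; it assumes a table tbl with 64 float columns named
col01...col64):

import numpy as np

# read each column into memory and stack them into a (nrows, 64) array
names = ['col%02d' % i for i in range(1, 65)]
data = np.column_stack([tbl.col(name) for name in names])

# once in memory, any NumPy operation is available, e.g. an FFT per row
spectra = np.fft.fft(data, axis=1)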

>>> (Incidentally: is it a good idea to ask PyTables to index boolean
>>> arrays that are going to be used for these kinds of queries?)
>> 
>> No.  It is much better to index the columns on which you are going to 
>> perform your queries, and then evaluate boolean expressions on top of the 
>> indexes.
> 
> 
> Ok, I am just skeptical about how much indexing can do with the
> random-looking data that I have (see this for an example:
> http://media.wiley.com/wires/WCS/WCS1164/nfig008.gif).

Indexing is a process that precomputes a classification of your data, and it 
can certainly work well on this sort of data.  I'm not saying that it will work 
in your scenario, but rather that you should give it a try if you want to know.
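
If you want to give it a try, creating an index is a one-liner per column.
A sketch, assuming PyTables >= 2.3 (where the OPSI indexing engine is
included) and the column names from the earlier example:

# index the columns that appear in your query conditions
tbl.cols.col1.createIndex()
tbl.cols.col2.createIndex()
tbl.cols.col3.createIndex()

# queries on these columns will use the indexes automatically
hits = [r['col4'] for r in tbl.where('(col1 > 0.5) & (col2 < 24)')]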

> I also wonder how much compression will help in this scenario.

Looks to me like a perfect time series, and not as random as you suggest.  You 
will want to use the shuffle filter here, as it is going to be beneficial for 
sure.  But my suggestion is: do your experiments first and then worry about the 
results.
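
Enabling shuffle (plus compression) is just a matter of passing a Filters
instance when you create the leaf.  A minimal sketch, assuming an open file
handle h5file and a table description MyDescription; the complevel/complib
choices are only an example:

import tables

# shuffle + zlib compression; 'blosc' or 'lzo' are other complib options
filters = tables.Filters(complevel=5, complib='zlib', shuffle=True)
tbl = h5file.createTable('/', 'mytable', MyDescription, filters=filters)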

>> Yes, it is reasonable if you can express your problem in terms of tables 
>> (please note that the Table object of PyTables does have support for 
>> multidimensional columns).
> 
> But that is just my question! From the point of view of the
> abstractions in PyTables (when querying, when compressing, when
> indexing), is it better to create many columns, even if they are
> completely homogeneous and tedious to manage separately, or is it
> better to have one huge leaf consisting of all the columns put
> together in an array?

This largely depends on your problem.  PyTables has abstractions for tabular 
data (Table), arrays (Array, CArray, EArray) and variable length arrays 
(VLArray).  It is up to you to come up with a good way to express your problem 
on top of these.
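
As an illustration of the multidimensional-column option mentioned above,
here is a sketch of a Table description mixing scalar columns with one
multidimensional column (the names and shapes are made up):

import tables

class Recording(tables.IsDescription):
    timestamp = tables.Float64Col()             # scalar column
    label     = tables.StringCol(16)            # fixed-length string column
    samples   = tables.Float32Col(shape=(64,))  # one 64-wide column instead
                                                # of col01...col64

tbl = h5file.createTable('/', 'recordings', Recording)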

>>  Another possibility is to put all your potential indexes on a table, index 
>> them and then use them to access data in the big array object.  Example:
>> 
>> indexes = tbl.getWhereList(my_condition)
>> my_values = arr[indexes, 1:4]
> 
> Ok, this is really useful (numpy.where on steroids?), because I should
> be able to reduce my original 64 columns to a few processed ones that
> will then be used for queries.

Ok, so hopefully we have found a start for providing a structure to your 
problem :)
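
Putting the last two pieces together, a rough sketch of that structure:
a small table of derived columns used for querying, and a big array holding
the raw data, one array row per table row (the names summary, raw, peak and
label are just placeholders):

import numpy as np

# query the small table of derived, indexed columns...
coords = summary.getWhereList('(peak > 0.5) & (label == "novel")')

# ...and use the returned row coordinates to pull rows from the big array
selected = np.array([raw[i] for i in coords])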

-- Francesc Alted