Hi Alvaro,

I just want to second what Francesc is saying.  You do have to get
used to the fact that you can't do *everything* that numpy can do
inside of PyTables.  However, you can do the important stuff, which
lets you pull out only the data that you are interested in.  In fact,
PyTables lets you do many more out-of-core calculations than almost
every other database library (including HDF5 itself).

I think that you are now at the point where you probably should implement
(on a small scale) a couple of different abstractions.  Do some timings
with these and see which one fits your brain the best.  If you run into
problems, we always welcome questions about specific code, so feel free
to send it our way!
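
Just to make the suggestion concrete, here is a rough, untested sketch of
two layouts you might time against each other (the file and node names are
made up, and it uses the same camelCase calls as elsewhere in this thread):

import numpy as np
import tables

NCHAN = 64

# Layout A: one Table column per channel (each column can be queried/indexed)
class Channels(tables.IsDescription):
    time  = tables.Float64Col()
    col01 = tables.Float32Col()
    col02 = tables.Float32Col()
    # ... and so on, one Float32Col per channel up to col64

f = tables.openFile('channels.h5', mode='w')
tbl = f.createTable('/', 'as_table', Channels)

# Layout B: a single chunked, extendable array (fast whole-block math)
arr = f.createEArray('/', 'as_array', tables.Float32Atom(), shape=(0, NCHAN))
arr.append(np.zeros((1000, NCHAN), dtype='float32'))
f.close()

Run your typical queries and reads against both and see which feels right.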

Be Well
Anthony

On Fri, Mar 16, 2012 at 10:00 AM, Francesc Alted <fal...@gmail.com> wrote:

> On Mar 16, 2012, at 1:55 AM, Alvaro Tejero Cantero wrote:
>
> > Thanks Francesc, we're getting there :).
> >
> > Some more precise questions below.
> >
> >> Here is how you can do that in PyTables:
> >>
> >> my_condition = '(col1>0.5) & (col2<24) & (col3 == "novel")'
> >> mycol4_values = [ r['col4'] for r in tbl.where(my_condition) ]
> >
> > ok, but having data that I also want to operate on *across* Table
> > columns means that I cannot use numpy operations along that dimension.
> > I can either do things in a loop (such as taking the max of two
> > numbers) or resort to the subset of operations supported by numexpr.
> > To give an example: if I wanted to take the FFT over the [i]-th values
> > of col01...col64, I would do numpy.fft.fft(izip(col01,..,col64))?
> > (potentially appending the result to another column in the process)
> > vs. newcol = numpy.fft.fft(largearr, axis=whatever).  Is this correct?
>
> Yes.  PyTables does not let you perform the broad range of NumPy
> operations, but only a restricted subset (this is a consequence of the
> fact that PyTables functionality tries to be out-of-core, and not all
> problems can easily be solved this way).  So, for cases where PyTables
> does not implement this support, you are going to need to read the whole
> column/array into memory.  My point is that you don't need to do that for
> *queries* or for the restricted subset of out-of-core operations that
> PyTables implements.
>
> >>> (Incidentally: is it a good idea to ask PyTables to index boolean
> >>> arrays that are going to be used for these kinds of queries?)
> >>
> >> No.  It is much better to index the columns on which you are going to
> >> perform your queries, and then evaluate boolean expressions on top of
> >> the indexes.
> >
> >
> > Ok, I am just skeptical about how much indexing can do with the
> > random-looking data that I have (see this for an example:
> > http://media.wiley.com/wires/WCS/WCS1164/nfig008.gif).
>
> Indexing is a process that precomputes a classification on your data, and
> certainly can work well on this sort of data.  I'm not saying that it will
> work in your scenario, but rather that you should give it a try if you want
> to know.
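>
> In case it helps, creating an index is a one-liner per column (untested
> sketch; the column names are placeholders and tbl is your Table):
>
> tbl.cols.col1.createIndex()
> tbl.cols.col2.createIndex()
> hits = [r['col4'] for r in tbl.where('(col1 > 0.5) & (col2 < 24)')]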
>
> > I also wonder how much compression will help in this scenario.
>
> Looks to me like a perfect time series, and not as random as you suggest.
>  You will want to use the shuffle filter here, as it is going to be
> beneficial for sure.  But my suggestion is: do your experiments first and
> then worry about the results.
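>
> To experiment with that, something along these lines should do (untested
> sketch; f is an open file handle, and 'raw' and nrows are placeholders):
>
> import tables
> filters = tables.Filters(complevel=5, complib='blosc', shuffle=True)
> arr = f.createCArray('/', 'raw', tables.Float32Atom(),
>                      shape=(nrows, 64), filters=filters)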
>
> >> Yes, it is reasonable if you can express your problem in terms of
> >> tables (please note that the Table object of PyTables does have
> >> support for multidimensional columns).
> >
> > But that is just my question!  From the point of view of the
> > abstractions in PyTables (when querying, when compressing, when
> > indexing), is it better to create many columns, even if they are
> > completely homogeneous and tedious to manage separately, or is it
> > better to have one huge leaf consisting of all the columns put
> > together in an array?
>
> This largely depends on your problem.  PyTables has abstractions for
> tabular data (Table), arrays (Array, CArray, EArray) and variable length
> arrays (VLArray).  It is up to you to come up with a good way to express
> your problem on top of these.
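>
> As one example (just a sketch; the description and shapes are made up,
> and f is an open file handle), a Table can keep all 64 channels in a
> single multidimensional column, next to scalar columns that you may want
> to query or index:
>
> import tables
> class Record(tables.IsDescription):
>     time = tables.Float64Col()
>     samples = tables.Float32Col(shape=(64,))   # one row = 64 channels
>
> tbl = f.createTable('/', 'records', Record)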
>
> >> Another possibility is to put all your potential index columns in a
> >> table, index them, and then use them to access data in the big array
> >> object.  Example:
> >>
> >> indexes = tbl.getWhereList(my_condition)
> >> my_values = arr[indexes, 1:4]
> >
> > Ok, this is really useful (numpy.where on steroids?), because I should
> > be able to reduce my original 64 columns to a few processed ones that
> > will then be used for queries.
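>
> To make that combined pattern concrete, an untested sketch (the feature
> and node names are made up): keep a small table of derived per-sample
> features, query it, and use the resulting coordinates on the big array:
>
> import numpy as np
> coords = feat_tbl.getWhereList('(feat1 > 0.5) & (feat2 < 24)')
> features = feat_tbl.readCoordinates(coords)        # matching rows
> raw = np.array([big_arr[i, 1:4] for i in coords])  # rows from the array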
>
> Ok, so hopefully we have found a start for providing a structure to your
> problem :)
>
> -- Francesc Alted