Re: [Pytables-users] Combining in-kernel queries with out-of-core computations

Jon Wilson Tue, 05 Jun 2012 20:32:24 -0700

Hi Anthony,
Allow me to clarify.  I wish to perform a reduction (histogramming,
specifically) over a function of some values, but only including certain
rows.  For instance, say I have a table with four columns, col0 through
col3.  I would like to create a histogram of col0 + col1**2, but only
where col2 > 15 and abs(col3) < 5.  As far as I understand, I can do the
following:
histogram(array([row['col0'] + row['col1']**2 for row in
mytable.where('(col2 > 15) & (abs(col3) < 5)')]))

And this does produce the desired histogram.  However, mytable.where()
returns an iterator over rows, and then the list comprehension computes
col0 + col1**2 for each row in Python space, which lacks the
optimization and multithreading of the numexpr kernel.  It seems as
though it should be possible to have both the condition and the
histogramming variable (col0 + col1**2) computed in the parallelized and
optimized numexpr kernel, but I do not see a way to do this using where().

The alternative that I can see would be to do something like:
histvar = tables.Expr('col0 + col1**2', vars(mytable.cols)).eval()
selection = tables.Expr('(col2 > 15) & (abs(col3) < 5)',
vars(mytable.cols)).eval()
histogram(histvar, weights = selection)

This should produce the same histogram as above, and it does compute
both the histogram variable and the query condition in the numexpr
kernel, but it requires the computation of the histogram variable even
for rows I do not wish to include in the histogram.  If the table is
very large and relatively few rows are selected, or if computing the
histogram variable is expensive, this is quite undesirable.
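
To make the equivalence of the two approaches concrete, here is a minimal
sketch that uses plain NumPy arrays as stand-ins for the table columns.
The data, the bin edges, and the variable names are assumptions made for
illustration; nothing here touches PyTables or numexpr.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
# In-memory stand-ins for the table columns (hypothetical data).
col0 = rng.normal(size=n)
col1 = rng.normal(size=n)
col2 = rng.uniform(0, 30, size=n)
col3 = rng.normal(scale=5, size=n)

# Fixed bin edges, so that both histograms are directly comparable.
bins = np.linspace(-5, 30, 36)

# Approach 1: select rows first, then compute the variable only for them.
mask = (col2 > 15) & (np.abs(col3) < 5)
hist_select, _ = np.histogram(col0[mask] + col1[mask] ** 2, bins=bins)

# Approach 2: compute the variable for every row, then weight by the mask
# (weight 1.0 for selected rows, 0.0 for the rest).
histvar = col0 + col1 ** 2
hist_weight, _ = np.histogram(histvar, bins=bins, weights=mask.astype(float))

print(np.array_equal(hist_select, hist_weight))
```

Approach 2 evaluates col0 + col1**2 for all n rows, while approach 1
evaluates it only for the rows that pass the selection.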

So, it seems that I can either a) use the fast query operator where();
or, b) perform all computation in numexpr.  But not both.

FWIW, a quick timeit test shows that, on a table with ~1M rows, for a
very simple condition and a very simple histogram variable, the first
method is faster than the second method even when all rows are selected.
For a table with ~7M rows, for a more complex histogram variable and
still a very simple condition, the first method is faster than the
second method when only a few rows are selected, but when all rows are
selected, the second method is more than 10x faster (method 1 vs.
method 2: 2.16s vs 3.27s for few rows, 43.1s vs 3.19s for all 7M rows).
So it is clear that in some cases, method 2 could be sped up
substantially, and in other cases, method 1 could be sped up enormously.
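
A comparison of this kind can be set up along the following lines.  This
is only a sketch of a timing harness: it uses NumPy arrays in memory
instead of a PyTables table, the function names and the thresholds are
made up for illustration, and it will not reproduce the numbers above.

```python
import timeit

import numpy as np

rng = np.random.default_rng(1)
n = 200_000
# Hypothetical in-memory stand-ins for the table columns.
col0, col1 = rng.normal(size=n), rng.normal(size=n)
col2 = rng.uniform(0, 30, size=n)
col3 = rng.normal(scale=5, size=n)
bins = np.linspace(-5, 30, 36)


def select_then_compute(threshold):
    """Filter rows first, then evaluate the variable on the survivors."""
    mask = (col2 > threshold) & (np.abs(col3) < 5)
    return np.histogram(col0[mask] + col1[mask] ** 2, bins=bins)[0]


def compute_then_weight(threshold):
    """Evaluate the variable on every row, then weight by the selection."""
    mask = (col2 > threshold) & (np.abs(col3) < 5)
    weights = mask.astype(float)
    return np.histogram(col0 + col1 ** 2, bins=bins, weights=weights)[0]


def time_both(threshold, repeats=3):
    """Return total wall-clock seconds for each strategy over `repeats` runs."""
    t_select = timeit.timeit(lambda: select_then_compute(threshold), number=repeats)
    t_weight = timeit.timeit(lambda: compute_then_weight(threshold), number=repeats)
    return t_select, t_weight


# A tight cut on col2 selects few rows; a loose cut selects many.
for label, threshold in [("tight cut", 29.9), ("loose cut", -1.0)]:
    t_select, t_weight = time_both(threshold)
    print(f"{label}: select-then-compute {t_select:.4f}s, "
          f"compute-then-weight {t_weight:.4f}s")
```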

I think something like
histogram(tables.Expr('col0 + col1**2',
mytable.where('(col2 > 15) & (abs(col3) < 5)')).eval())
would be ideal, but since where() returns a row iterator, and not
something that I can extract Column objects from, I don't see any way to
make it work.

So, am I missing some way to compute the histogram variable in the
numexpr kernel, but only for rows I'm interested in?
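
One direction that might get close is to process the table in chunks:
evaluate the condition on a chunk, evaluate the histogram variable only
for the rows that pass, and accumulate the counts into fixed bins.  The
sketch below illustrates the idea with NumPy arrays standing in for the
columns.  The function name chunked_histogram, the chunk size, and the
data are assumptions for illustration, and NumPy stands in for the
numexpr kernel.  PyTables also offers Table.read_where() (readWhere() in
older releases), which returns the selected rows as a NumPy array; that
route is not shown here.

```python
import numpy as np


def chunked_histogram(columns, bins, chunk_rows=65_536):
    """Histogram col0 + col1**2 over rows passing the cut, chunk by chunk."""
    n = len(columns["col0"])
    counts = np.zeros(len(bins) - 1, dtype=np.int64)
    for start in range(0, n, chunk_rows):
        stop = min(start + chunk_rows, n)
        # Evaluate the selection on this chunk only.
        c2 = columns["col2"][start:stop]
        c3 = columns["col3"][start:stop]
        mask = (c2 > 15) & (np.abs(c3) < 5)
        if not mask.any():
            continue  # nothing selected here: skip the expression entirely
        # Evaluate the histogram variable only for the selected rows.
        c0 = columns["col0"][start:stop][mask]
        c1 = columns["col1"][start:stop][mask]
        chunk_counts, _ = np.histogram(c0 + c1 ** 2, bins=bins)
        counts += chunk_counts
    return counts


# Hypothetical in-memory data standing in for the table columns.
rng = np.random.default_rng(2)
n = 200_000
columns = {
    "col0": rng.normal(size=n),
    "col1": rng.normal(size=n),
    "col2": rng.uniform(0, 30, size=n),
    "col3": rng.normal(scale=5, size=n),
}
bins = np.linspace(-5, 30, 36)
counts = chunked_histogram(columns, bins)
print(counts.sum())
```

Because the bin edges are fixed up front, the per-chunk counts simply
add up, so memory use is bounded by the chunk size rather than by the
table size.  Whether this beats where() or a full Expr evaluation on a
real table would have to be measured.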
Regards,
Jon

On 06/05/2012 09:45 PM, Anthony Scopatz wrote:
> Hello Jon, 
>
> I believe that the where() method just uses the Expr / numexpr
> functionality under the covers.  Anything that you can do in Expr you
> should be able to do in where().  Can you provide a short example
> where this is not the case?
>
> Be Well
> Anthony
>
> On Tue, Jun 5, 2012 at 6:17 PM, Jon Wilson <j...@fnal.gov
> <mailto:j...@fnal.gov>> wrote:
>
>     Hi all,
>     In looking through the docs, I see two very nice features: the
>     .where() query method, and the tables.Expr computation mechanism.
>     But, it doesn't appear to be possible to combine the two.  It
>     appears that, if I want to compute some function of my columns,
>     but only for certain rows, I have two options.
>      - I can use tables.Expr to compute the function, and then filter
>        the results in python
>      - I can use mytable.where() to select the rows I'm interested in,
>        and then compute the function in python
>
>     Am I missing anything?  Is it possible to perform fast out-of-core
>     computations with numexpr, but only on a subset of the existing rows?
>     Regards,
>     Jon Wilson
>
>     
> ------------------------------------------------------------------------------
>     Live Security Virtual Conference
>     Exclusive live event will cover all the ways today's security and
>     threat landscape has changed and how IT managers can respond.
>     Discussions
>     will include endpoint security, mobile security and the latest in
>     malware
>     threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
>     _______________________________________________
>     Pytables-users mailing list
>     Pytables-users@lists.sourceforge.net
>     <mailto:Pytables-users@lists.sourceforge.net>
>     https://lists.sourceforge.net/lists/listinfo/pytables-users
>

