On Wed, Jun 6, 2012 at 10:24 AM, Jon Wilson <j...@fnal.gov> wrote:

>  Hi Anthony,
>
>
> On 06/06/2012 12:45 AM, Anthony Scopatz wrote:
>
>
>> I think something like
>> histogram(tables.Expr('col0 + col1**2', mytable.where('(col2 > 15) &
>> (abs(col3) < 5)')).eval())
>> would be ideal, but since where() returns a row iterator, and not
>> something that I can extract Column objects from, I don't see any way to
>> make it work.
>>
>
>  You are probably looking for the readWhere() method
> <http://pytables.github.com/usersguide/libref.html#tables.Table.readWhere>,
> which normally returns a numpy structured array.  The line you are looking
> for is thus:
>
>  histogram(tables.Expr('col0 + col1**2', mytable.readWhere('(col2 > 15) &
> (abs(col3) < 5)')).eval())
>
>  This will likely be fast in both cases.  I hope this helps.
>
>
> Oddly, it doesn't work with tables.Expr, but does work with
> numexpr.evaluate.  In the case I talked about before with 7M rows, when
> selecting very few rows, it does just fine (between the other two
> solutions), but when selecting all rows, it is still about 2.75x slower
> than the technique of using tables.Expr for both the histogram var and the
> condition.
>
> I think that this is because .readWhere() pulls all the table rows
> satisfying the where condition into memory first, and it furthermore does
> so for all columns of all selected rows, so, for a table with many columns,
> it has to read many times as much data into memory.
>

Yes, that is correct: it does have to read the data into memory.


>  I can use the field parameter, but it only accepts one single field, so I
> would have to perform the query once per variable used in the histogram
> variable expression to do that.
>
> Using .readWhere() gives a medium-fast performance in both cases, but I
> still feel like it is not quite the right thing because it reads the data
> completely into memory instead of allowing the computation to be performed
> out-of-core.  Perhaps it is not really feasible,
>

Well, I think the issue at hand is that you are trying to support two
disparate cases with one expression: sparse and dense selection.  We have
tools for dealing with each of these cases individually and for performing
out-of-core calculations.  And if you know a priori which case you are going
to fall into, you can do the right thing.  So, without doing anything
special, I think medium-fast is probably the best and easiest thing that you
can expect right now.  (Though I would be delighted to be proved wrong on
this point.)
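
If you can fix the bin edges up front, one way to stay out-of-core in both
the sparse and the dense case is to accumulate the histogram chunk by chunk.
What follows is only a minimal sketch of that idea, not something PyTables
provides: chunked_histogram, chunk_rows, and the demo data are hypothetical
names made up for illustration, and a plain NumPy structured array stands in
for the table.  It assumes only that the source supports len() and slicing
that returns a structured array, which a PyTables Table does.

```python
import numpy as np


def chunked_histogram(table, bin_edges, chunk_rows=100_000):
    """Histogram col0 + col1**2 over rows passing the cut, chunk by chunk.

    `table` is anything that supports len() and slicing that returns a
    NumPy structured array (hypothetical stand-in for a PyTables Table).
    Only `chunk_rows` rows are held in memory at a time.
    """
    bin_edges = np.asarray(bin_edges, dtype=float)
    counts = np.zeros(len(bin_edges) - 1, dtype=np.int64)
    for start in range(0, len(table), chunk_rows):
        chunk = table[start:start + chunk_rows]
        # Same selection as the query in this thread, applied per chunk.
        mask = (chunk['col2'] > 15) & (np.abs(chunk['col3']) < 5)
        selected = chunk[mask]
        values = selected['col0'] + selected['col1'] ** 2
        # Fixed bin edges make the per-chunk counts directly addable.
        chunk_counts, _ = np.histogram(values, bins=bin_edges)
        counts += chunk_counts
    return counts, bin_edges


# Demo on synthetic in-memory data (assumed column names and distributions).
rng = np.random.default_rng(0)
n = 10_000
data = np.zeros(n, dtype=[('col0', 'f8'), ('col1', 'f8'),
                          ('col2', 'f8'), ('col3', 'f8')])
data['col0'] = rng.normal(size=n)
data['col1'] = rng.normal(scale=3.0, size=n)
data['col2'] = rng.uniform(0.0, 30.0, size=n)
data['col3'] = rng.normal(scale=5.0, size=n)

edges = np.linspace(-5.0, 50.0, 56)
counts, _ = chunked_histogram(data, edges, chunk_rows=1024)
```

The trade-off is that every chunk still reads all columns, so this does
nothing for the many-columns cost Jon describes; it only bounds the memory
used, whatever fraction of rows the condition selects.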


> but I think the ideal would be to have a .where type query operator that
> returns Column objects or a Table object, with a "view" imposed in either
> case.
>

We are very open to pull requests if you come up with an implementation of
this that you like more ;).

Be Well
Anthony


> Regards,
> Jon
>
>
> ------------------------------------------------------------------------------
> Live Security Virtual Conference
> Exclusive live event will cover all the ways today's security and
> threat landscape has changed and how IT managers can respond. Discussions
> will include endpoint security, mobile security and the latest in malware
> threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
> _______________________________________________
> Pytables-users mailing list
> Pytables-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/pytables-users
>
>