On Mon, Nov 19, 2012 at 12:59 PM, Jon Wilson <j...@fnal.gov> wrote:

>  Hi Anthony,
>
> On 11/17/2012 11:49 AM, Anthony Scopatz wrote:
>
>  Hi Jon,
>
>  Barring changes to numexpr itself, this is exactly what I am suggesting.
> Well, either writing one query expr per bin or (more cleverly) writing
> one expr which, when evaluated for a row, returns the integer bin number
> (1, 2, 3, ...) that row falls in.  Then you can simply count() for each
> bin number.
>
>  For example, if you wanted to histogram data ranging over [0, 100] into
> 10 bins, then the expr "r/10", cast to an int dtype, would do the trick.
> This has the advantage of only running over the data once.  (Also, I am
> not convinced that running over the data multiple times is less efficient
> than doing row-based iteration.  You would have to test it on your data
> to find out.)
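>
>  As a minimal sketch of the idea (assuming the column has already been
> read into a NumPy array r, and that numexpr is installed):
>
> import numpy as np
> import numexpr as ne
>
> # evaluate the binning expr once over the data; truncating to int
> # maps each value in [0, 100] to a bin number
> bin_numbers = ne.evaluate("r / 10").astype(int)
> bin_numbers = np.clip(bin_numbers, 0, 9)  # fold the r == 100 edge case in
> counts = np.bincount(bin_numbers, minlength=10)  # one count() per bin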
>
>
>>  It is a reduction operation, and would greatly benefit from chunking, I
>> expect. Not unlike sum(), which is implemented as a specially supported
>> reduction operation inside numexpr (buggily, last I checked). I suspect
>> that a substantial improvement in histogramming requires direct support
>> from either pytables or from numexpr. I don't suppose that there might be a
>> chunked-reduction interface exposed somewhere that I could hook into?
>>
>
>  This is definitely a feature to request from numexpr.
>
> I've been fiddling around with Stephen's code a bit, and it looks like the
> best way to do things is to read the data in chunks (whether of exactly
> table.chunksize rows or not is a matter for optimization) and histogram
> each chunk.  Combining the per-chunk histograms is then a trivial sum
> operation.  This type of approach can be applied generically in many
> cases, I suspect, where row-by-row iteration is prohibitively slow but the
> dataset is too large to fit into memory.  As I understand it, this idea is
> the primary win of PyTables in the first place!
>
> So, I think it would be extraordinarily helpful to provide a
> chunked-iteration interface for this sort of use case.  It can be as simple
> as a wrapper around Table.read():
>
> class Table:
>     def chunkiter(self, field=None):
>         # self.chunksize: rows per chunk (in practice, self.chunkshape[0])
>         n = 0
>         while n * self.chunksize < self.nrows:
>             yield self.read(n * self.chunksize, (n + 1) * self.chunksize,
>                             field=field)
>             n += 1
>
> Then I can write something like
>
> from numpy import linspace, histogram
> bins = linspace(-1, 1, 101)
> hist = sum(histogram(chunk, bins=bins)[0]  # histogram() returns (counts, edges)
>            for chunk in mytable.chunkiter(myfield))
>
> Preliminary tests seem to indicate that, for a table with 1 column and 10M
> rows, reading in "chunks" of 10x chunksize gives the best
> read-time-per-row.  This is perhaps naive as regards chunksize black magic,
> though...
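>
> A probe along these lines would measure it (a rough sketch; assumes
> mytable is open, myfield names its column, and Table.chunkshape holds
> the on-disk chunk length):
>
> import time
>
> rows_per_chunk = mytable.chunkshape[0]
> for mult in (1, 2, 5, 10, 20):
>     step = mult * rows_per_chunk
>     t0 = time.time()
>     for start in range(0, mytable.nrows, step):
>         mytable.read(start, min(start + step, mytable.nrows), field=myfield)
>     elapsed = time.time() - t0
>     print("%2dx chunksize: %.3g s/row" % (mult, elapsed / mytable.nrows))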
>

Hello Jon,

Sorry about the slow reply, but I think that what is proposed in issue #27
[1] would solve the above by default, right?  Maybe you could pull Josh's
code and test it on the above example to make sure.  And then we could go
ahead and merge this in :).


> And of course, if implemented by numexpr, it could benefit from the nice
> automatic multithreading there.
>

This would be nice, but as you point out, not totally necessary here.


>
> Also, I might dig in a bit and see about extending the "field" argument to
> read so it can read multiple fields at once (to do N-dimensional
> histograms), as you suggested in a previous mail some months ago.
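>
> In the meantime, one can fake it by reading whole rows per chunk and
> indexing the resulting structured array (sketch only; the column names
> x and y are made up):
>
> import numpy as np
>
> bins = [np.linspace(-1, 1, 101)] * 2
> hist2d = 0
> step = 10 * mytable.chunkshape[0]
> for start in range(0, mytable.nrows, step):
>     rows = mytable.read(start, min(start + step, mytable.nrows))
>     h, _, _ = np.histogram2d(rows['x'], rows['y'], bins=bins)
>     hist2d = hist2d + h  # accumulate the per-chunk 2-D counts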
>

Also super cool, but not immediate ;)

Be Well
Anthony

1. https://github.com/PyTables/PyTables/issues/27


> Best Regards,
> Jon
>