Hi Anthony,
On 11/17/2012 11:49 AM, Anthony Scopatz wrote:
Hi Jon,
Barring changes to numexpr itself, this is exactly what I am
suggesting. Well, either writing one query expr per bin or (more
cleverly) writing one expr which, when evaluated for a row, returns the
integer bin number (1, 2, 3, ...) that the row falls in. Then you can
simply count() for each bin number.
For example, if you wanted to histogram data running over [0, 100]
into 10 bins, then the expr "r/10", cast to an integer dtype, would do
the trick. This has the advantage of only running over the data once.
(Also, I am not convinced that running over the data multiple times
is less efficient than doing row-based iteration. You would have to
test it on your data to find out.)
It is a reduction operation, and would greatly benefit from
chunking, I expect. Not unlike sum(), which is implemented as a
specially supported reduction operation inside numexpr (buggily,
last I checked). I suspect that a substantial improvement in
histogramming requires direct support from either pytables or from
numexpr. I don't suppose that there might be a chunked-reduction
interface exposed somewhere that I could hook into?
This is definitely a feature to request from numexpr.
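Just to make sure I follow, the bin-number expr you describe above would
be something like the sketch below (using numexpr directly on a plain
NumPy array r of values in [0, 100]; inside PyTables the same expression
could presumably be pushed through tables.Expr chunk by chunk):

import numpy as np
import numexpr as ne

# stand-in for the column being histogrammed
r = np.random.uniform(0, 100, size=1000000)

# one pass over the data: map each value to its bin index 0..9
bin_idx = ne.evaluate("r / 10").astype(np.intp)
bin_idx[bin_idx == 10] = 9   # fold the endpoint 100 into the last bin

# counting each bin number is then a single bincount (or one count() per bin)
counts = np.bincount(bin_idx, minlength=10)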
I've been fiddling around with Stephen's code a bit, and it looks like
the best approach is to read the data in chunks (whether of exactly
table.chunksize rows or some multiple is a matter for optimization) and
histogram each chunk as it comes in; combining the per-chunk histograms
is then a trivial sum. I suspect this type of approach can be applied
generically in many cases where row-by-row iteration is prohibitively
slow but the dataset is too large to fit into memory. As I understand
it, this idea is the primary win of PyTables in the first place!
So, I think it would be extraordinarily helpful to provide a
chunked-iteration interface for this sort of use case. It can be as
simple as a wrapper around Table.read():
class Table:
    def chunkiter(self, field=None):
        # yield the table one chunk-sized block of rows at a time
        n = 0
        while n * self.chunksize < self.nrows:
            yield self.read(n * self.chunksize, (n + 1) * self.chunksize,
                            field=field)
            n += 1
Then I can write something like
bins = linspace(-1, 1, 101)
hist = sum(histogram(chunk, bins=bins)[0]   # histogram() returns (counts, edges)
           for chunk in mytable.chunkiter(myfield))
Preliminary tests seem to indicate that, for a table with 1 column and
10M rows, reading in "chunks" of 10x chunksize gives the best
read-time-per-row. This is perhaps naive as regards chunksize black
magic, though...
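For concreteness, a rough sketch of the kind of timing loop I mean
(mytable and myfield are the placeholders from above; chunkshape[0] is
the table's HDF5 chunk length, and the multiplier is the thing being
swept):

import time

def time_per_row(table, multiple, field=None):
    # read the whole table in blocks of (multiple * chunk-length) rows
    step = multiple * table.chunkshape[0]
    start = time.time()
    for begin in range(0, table.nrows, step):
        table.read(begin, min(begin + step, table.nrows), field=field)
    return (time.time() - start) / table.nrows

for m in (1, 2, 5, 10, 20, 50):
    print(m, time_per_row(mytable, m, field=myfield))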
And of course, if implemented by numexpr, it could benefit from the nice
automatic multithreading there.
Also, I might dig in a bit and see about extending the "field" argument
of read() so it can read multiple fields at once (to do N-dimensional
histograms), as you suggested in a previous mail some months ago.
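Until then, a 2-D histogram can already be pieced together from two
single-field reads per chunk; a sketch, assuming hypothetical columns
"x" and "y" and the same 10x-chunksize blocks as above:

import numpy as np

edges = np.linspace(-1, 1, 101)
hist2d = np.zeros((100, 100))

step = 10 * mytable.chunkshape[0]
for begin in range(0, mytable.nrows, step):
    stop = min(begin + step, mytable.nrows)
    x = mytable.read(begin, stop, field="x")
    y = mytable.read(begin, stop, field="y")
    counts, _, _ = np.histogram2d(x, y, bins=[edges, edges])
    hist2d += counts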
Best Regards,
Jon