Hi Anthony,


On 11/17/2012 11:49 AM, Anthony Scopatz wrote:
Hi Jon,

Barring changes to numexpr itself, this is exactly what I am suggesting. Well, either writing one query expr per bin or (more cleverly) writing one expr which, when evaluated for a row, returns the integer bin number (1, 2, 3, ...) that row falls in. Then you can simply count() for each bin number.

For example, if you wanted to histogram data ranging over [0, 100] into 10 bins, then the expr "r/10" cast to an integer dtype would do the trick. This has the advantage of only running over the data once. (Also, I am not convinced that running over the data multiple times is less efficient than doing row-based iteration. You would have to test it on your data to find out.)

    It is a reduction operation, and would greatly benefit from
    chunking, I expect. Not unlike sum(), which is implemented as a
    specially supported reduction operation inside numexpr (buggily,
    last I checked). I suspect that a substantial improvement in
    histogramming requires direct support from either pytables or from
    numexpr. I don't suppose that there might be a chunked-reduction
    interface exposed somewhere that I could hook into?


This is definitely a feature to request from numexpr.
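
If I follow the single-pass idea correctly, something like this rough sketch would produce the per-bin counts in one pass (assuming 'mytable' is an already-open Table with a float column 'r' holding values in [0, 100); tables.Expr hands the expression off to numexpr):

import numpy as np
import tables as tb

# Map each row to its bin index 0..9 in a single pass over the data,
# then count the rows falling in each bin.
binidx = tb.Expr("r / 10", uservars={"r": mytable.cols.r}).eval().astype("int64")
hist = np.bincount(binidx, minlength=10)
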
I've been fiddling around with Stephen's code a bit, and it looks like the best way to do things is to read chunks of the data in at a time (whether exactly of table.chunksize or not is a matter for optimization) and create histograms of those chunks. Combining the histograms is then a trivial sum operation. This type of approach can be applied generically in many cases, I suspect, where row-by-row iteration is prohibitively slow but the dataset is too large to fit into memory. As I understand it, this idea is the primary win of PyTables in the first place!

So, I think it would be extraordinarily helpful to provide a chunked-iteration interface for this sort of use case. It can be as simple as a wrapper around Table.read():

class Table:
    def chunkiter(self, field=None):
        # Yield the table contents one chunk-sized block at a time.
        n = 0
        while n*self.chunksize < self.nrows:
            yield self.read(n*self.chunksize, (n+1)*self.chunksize, field=field)
            n += 1

Then I can write something like
bins = linspace(-1, 1, 101)
hist = sum(histogram(chunk, bins=bins)[0] for chunk in mytable.chunkiter(myfield))
(the [0] because numpy's histogram returns a (counts, edges) tuple)

Preliminary tests seem to indicate that, for a table with 1 column and 10M rows, reading in "chunks" of 10x chunksize gives the best read-time-per-row. This is perhaps naive as regards chunksize black magic, though...
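
Roughly, the test looks like this (a sketch; the helper name is made up, 'table' is the open Table and 'field' the column being read):

import time

def time_chunk_reads(table, field, multipliers=(1, 2, 5, 10, 20)):
    # Read the whole column in blocks of multiplier * (rows per on-disk chunk)
    # and report seconds per row for each multiplier.
    rows_per_chunk = table.chunkshape[0]
    results = {}
    for m in multipliers:
        step = m * rows_per_chunk
        t0 = time.time()
        for start in range(0, table.nrows, step):
            table.read(start, min(start + step, table.nrows), field=field)
        results[m] = (time.time() - t0) / table.nrows
    return results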

And of course, if implemented by numexpr, it could benefit from the nice automatic multithreading there.

Also, I might dig in a bit and see about extending the "field" argument to read() so it can read multiple fields at once (to do N-dimensional histograms), as you suggested in a previous mail some months ago. A sketch of how that could be faked in the meantime is below.
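
For now, a chunked 2-D histogram can be built by reading whole rows and stacking the columns by hand (a rough sketch; the field names 'x' and 'y' and the bin edges are made up):

import numpy as np

edges = [np.linspace(-1, 1, 101), np.linspace(-1, 1, 101)]
hist2d = np.zeros((100, 100))
step = 10 * table.chunkshape[0]
for start in range(0, table.nrows, step):
    rows = table.read(start, min(start + step, table.nrows))  # structured array with all fields
    sample = np.column_stack([rows['x'], rows['y']])
    counts, _ = np.histogramdd(sample, bins=edges)
    hist2d += counts
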
Best Regards,
Jon