Looks to me like Rick's version is simpler and faster. It gives a speed-up of about 1.6x over the weave version on my machine. I believe this is because the sorting approach results in quite a few fewer comparisons than the algorithm I used.
Very cool. I vote that his version go into numpy.

eric

Rick White wrote:
> On Dec 12, 2006, at 10:27 PM, Cameron Walsh wrote:
>
>> I'm trying to generate histograms of extremely large datasets. I've
>> tried a few methods, listed below, all with their own shortcomings.
>> Mailing-list archive and google searches have not revealed any
>> solutions.
>
> The numpy.histogram function can be modified to use memory much more
> efficiently when the input array is large, and the modification turns
> out to be faster even for smallish arrays (in my tests, anyway).
> Below is a modified version of the histogram function from
> function_base.py. It is almost identical, but it avoids doing the
> sort of the entire input array simply by dividing it into blocks.
> (It would be even better to avoid the call to ravel too.) The only
> other messy detail is that the builtin range function is shadowed by
> the 'range' parameter.
>
> In my timing tests this is about the same speed for arrays about the
> same size as the block size and is faster than the current version by
> 30-40% for large arrays. The speed difference increases as the array
> size increases.
>
> I haven't compared this to Eric's weave function, but this has the
> advantages of being pure Python and of being much simpler. On my
> machine (MacBook Pro) it takes about 4 seconds for an array with 100
> million elements. The time increases perfectly linearly with array
> size for arrays larger than a million elements.
>
> Rick
>
> from numpy import *
>
> lrange = range
>
> def histogram(a, bins=10, range=None, normed=False):
>     a = asarray(a).ravel()
>     if not iterable(bins):
>         if range is None:
>             range = (a.min(), a.max())
>         mn, mx = [mi+0.0 for mi in range]
>         if mn == mx:
>             mn -= 0.5
>             mx += 0.5
>         bins = linspace(mn, mx, bins, endpoint=False)
>
>     # best block size probably depends on processor cache size
>     block = 65536
>     n = sort(a[:block]).searchsorted(bins)
>     for i in lrange(block, len(a), block):
>         n += sort(a[i:i+block]).searchsorted(bins)
>     n = concatenate([n, [len(a)]])
>     n = n[1:] - n[:-1]
>
>     if normed:
>         db = bins[1] - bins[0]
>         return 1.0/(a.size*db) * n, bins
>     else:
>         return n, bins
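For readers who want to try the quoted function, here is a minimal usage and timing sketch. It is not from the original thread: it assumes Rick's block-based histogram() above has already been executed in the current session (along with its "from numpy import *"), and the array size, bin count, and variable names are arbitrary illustrative choices.

    # Usage/timing sketch (editorial addition, not from the thread).
    # Assumes the block-based histogram() defined above is in scope.
    import time
    import numpy as np

    a = np.random.normal(size=10000000)   # ~10 million float64 samples

    t0 = time.time()
    n_block, edges_block = histogram(a, bins=100)   # block-based version above
    t1 = time.time()
    n_ref, edges_ref = np.histogram(a, bins=100)    # NumPy's built-in histogram
    t2 = time.time()

    print("block-based histogram: %.2f s" % (t1 - t0))
    print("numpy.histogram:       %.2f s" % (t2 - t1))
    # The two handle values falling exactly on a bin edge slightly differently,
    # so a tiny discrepancy at the boundaries would not be alarming.
    print("max count difference:", abs(n_block - n_ref).max())

The absolute timings will depend heavily on the NumPy version and machine; the point of the sketch is only to show that the block-based function is called exactly like the built-in one and produces comparable counts.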