On Dec 12, 2006, at 10:27 PM, Cameron Walsh wrote: > I'm trying to generate histograms of extremely large datasets. I've > tried a few methods, listed below, all with their own shortcomings. > Mailing-list archive and google searches have not revealed any > solutions.
The numpy.histogram function can be modified to use memory much more efficiently when the input array is large, and the modification turns out to be faster even for smallish arrays (in my tests, anyway). Below is a modified version of the histogram function from function_base.py. It is almost identical, but it avoids doing the sort of the entire input array simply by dividing it into blocks. (It would be even better to avoid the call to ravel too.) The only other messy detail is that the builtin range function is shadowed by the 'range' parameter. In my timing tests this is about the same speed for arrays about the same size as the block size and is faster than the current version by 30-40% for large arrays. The speed difference increases as the array size increases. I haven't compared this to Eric's weave function, but this has the advantages of being pure Python and of being much simpler. On my machine (MacBook Pro) it takes about 4 seconds for an array with 100 million elements. The time increases perfectly linearly with array size for arrays larger than a million elements. Rick from numpy import * lrange = range def histogram(a, bins=10, range=None, normed=False): a = asarray(a).ravel() if not iterable(bins): if range is None: range = (a.min(), a.max()) mn, mx = [mi+0.0 for mi in range] if mn == mx: mn -= 0.5 mx += 0.5 bins = linspace(mn, mx, bins, endpoint=False) # best block size probably depends on processor cache size block = 65536 n = sort(a[:block]).searchsorted(bins) for i in lrange(block,len(a),block): n += sort(a[i:i+block]).searchsorted(bins) n = concatenate([n, [len(a)]]) n = n[1:]-n[:-1] if normed: db = bins[1] - bins[0] return 1.0/(a.size*db) * n, bins else: return n, bins _______________________________________________ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion