Re: [Numpy-discussion] Histograms of extremely large data sets

eric jones Thu, 14 Dec 2006 11:25:25 -0800


Rick White wrote:

Just so we don't get too smug about the speed, if I do this in IDL on the same machine it is 10 times faster (0.28 seconds instead of 4 seconds). I'm sure the IDL version uses the much faster approach of just sweeping through the array once, incrementing counts in the appropriate bins. It only handles equal-sized bins, so it is not as general as the numpy version -- but equal-sized bins is a very common case. I'd still like to see a C version of histogram (which I guess would need to be a ufunc) go into the core numpy.

Yes, this gets rid of the search, and indices can just be calculated from offsets. I've attached a modified weaved histogram that takes this approach. Running the snippet below on my machine takes 0.118 sec for the evenly binned weave algorithm and 0.385 sec for Rick's algorithm on 5 million elements. That is close to 4x faster (but not 10x...), so there is indeed some speed to be gained for the common special case. I don't know if the code I wrote has a 2x gain left in it, but I've spent zero time optimizing it. I'd bet it can be improved substantially.
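
For comparison, the offset idea can also be sketched in plain NumPy, without weave. This is only an illustration and not the attached code: the name `even_histogram` is made up here, the data is converted to float64 first (an assumption made so that unsigned types such as uint8 do not wrap around when `bin_start` is subtracted), and the counting is left to `numpy.bincount`.

```python
import numpy as np

def even_histogram(data, bin_start, bin_size, bin_count):
    """Equal-width histogram; hypothetical helper, not the weave version."""
    # Work in float64 so that (value - bin_start) cannot wrap for unsigned types.
    data = np.asarray(data, dtype=np.float64).ravel()
    # Bin index straight from the offset: no search over bin edges is needed.
    idx = np.floor((data - bin_start) / float(bin_size)).astype(np.intp)
    # Ignore values outside [bin_start, bin_start + bin_count * bin_size).
    idx = idx[(idx >= 0) & (idx < bin_count)]
    return np.bincount(idx, minlength=bin_count)
```

The float64 copy costs extra memory on very large arrays, so this is likely slower than a single C sweep; it is meant to show the index calculation, not to compete with it.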


eric

### test_weave_even_histogram.py

from numpy import arange, product, sum, zeros, uint8
from numpy.random import randint

import weave_even_histogram

import time

shape = 1000,1000,5
size = product(shape)
data = randint(0,256,size).astype(uint8)
bins = arange(256+1)

print 'type:', data.dtype
print 'millions of elements:', size/1e6

bin_start = 0
bin_size = 1
bin_count = 256
t1 = time.clock()
res = weave_even_histogram.histogram(data, bin_start, bin_size, bin_count)
t2 = time.clock()
print 'sec (evenly spaced):', t2-t1, sum(res)
print res

                                        Rick
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion

### weave_even_histogram.py

from numpy import zeros, asarray, int32
from scipy import weave
from typed_array_converter import converters

def histogram(ary, bin_start, bin_size, bin_count):
    
    ary = asarray(ary)
    
    # Create an array to hold the histogram count results.
    results = zeros(bin_count,dtype=int32)
    
    # The C++ code that actually does the histogramming.    
    code = """
           PyArrayIterObject *iter = (PyArrayIterObject*)PyArray_IterNew(py_ary);
           
           while(iter->index < iter->size)
           {
           
               //////////////////////////////////////////////////////////
               // bin index calculated from the offset (no search needed)
               //////////////////////////////////////////////////////////
               
               // This requires an update to weave 
               ary_data_type value = *((ary_data_type*)iter->dataptr);
               if (value>=bin_start)
               {
                   int bin_index = (int)((value-bin_start)/bin_size);
               
                   //////////////////////////////////////////////////////////
                   // Bin counter increment
                   //////////////////////////////////////////////////////////
    
                   // If the value was found, increment the counter for that bin.
                   if (bin_index < bin_count)
                   {
                       results[bin_index]++;
                   }    
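               }
               // Always advance, even for values below bin_start;
               // otherwise the loop never terminates on such a value.
               PyArray_ITER_NEXT(iter);
           }
           // Release the iterator created by PyArray_IterNew.
           Py_DECREF(iter);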
           """
    weave.inline(code, ['ary', 'bin_start', 'bin_size','bin_count', 'results'], 
                 type_converters=converters, 
                 compiler='gcc')
                 
    return results
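
One way to sanity-check an evenly binned routine like the one above is to compare it with `numpy.histogram` on equally spaced edges. The sketch below is an assumption about how such a check might look and is not part of the attached files: `check_against_numpy` is a hypothetical name, and `hist_func` stands for any callable taking `(ary, bin_start, bin_size, bin_count)`.

```python
import numpy as np

def check_against_numpy(hist_func, data, bin_start, bin_size, bin_count):
    """Return True if hist_func agrees with numpy.histogram on equal-width bins."""
    data = np.asarray(data)
    # bin_count + 1 equally spaced edges covering the binned range.
    edges = bin_start + bin_size * np.arange(bin_count + 1)
    expected, _ = np.histogram(data, bins=edges)
    got = np.asarray(hist_func(data, bin_start, bin_size, bin_count))
    return np.array_equal(got, expected)
```

Note that `numpy.histogram` counts a value equal to the last edge in the final bin, while the loop above skips it, so the two can differ for data sitting exactly on the upper boundary. With uint8 data and 256 unit-width bins starting at 0 that case cannot occur.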
