Re: [Numpy-discussion] A way to compute a 2d histogram

David Huard Tue, 24 Oct 2006 13:40:36 -0700

Xavier,
Here is the patch against svn. Please report any bug. I haven't had the time to test it extensively, something that should be done before commiting the patch to the repo. I'd appreciate your feedback.

David

2006/10/24, David Huard <[EMAIL PROTECTED]>:

Hi Xavier,

You could tweak histogram2d to do what you want, or you could give me a couple of days and I'll do it and let you know. If you want to help, you could write a test using your particular application and data.

David

2006/10/24, Xavier Gnata < [EMAIL PROTECTED]>:
Hi,

I have a set of 3 1D large arrays.
The first 2 one stand for the coordinates of particules and the last one
for their masses.
I would like to be able to plot this data ie to compute a 2D histogram
summing the masses in each bin.
I cannot find a way to do that without any loop on the indices resulting
too a very slow function.

I'm looking for an elegant way to do that with numpy (or scipy??) function.

For instance, scipy.histogram2d cannot do the job because it only counts
the number of samples in each bin.
There is no way to deal with weights.

Xavier.

--
############################################
Xavier Gnata
CRAL - Observatoire de Lyon
9, avenue Charles André
69561 Saint Genis Laval cedex
Phone: +33 4 78 86 85 28
Fax: +33 4 78 86 83 86
E-mail: [EMAIL PROTECTED]
############################################

-------------------------------------------------------------------------
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/numpy-discussion

Index: numpy/lib/twodim_base.py
===================================================================
--- numpy/lib/twodim_base.py	(rÃ©vision 3327)
+++ numpy/lib/twodim_base.py	(copie de travail)
@@ -143,105 +143,44 @@
         X[:,i] = x**(N-i-1)
     return X
 
-def histogram2d(x,y, bins=10, range=None, normed=False):
+    
+def histogram2d(x,y, bins=10, range=None, normed=False, weights=None):
     """histogram2d(x,y, bins=10, range=None, normed=False) -> H, xedges, yedges
     
     Compute the 2D histogram from samples x,y. 
 
     Parameters
     ----------
-    x,y: 1D data series. Both arrays must have the same length.
+    x,y: sample arrays.
     bins: Number of bins -or- [nbin x, nbin y] -or- 
          [bin edges] -or- [x bin edges, y bin edges].
     range:  A sequence of lower and upper bin edges (default: [min, max]).
-    normed: True or False. 
+    normed:  Boolean, if False, return the number of samples in each bin,
+             if True, returns the density.
+    weights: An array of weights. The weights are normed only if normed is True. 
+             Should weights.sum() not equal N, the total bin count will 
+             not be equal to the number of samples.
     
-    The histogram array is a count of the number of samples in each 
-    two dimensional bin. 
-    Setting normed to True returns a density rather than a bin count. 
+    Output
+    ------
+    hist:    Histogram array.
+    xedges, yedges: Arrays defining the bin edges. 
+    
+    Example:
+    x = random.randn(100,2)
+    hist2d, edges = histogram2d(x, bins = (6, 7))
+    
+    See also: histogramdd
     """
-    import numpy as np
+    from numpy import histogramdd
+    
     try:
         N = len(bins)
     except TypeError:
         N = 1
         bins = [bins]
-    x = asarray(x)
-    y = asarray(y)
-    if range is None:
-        xmin, xmax = x.min(), x.max()
-        ymin, ymax = y.min(), y.max()
-    else:
-        xmin, xmax = range[0]
-        ymin, ymax = range[1]
-    if N == 2:
-        if np.isscalar(bins[0]):
-            xnbin = bins[0]
-            xedges = np.linspace(xmin, xmax, xnbin+1)
-        else:
-            xedges = asarray(bins[0], float)
-            xnbin = len(xedges)-1
-        if np.isscalar(bins[1]):
-            ynbin = bins[1]
-            yedges = np.linspace(ymin, ymax, ynbin+1)
-        else:
-            yedges = asarray(bins[1], float)
-            ynbin = len(yedges)-1
-    elif N == 1:
-        ynbin = xnbin = bins[0]
-        xedges = np.linspace(xmin, xmax, xnbin+1)
-        yedges = np.linspace(ymin, ymax, ynbin+1)
-    else:
-        yedges = asarray(bins, float)
-        xedges = yedges.copy()
-        ynbin = len(yedges)-1
-        xnbin = len(xedges)-1
-    
-    dxedges = np.diff(xedges)
-    dyedges = np.diff(yedges)
-    
-    # Flattened histogram matrix (1D)
-    hist = np.zeros((xnbin)*(ynbin), int)
-
-    # Count the number of sample in each bin (1D)
-    xbin = np.digitize(x,xedges) 
-    ybin = np.digitize(y,yedges) 
- 
-    # Values that fall on an edge are put in the right bin.
-    # For the rightmost bin, we want values equal to the right 
-    # edge to be counted in the last bin, and not as an outlier.
-    xdecimal = int(-np.log10(dxedges.min()))+6
-    ydecimal = int(-np.log10(dyedges.min()))+6
-    on_edge_x = np.where(np.around(x,xdecimal) == np.around(xedges[-1], xdecimal))[0]
-    on_edge_y = np.where(np.around(y,ydecimal) == np.around(yedges[-1], ydecimal))[0]
-    xbin[on_edge_x] -= 1
-    ybin[on_edge_y] -= 1
-    # Remove the true outliers
-    outliers = (xbin==0) | (xbin==xnbin+1) | (ybin==0) | (ybin == ynbin+1)
-    xbin = xbin[outliers==False] - 1
-    ybin = ybin[outliers==False] - 1
-    
-    # Compute the sample indices in the flattened histogram matrix.
-    if xnbin >= ynbin:
-        xy = ybin*(xnbin) + xbin
-        
-    else:
-        xy = xbin*(ynbin) + ybin
-        
-       
-    # Compute the number of repetitions in xy and assign it to the flattened
-    #  histogram matrix.
-
-    flatcount = np.bincount(xy)
-    indices = np.arange(len(flatcount))
-    hist[indices] = flatcount
-
-    shape = np.sort([xnbin, ynbin])
-    # Shape into a proper matrix
-    histmat = hist.reshape(shape)
-    if (shape == (ynbin, xnbin)).all():
-        histmat = histmat.T
-    if normed:
-        diff2 = np.outer(dxedges, dyedges)
-        histmat = histmat / diff2 / histmat.sum()
-    return histmat, xedges, yedges
+    if N != 1 and N != 2:
+        xedges = yedges = asarray(bins, float)
+        bins = [xedges, yedges]
+    hist, edges = histogramdd([x,y], bins, range, normed, weights)
+    return hist, edges[0], edges[1]
Index: numpy/lib/tests/test_function_base.py
===================================================================
--- numpy/lib/tests/test_function_base.py	(rÃ©vision 3327)
+++ numpy/lib/tests/test_function_base.py	(copie de travail)
@@ -366,7 +366,7 @@
         [.5, .5, 1.5], [.5, 1.5, 2.5], [.5, 2.5, 2.5]])
         H, edges = histogramdd(x, (2,3,3), range = [[-1,1], [0,3], [0,3]])
         answer = asarray([[[0,1,0], [0,0,1], [1,0,0]], [[0,1,0], [0,0,1], [0,0,1]]])
-        assert(all(H == answer))
+        assert_array_equal(H,answer)
         # Check normalization
         ed = [[-2,0,2], [0,1,2,3], [0,1,2,3]]
         H, edges = histogramdd(x, bins = ed, normed = True)
@@ -384,7 +384,17 @@
                           [[0,0],[0,0],[0,0]]])
         assert_array_equal(H, answer)
                 
-
+    def check_weights(self):
+        v = rand(100,2)
+        hist, edges = histogramdd(v)
+        n_hist, edges = histogramdd(v, normed=True)
+        w_hist, edges = histogramdd(v, weights=ones(100))
+        assert_array_equal(w_hist, hist)
+        w_hist, edges = histogramdd(v, weights=ones(100)*2, normed=True)
+        assert_array_equal(w_hist, n_hist)
+        w_hist, edges = histogramdd(v, weights=ones(100, int)*2)
+        assert_array_equal(w_hist, 2*hist)
+            
 class test_unique(NumpyTestCase):
     def check_simple(self):
         x = array([4,3,2,1,1,2,3,4, 0])
Index: numpy/lib/function_base.py
===================================================================
--- numpy/lib/function_base.py	(rÃ©vision 3327)
+++ numpy/lib/function_base.py	(copie de travail)
@@ -103,45 +103,52 @@
     else:
         return n, bins
 
-def histogramdd(sample, bins=10, range=None, normed=False):
-    """histogramdd(sample, bins = 10, range = None, normed = False) -> H, edges
+def histogramdd(sample, bins=10, range=None, normed=False, weights=None):
+    """histogramdd(sample, bins=10, range=None, normed=False, weights=None) 
+                                                                -> hist, edges
     
-    Return the D-dimensional histogram computed from sample.
+    Return the D-dimensional histogram of the sample.
     
     Parameters
     ----------
-    sample: A sequence of D arrays, or an NxD array. 
-    bins:   A sequence of edge arrays, or a sequence of the number of bins. 
-            If a scalar is given, it is assumed to be the number of bins
-            for all dimensions. 
-    range:  A sequence of lower and upper bin edges (default: [min, max]).
-    normed: If False, returns the number of samples in each bin. 
-            If True, returns the frequency distribution.
+    sample:  A sequence of D arrays, or an NxD array. 
+    bins:    A sequence of edge arrays, 
+             or a sequence of the number of bins, 
+             or a scalar, the number of bins for all dimensions. 
+    range:   A sequence of lower and upper bin edges (default: [min, max]).
+    normed:  Boolean, if False, return the number of samples in each bin,
+             if True, returns the density.
+    weights: An array of weights. The weights are normed only if normed is True. 
+             Should weights.sum() not equal N, the total bin count will 
+             not be equal to the number of samples.
     
     
     Output
     ------
-    H:      Histogram array.
-    edges:  List of arrays defining the bin edges. 
+    hist:    Histogram array.
+    edges:   List of arrays defining the bin edges. 
     
     Example:
     x = random.randn(100,3)
-    H, edges = histogramdd(x, bins = (5, 6, 7))
+    hist3d, edges = histogramdd(x, bins = (5, 6, 7))
     
     See also: histogram
     """
     
-    try:
+    try: 
+        # Sample is an ND-array.
         N, D = sample.shape
-    except (AttributeError, ValueError):
-        ss = atleast_2d(sample)
-        sample = ss.transpose()
+    except (AttributeError, ValueError): 
+        # Sample is a sequence of 1D arrays.
+        sample = atleast_2d(sample).T
         N, D = sample.shape
     
     nbin = empty(D, int)
     edges = D*[None]
     dedges = D*[None]
-
+    if weights is not None:
+        weights = asarray(weights)
+    
     try:
         M = len(bins)
         if M != D:
@@ -149,6 +156,8 @@
     except TypeError:
         bins = D*[bins]
         
+    # Select range for each dimension
+    # Used only if number of bins is given.
     if range is None:
         smin = atleast_1d(sample.min(0))
         smax = atleast_1d(sample.max(0))
@@ -158,54 +167,51 @@
         for i in arange(D):
             smin[i], smax[i] = range[i]
     
+    # Create edge arrays
     for i in arange(D):
         if isscalar(bins[i]):
-            nbin[i] = bins[i]
-            edges[i] = linspace(smin[i], smax[i], nbin[i]+1)
+            nbin[i] = bins[i] + 2 # +2 for outlier bins
+            edges[i] = linspace(smin[i], smax[i], nbin[i]-1)
         else:
             edges[i] = asarray(bins[i], float)
-            nbin[i] = len(edges[i])-1
-            
-
-
+            nbin[i] = len(edges[i])+1  # +1 for outlier bins
+        dedges[i] = diff(edges[i])
+        
+    nbin =  asarray(nbin)
+    
+    # Compute the bin number each sample falls into. 
     Ncount = {}
-    nbin =  asarray(nbin)
-
     for i in arange(D):
         Ncount[i] = digitize(sample[:,i], edges[i]) 
-        dedges[i] = diff(edges[i])
-    # Remove values falling outside of bins
-    # Values that fall on an edge are put in the right bin.
+    
+    # Using digitize, values that fall on an edge are put in the right bin.
     # For the rightmost bin, we want values equal to the right 
     # edge to be counted in the last bin, and not as an outlier.
     outliers = zeros(N, int)
     for i in arange(D):
+        # Rounding precision
         decimal = int(-log10(dedges[i].min())) +6
+        # Find which points are on the rightmost edge.
         on_edge = where(around(sample[:,i], decimal) == around(edges[i][-1], decimal))[0]
+        # Shift these points one bin to the left. 
         Ncount[i][on_edge] -= 1
-        outliers += (Ncount[i] == 0) | (Ncount[i] == nbin[i]+1)
-    indices = where(outliers == 0)[0]
-    for i in arange(D):
-        Ncount[i] = Ncount[i][indices] - 1
-    N = len(indices)
-    
+        
     # Flattened histogram matrix (1D)
-    hist = zeros(nbin.prod(), int)
-    
+    hist = zeros(nbin.prod(), float)
+
     # Compute the sample indices in the flattened histogram matrix.
     ni = nbin.argsort()
     shape = []
     xy = zeros(N, int)
     for i in arange(0, D-1):
         xy += Ncount[ni[i]] * nbin[ni[i+1:]].prod()
-        
     xy += Ncount[ni[-1]]
 
     # Compute the number of repetitions in xy and assign it to the flattened histmat. 
     if len(xy) == 0:
-        return zeros(nbin, int)
+        return zeros(nbin-2, int), edges
         
-    flatcount = bincount(xy)
+    flatcount = bincount(xy, weights)
     a = arange(len(flatcount))
     hist[a] = flatcount
 
@@ -216,11 +222,16 @@
         if (hist.shape == nbin).all():
             break
     
+    # Remove outliers (indices 0 and -1 for each dimension).
+    core = D*[slice(1,-1)]
+    hist = hist[core]
+    
+    # Normalize if normed is True
     if normed:
         s = hist.sum()
         for i in arange(D):
             shape = ones(D, int)
-            shape[i] = nbin[i]
+            shape[i] = nbin[i]-2
             hist = hist / dedges[i].reshape(shape)
         hist /= s

-------------------------------------------------------------------------
Using Tomcat but need to do more? Need to support web services, security?
Get stuff done quickly with pre-integrated technology to make your job easier
Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo
http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&dat=121642

_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/numpy-discussion

Re: [Numpy-discussion] A way to compute a 2d histogram

Reply via email to