Hi,

Once again there has been a thread on the numpy/scipy mailing lists
requesting (essentially) some form of spatial data structure. Pointers
have been posted to ANN (sadly LGPLed and in C++) as well as a handful
of pure-python implementations of kd-trees. I suggest the creation of
a new submodule of scipy, scipy.spatial, to contain spatial data
structures and algorithms. Specifically, I propose it contain a
kd-tree implementation providing nearest-neighbor, approximate
nearest-neighbor, and all-points-near queries. I have a few other
suggestions for things it might contain, but kd-trees seem like a good
first step.

2008/9/27 Nathan Bell <[EMAIL PROTECTED]>:
> On Sat, Sep 27, 2008 at 11:18 PM, Anne Archibald
> <[EMAIL PROTECTED]> wrote:
>>
>> I think a kd-tree implementation would be a valuable addition to
>> scipy, perhaps in a submodule scipy.spatial that might eventually
>> contain other spatial data structures and algorithms. What do you
>> think? Should we have one? Should it be based on Sturla Molden's code,
>> if the license permits? I am willing to contribute one, if not.
>
> +1

Judging that your vote and mine are enough in the absence of
dissenting voices, I have written an implementation based on yours,
Sturla Molden's, and the algorithms described by the authors of the
ANN library. Before integrating it into scipy, though, I'd like to
send it around for comments.

Particular issues:

* It's pure python right now, but with some effort it could be
partially or completely cythonized. This is probably a good idea in
the long run. In the meantime I have crudely vectorized it so that
users can at least avoid looping in their own code.
* It is able to return the r nearest neighbors, optionally subject to
a maximum distance limit, as well as approximate nearest neighbors;
a short usage sketch follows this list.
* It is not currently set up to conveniently return all points within
some fixed distance of the target point, but this could easily be
added.
* It returns distances and indices of nearest neighbors in the original array.
* The testing code is, frankly, a mess. I need to look into using nose
in a sensible fashion.
* The license is the scipy license.
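
Here is the usage sketch mentioned above (the data and query points are
just illustrative, assuming the attached file is importable as kdtree):

import numpy as np
import kdtree   # the attached module

data = np.random.randn(1000, 3)           # 1000 points in 3 dimensions
T = kdtree.KDTree(data, leafsize=10)

# single query point: 4 nearest neighbors, restricted to distance <= 0.5
d, i = T.query(np.zeros(3), k=4, distance_upper_bound=0.5)
# d and i have shape (4,); any missing neighbors are padded with
# d == inf and i == T.n+1

# vectorized queries: one query point per row
xs = np.random.randn(50, 3)
dd, ii = T.query(xs, k=4)                  # dd.shape == ii.shape == (50, 4)

# approximate nearest neighbors: allow 10% slack in the distances
dd_approx, ii_approx = T.query(xs, k=4, eps=0.1)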

I am particularly concerned about providing a convenient return
format. The natural return from a query is a list of neighbors, since
it may have variable length (there may not be r elements in the tree,
or you might have supplied a maximum distance within which there are
fewer than r points). For a single query, it's simple to return a
Python list (should it be sorted? currently it's a heap). But if you
want to vectorize the process, storing the results in an array becomes
cumbersome. One option is an object array full of lists; another,
which I currently use, is an array with nonexistent points marked
with an infinite distance and an invalid index. A third would be to
return masked arrays. How do you recommend handling this variable-sized
(or potentially variable-sized) output? A brief sketch of the first two
options, as seen from the caller's side, follows.
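
(Again assuming the attached file is importable as kdtree; the sizes
are arbitrary.)

import numpy as np
import kdtree

data = np.random.randn(200, 3)
T = kdtree.KDTree(data)
xs = np.random.randn(5, 3)

# current format: fixed-shape arrays, with missing neighbors marked by
# an infinite distance and the invalid index T.n+1
dd, ii = T.query(xs, k=10, distance_upper_bound=0.5)
valid = np.isfinite(dd)

# recovering the variable-length neighbor lists, one per query point
neighbor_lists = [ii[j][valid[j]] for j in range(len(xs))]

# the object-array alternative: each entry holds a Python list
obj = np.empty(len(xs), dtype=object)
for j in range(len(xs)):
    obj[j] = list(ii[j][valid[j]])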

> If you're implementing one, I would highly recommend the
> "left-balanced" kd-tree.
> http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/2535/pdf/imm2535.pdf

Research suggests that, at least in high dimensions, a more geometric
balancing criterion can produce vastly better results. I used the
"sliding midpoint" rule, which doesn't allow such numerical balancing
but does ensure that you don't end up with too many long, skinny cells
(since a query sphere tends to cut very many of these). A
one-dimensional illustration of the rule follows.
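
As a simplified one-dimensional illustration (a sketch of the idea, not
code lifted from the attachment): if a cell spans [0, 10] along the
splitting axis but all its points lie near 10, the plain midpoint split
at 5 would leave one side empty, so the split slides onto the nearest
point:

import numpy as np

def sliding_midpoint_split(coords, lo, hi):
    # split the points of one cell spanning [lo, hi] along a single axis
    split = (lo + hi) / 2.0
    less = coords <= split
    if not np.any(less):           # every point is above the midpoint:
        split = coords.min()       # slide the plane up to the nearest point
        less = coords <= split
    elif np.all(less):             # every point is below the midpoint:
        split = coords.max()       # slide it down instead
        less = coords < split
    return split, less

print(sliding_midpoint_split(np.array([9.0, 9.5, 10.0]), 0.0, 10.0))
# -> (9.0, array([ True, False, False])) rather than an empty "less" side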

I've also been thinking about what else would go in scipy.spatial. I
think it would be valuable to have a reasonably efficient distance
matrix function (we seem to get that question a lot, and the answer's
not trivial), as well as a sparse distance matrix function based on
the kd-trees; a brute-force sketch of the former follows. The module
could potentially also swallow the (currently sandboxed?) Delaunay
code.
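
For the dense distance matrix, something along the following lines
would do as a starting point (a brute-force broadcasting sketch; the
name and signature are hypothetical):

import numpy as np

def distance_matrix(x, y, p=2):
    # Minkowski p-distances between the rows of x (shape (m,k)) and the
    # rows of y (shape (n,k)), returned as an (m,n) array
    x = np.asarray(x)
    y = np.asarray(y)
    diff = np.abs(x[:, np.newaxis, :] - y[np.newaxis, :, :])
    if p == np.inf:
        return np.amax(diff, axis=-1)
    elif p == 1:
        return np.sum(diff, axis=-1)
    else:
        return np.sum(diff**p, axis=-1)**(1./p)

print(distance_matrix(np.random.randn(5, 3), np.random.randn(7, 3)).shape)
# -> (5, 7)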

Anne
# Copyright Anne M. Archibald 2008
# Released under the scipy license
import numpy as np
from heapq import heappush, heappop

def distance_p(x,y,p=2):
    """Compute the p-th power of the Minkowski p-distance between x and y
    (the distance itself for p=1 and p=infinity)."""
    if p==np.inf:
        return np.amax(np.abs(y-x),axis=-1)
    elif p==1:
        return np.sum(np.abs(y-x),axis=-1)
    else:
        return np.sum(np.abs(y-x)**p,axis=-1)

def distance(x,y,p=2):
    """Compute the Minkowski p-distance between x and y."""
    if p==np.inf or p==1:
        return distance_p(x,y,p)
    else:
        return distance_p(x,y,p)**(1./p)


class KDTree(object):
    """kd-tree for quick nearest-neighbor lookup

    This class provides an index into a set of k-dimensional points
    which can be used to rapidly look up the nearest neighbors of any
    point. 

    The algorithm used is described in Maneewongvatana and Mount 1999. 
    The general idea is that the kd-tree is a binary trie, each of whose
    nodes represents an axis-aligned hyperrectangle. Each node specifies
    an axis and splits the set of points based on whether their coordinate
    along that axis is greater than or less than a particular value. 

    During construction, the axis and splitting point are chosen by the 
    "sliding midpoint" rule, which ensures that the cells do not all
    become long and thin. 

    The tree can be queried for the r closest neighbors of any given point 
    (optionally returning only those within some maximum distance of the 
    point). It can also be queried, with a substantial gain in efficiency, 
    for the r approximate closest neighbors.

    For large dimensions (20 is already large) do not expect this to run 
    significantly faster than brute force. High-dimensional nearest-neighbor
    queries are a substantial open problem in computer science.
    """

    def __init__(self, data, leafsize=10):
        """Construct a kd-tree.

        Parameters:
        ===========

        data : array-like, shape (n,k)
            The data points to be indexed. This array is not copied, and
            so modifying this data will result in bogus results.
        leafsize : positive integer
            The number of points at which the algorithm switches over to
            brute-force.
        """
        self.data = np.asarray(data)
        self.n, self.k = np.shape(self.data)
        self.leafsize = int(leafsize)
        if self.leafsize<1:
            raise ValueError("leafsize must be at least 1")
        self.maxes = np.amax(self.data,axis=0)
        self.mins = np.amin(self.data,axis=0)

        self.tree = self.__build(np.arange(self.n), self.maxes, self.mins)

    class node(object):
        pass
    class leafnode(node):
        def __init__(self, idx):
            self.idx = idx
    class innernode(node):
        def __init__(self, split_dim, split, less, greater):
            self.split_dim = split_dim
            self.split = split
            self.less = less
            self.greater = greater
    
    def __build(self, idx, maxes, mins):
        if len(idx)<=self.leafsize:
            return KDTree.leafnode(idx)
        else:
            data = self.data[idx]
            #maxes = np.amax(data,axis=0)
            #mins = np.amin(data,axis=0)
            d = np.argmax(maxes-mins)
            maxval = maxes[d]
            minval = mins[d]
            if maxval==minval:
                # all points are identical; warn user?
                return KDTree.leafnode(idx)
            data = data[:,d]

            # sliding midpoint rule; see Maneewongvatana and Mount 1999
            # for arguments that this is a good idea.
            split = (maxval+minval)/2
            less_idx = np.nonzero(data<=split)[0]
            greater_idx = np.nonzero(data>split)[0]
            if len(less_idx)==0:
                split = np.amin(data)
                less_idx = np.nonzero(data<=split)[0]
                greater_idx = np.nonzero(data>split)[0]
            if len(greater_idx)==0:
                split = np.amax(data)
                less_idx = np.nonzero(data<split)[0]
                greater_idx = np.nonzero(data>=split)[0]
            if len(less_idx)==0:
                # _still_ zero? all must have the same value
                assert np.all(data==data[0]), "Troublesome data array: %s" % data
                split = data[0]
                less_idx = np.arange(len(data)-1)
                greater_idx = np.array([len(data)-1])

            lessmaxes = np.copy(maxes)
            lessmaxes[d] = split
            greatermins = np.copy(mins)
            greatermins[d] = split
            return KDTree.innernode(d, split, 
                    self.__build(idx[less_idx],lessmaxes,mins),
                    self.__build(idx[greater_idx],maxes,greatermins))

    def __query(self, x, k=1, eps=0, p=2, distance_upper_bound=np.inf):
        
        # per-axis distances from x to the nearest face of the root cell;
        # for finite p these are stored as p-th powers so they can simply
        # be summed to give the p-th power of the distance to the cell
        side_distances = [max(0,x[i]-self.maxes[i],self.mins[i]-x[i]) for i in range(self.k)]
        if p==np.inf:
            min_distance = max(side_distances)
        else:
            side_distances = [s**p for s in side_distances]
            min_distance = sum(side_distances)
        # priority queue for chasing nodes
        # entries are:
        #  minimum distance between the cell and the target
        #   (p-th power for finite p)
        #  per-axis distances between the nearest sides of the cell and
        #   the target (p-th powers for finite p)
        #  the head node of the cell
        q = [(min_distance,
              tuple(side_distances),
              self.tree)]
        # priority queue for the nearest neighbors
        # furthest known neighbor first
        # entries are (-distance**p, i)
        neighbors = []

        # epsfac scales the pruning tests below: a cell (or far child) is
        # skipped once it cannot contain a point closer than
        # distance_upper_bound/(1+eps), which is what makes the search
        # approximate when eps>0. Like the distances, it is kept in
        # p-th-power form for finite p.
        if eps==0:
            epsfac = 1
        elif p==np.inf:
            epsfac = 1/(1+eps)
        else:
            epsfac = 1/(1+eps)**p

        if p!=np.inf and distance_upper_bound!=np.inf:
            distance_upper_bound = distance_upper_bound**p

        while q:
            min_distance, side_distances, node = heappop(q)
            if isinstance(node, KDTree.leafnode):
                # brute-force
                data = self.data[node.idx]
                a = np.abs(data-x[np.newaxis,:])
                if p==np.inf:
                    ds = np.amax(a,axis=1)
                elif p==1:
                    ds = np.sum(a,axis=1)
                else:
                    ds = np.sum(a**p,axis=1)
                for i in range(len(ds)):
                    if ds[i]<distance_upper_bound:
                        if len(neighbors)==k:
                            heappop(neighbors)
                        heappush(neighbors, (-ds[i], node.idx[i]))
                        if len(neighbors)==k:
                            distance_upper_bound = -neighbors[0][0]
            else:
                # we don't push cells that are too far onto the queue at all,
                # but since the distance_upper_bound decreases, we might get 
                # here even if the cell's too far
                if min_distance>distance_upper_bound*epsfac:
                    # since this is the nearest cell, we're done, bail out
                    break
                # compute minimum distances to the children and push them on
                if x[node.split_dim]<node.split:
                    near, far = node.less, node.greater
                else:
                    near, far = node.greater, node.less

                # near child is at the same distance as the current node
                heappush(q,(min_distance, side_distances, near))

                # far child is further by an amount depending only
                # on the split value
                sd = list(side_distances)
                if p == np.inf:
                    min_distance = max(min_distance, abs(node.split-x[node.split_dim]))
                elif p == 1:
                    sd[node.split_dim] = np.abs(node.split-x[node.split_dim])
                    min_distance = min_distance - side_distances[node.split_dim] + sd[node.split_dim]
                else:
                    sd[node.split_dim] = np.abs(node.split-x[node.split_dim])**p
                    min_distance = min_distance - side_distances[node.split_dim] + sd[node.split_dim]

                # far child might be too far, if so, don't bother pushing it
                if min_distance<=distance_upper_bound*epsfac:
                    heappush(q,(min_distance, tuple(sd), far))

        if p==np.inf:
            return [(-d,i) for (d,i) in neighbors]
        else:
            return [((-d)**(1./p),i) for (d,i) in neighbors]

    def query(self, x, k=1, eps=0, p=2, distance_upper_bound=np.inf):
        """query the kd-tree for nearest neighbors

        Parameters:
        ===========

        x : array-like, last dimension self.k
            An array of points to query.
        k : integer
            The number of nearest neighbors to return.
        eps : nonnegative float
            Return approximate nearest neighbors; the kth returned value 
            is guaranteed to be no further than (1+eps) times the 
            distance to the real kth nearest neighbor.
        p : float, 1<=p<=infinity
            Which Minkowski p-norm to use. 
            1 is the sum-of-absolute-values "Manhattan" distance
            2 is the usual Euclidean distance
            infinity is the maximum-coordinate-difference distance
        distance_upper_bound : nonnegative float
            Return only neighbors within this distance. This is used to prune
            tree searches, so if you are doing a series of nearest-neighbor
            queries, it may help to supply the distance to the nearest neighbor
            of the most recent point.

        Returns:
        ========
        
        d : array of floats
            The distances to the nearest neighbors. 
            If x has shape tuple+(self.k,), then d has shape tuple+(k,).
            Missing neighbors are indicated with infinite distances.
        i : array of integers
            The locations of the neighbors in self.data.
            If x has shape tuple+(self.k,), then i has shape tuple+(k,).
            Missing neighbors are indicated with self.n+1.
        """
        x = np.asarray(x)
        if np.shape(x)[-1] != self.k:
            raise ValueError("x must consist of vectors of length %d but has shape %s" % (self.k, np.shape(x)))
        if p<1:
            raise ValueError("Only p-norms with 1<=p<=infinity permitted")
        retshape = np.shape(x)[:-1]
        dd = np.empty(retshape+(k,),dtype=float)
        dd.fill(np.inf)
        ii = np.empty(retshape+(k,),dtype=int)
        ii.fill(self.n+1)
        for c in np.ndindex(retshape):
            hits = self.__query(x[c], k=k, eps=eps, p=p, distance_upper_bound=distance_upper_bound)
            for j in range(len(hits)):
                dd[c+(j,)], ii[c+(j,)] = hits[j]
        return dd, ii



# Copyright Anne M. Archibald 2008
# Released under the scipy license
import unittest
from numpy.testing import assert_equal, assert_array_equal, assert_almost_equal

import numpy as np
import kdtree

class CheckSmall(unittest.TestCase):
    def setUp(self):
        self.data = np.array([[0,0,0],
                              [0,0,1],
                              [0,1,0],
                              [0,1,1],
                              [1,0,0],
                              [1,0,1],
                              [1,1,0],
                              [1,1,1]])
        self.kdtree = kdtree.KDTree(self.data)

    def test_nearest(self):
        assert_array_equal(
                self.kdtree.query((0,0,0.1), 1),
                ([0.1],[0]))
    def test_nearest_two(self):
        assert_array_equal(
                self.kdtree.query((0,0,0.1), 2),
                ([0.9,0.1],[1,0]))
class CheckSmallNonLeaf(unittest.TestCase):
    def setUp(self):
        self.data = np.array([[0,0,0],
                              [0,0,1],
                              [0,1,0],
                              [0,1,1],
                              [1,0,0],
                              [1,0,1],
                              [1,1,0],
                              [1,1,1]])
        self.kdtree = kdtree.KDTree(self.data,leafsize=1)

    def test_nearest(self):
        assert_array_equal(
                self.kdtree.query((0,0,0.1), 1),
                ([0.1],[0]))
    def test_nearest_two(self):
        assert_array_equal(
                self.kdtree.query((0,0,0.1), 2),
                ([0.9,0.1],[1,0]))

class CheckRandom(unittest.TestCase):
    def setUp(self):
        self.n = 1000
        self.k = 4
        self.data = np.random.randn(self.n, self.k)
        self.kdtree = kdtree.KDTree(self.data)

    def test_nearest(self):
        x = np.random.randn(self.k)
        d, i = self.kdtree.query(x, 1)
        assert_almost_equal(d**2,np.sum((x-self.data[i])**2))
        eps = 1e-8
        assert np.all(np.sum((self.data-x[np.newaxis,:])**2,axis=1)>d**2-eps)
        
    def test_m_nearest(self):
        x = np.random.randn(self.k)
        m = 10
        dd, ii = self.kdtree.query(x, m)
        d = np.amax(dd)
        i = ii[np.argmax(dd)]
        assert_almost_equal(d**2,np.sum((x-self.data[i])**2))
        eps = 1e-8
        assert_equal(np.sum(np.sum((self.data-x[np.newaxis,:])**2,axis=1)<d**2+eps),m)

    def test_points_near(self):
        x = np.random.randn(self.k)
        d = 0.2
        dd, ii = self.kdtree.query(x, k=self.kdtree.n, distance_upper_bound=d)
        eps = 1e-8
        hits = 0
        for near_d, near_i in zip(dd,ii):
            if near_d==np.inf:
                continue
            hits += 1
            assert_almost_equal(near_d**2,np.sum((x-self.data[near_i])**2))
            assert near_d<d+eps, "near_d=%g should be less than %g" % (near_d,d)
        assert_equal(np.sum(np.sum((self.data-x[np.newaxis,:])**2,axis=1)<d**2+eps),hits)

    def test_points_near_l1(self):
        x = np.random.randn(self.k)
        d = 0.2
        dd, ii = self.kdtree.query(x, k=self.kdtree.n, p=1, distance_upper_bound=d)
        eps = 1e-8
        hits = 0
        for near_d, near_i in zip(dd,ii):
            if near_d==np.inf:
                continue
            hits += 1
            assert_almost_equal(near_d,kdtree.distance(x,self.data[near_i],1))
            assert near_d<d+eps, "near_d=%g should be less than %g" % (near_d,d)
        assert_equal(np.sum(kdtree.distance(self.data,x,1)<d+eps),hits)
    def test_points_near_linf(self):
        x = np.random.randn(self.k)
        d = 0.2
        dd, ii = self.kdtree.query(x, k=self.kdtree.n, p=np.inf, distance_upper_bound=d)
        eps = 1e-8
        hits = 0
        for near_d, near_i in zip(dd,ii):
            if near_d==np.inf:
                continue
            hits += 1
            assert_almost_equal(near_d,kdtree.distance(x,self.data[near_i],np.inf))
            assert near_d<d+eps, "near_d=%g should be less than %g" % (near_d,d)
        assert_equal(np.sum(kdtree.distance(self.data,x,np.inf)<d+eps),hits)

    def test_approx(self):
        x = np.random.randn(self.k)
        m = 10
        eps = 0.1
        d_real, i_real = self.kdtree.query(x, m)
        d, i = self.kdtree.query(x, m, eps=eps)
        assert np.all(d<=d_real*(1+eps))

        
    
if __name__=='__main__':
    unittest.main()


# Demonstration script: build a 2-d kd-tree with the attached module and
# plot the cells produced by the sliding-midpoint splits.
import numpy as np
import pylab as plt

import kdtree

def plot_kdtree(T):
    maxes = np.amax(T.data,axis=0)
    mins = np.amin(T.data,axis=0)
    def plot_node(n,maxes,mins):
        plt.plot([maxes[0],maxes[0],mins[0],mins[0],maxes[0]],
                 [maxes[1],mins[1],mins[1],maxes[1],maxes[1]],
                 "-",color="blue")
        if isinstance(n,kdtree.KDTree.leafnode):
            assert np.all(T.data[n.idx]<=maxes)
            assert np.all(T.data[n.idx]>=mins)
            plt.plot(T.data[n.idx,0],T.data[n.idx,1],
                     "o",color="black")
        else:
            lessmaxes = np.copy(maxes)
            lessmaxes[n.split_dim] = n.split
            plot_node(n.less,lessmaxes,mins)
            greatermins = np.copy(mins)
            greatermins[n.split_dim] = n.split
            plot_node(n.greater,maxes,greatermins)
    plot_node(T.tree, maxes, mins)
    plt.xlim(np.amin(mins),np.amax(maxes))
    plt.ylim(np.amin(mins),np.amax(maxes))
    plt.axis('off')
    plt.gcf().set_size_inches((5,5))

if __name__=='__main__':
    s = 2
    data = np.concatenate((np.random.randn(100,2),
                           np.random.randn(100,2)+(2*s,2*s),
                           np.random.randn(100,2)+(0,-3*s),
                           np.random.randn(100,2)+(-3*s,-1*s)),axis=0)
    T = kdtree.KDTree(data,leafsize=1)
    plot_kdtree(T)
    plt.savefig('kdtree.png')