Hi, Once again there has been a thread on the numpy/scipy mailing lists requesting (essentially) some form of spatial data structure. Pointers have been posted to ANN (sadly LGPLed and in C++) as well as a handful of pure-python implementations of kd-trees. I suggest the creation of a new submodule of scipy, scipy.spatial, to contain spatial data structures and algorithms. Specifically, I propose it contain a kd-tree implementation providing nearest-neighbor, approximate nearest-neighbor, and all-points-near queries. I have a few other suggestions for things it might contain, but kd-trees seem like a good first step.
2008/9/27 Nathan Bell <[EMAIL PROTECTED]>: > On Sat, Sep 27, 2008 at 11:18 PM, Anne Archibald > <[EMAIL PROTECTED]> wrote: >> >> I think a kd-tree implementation would be a valuable addition to >> scipy, perhaps in a submodule scipy.spatial that might eventually >> contain other spatial data structures and algorithms. What do you >> think? Should we have one? Should it be based on Sturla Molden's code, >> if the license permits? I am willing to contribute one, if not. > > +1 Judging that your vote and mine are enough in the absence of dissenting voices, I have written an implementation based on yours, Sturla Molden's, and the algorithms described by the authors of the ANN library. Before integrating it into scipy, though, I'd like to send it around for comments. Particular issues: * It's pure python right now, but with some effort it could be partially or completely cythonized. This is probably a good idea in the long run. In the meantime I have crudely vectorized it so that users can at least avoid looping in their own code. * It is able to return the r nearest neighbors, optionally subject to a maximum distance limit, as well as approximate nearest neighbors. * It is not currently set up to conveniently return all points within some fixed distance of the target point, but this could easily be added. * It returns distances and indices of nearest neighbors in the original array. * The testing code is, frankly, a mess. I need to look into using nose in a sensible fashion. * The license is the scipy license. I am particularly concerned about providing a convenient return format. The natural return from a query is a list of neighbors, since it may have variable length (there may not be r elements in the tree, or you might have supplied a maximum distance which doesn't contain r points). For a single query, it's simple to return a python list (should it be sorted? currently it's a heap). But if you want to vectorize the process, storing the results in an array becomes cumbersome. One option is an object array full of lists; another, which, I currently use, is an array with nonexistent points marked with an infinite distance and an invalid index. A third would be to return masked arrays. How do you recommend handling this variable (or potentially-variable) sized output? > If you're implementing one, I would highly recommend the > "left-balanced" kd-tree. > http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/2535/pdf/imm2535.pdf Research suggests that at least in high dimension a more geometric balancing criterion can produce vastly better results. I used the "sliding midpoint" rule, which doesn't allow such a numerical balancing but does ensure that you don't have too many long skinny cells (since a sphere tends to cut very many of these). I've also been thinking about what else would go in scipy.spatial. I think it would be valuable to have a reasonably efficient distance matrix function (we seem to get that question a lot, and the answer's not trivial) as well as a sparse distance matrix function based on the kd-trees. The module could potentially also swallow the (currently sandboxed?) delaunay code. Anne
# Copyright Anne M. Archibald 2008 # Released under the scipy license import numpy as np from heapq import heappush, heappop def distance_p(x,y,p=2): if p==np.inf: return np.amax(np.abs(y-x),axis=-1) elif p==1: return np.sum(np.abs(y-x),axis=-1) else: return np.sum(np.abs(y-x)**p,axis=-1) def distance(x,y,p=2): if p==np.inf or p==1: return distance_p(x,y,p) else: return distance_p(x,y,p)**(1./p) class KDTree(object): """kd-tree for quick nearest-neighbor lookup This class provides an index into a set of k-dimensional points which can be used to rapidly look up the nearest neighbors of any point. The algorithm used is described in Maneewongvatana and Mount 1999. The general idea is that the kd-tree is a binary trie, each of whose nodes represents an axis-aligned hyperrectangle. Each node specifies an axis and splits the set of points based on whether their coordinate along that axis is greater than or less than a particular value. During construction, the axis and splitting point are chosen by the "sliding midpoint" rule, which ensures that the cells do not all become long and thin. The tree can be queried for the r closest neighbors of any given point (optionally returning only those within some maximum distance of the point). It can also be queried, with a substantial gain in efficiency, for the r approximate closest neighbors. For large dimensions (20 is already large) do not expect this to run significantly faster than brute force. High-dimensional nearest-neighbor queries are a substantial open problem in computer science. """ def __init__(self, data, leafsize=10): """Construct a kd-tree. Parameters: =========== data : array-like, shape (n,k) The data points to be indexed. This array is not copied, and so modifying this data will result in bogus results. leafsize : positive integer The number of points at which the algorithm switches over to brute-force. """ self.data = np.asarray(data) self.n, self.k = np.shape(self.data) self.leafsize = int(leafsize) if self.leafsize<1: raise ValueError("leafsize must be at least 1") self.maxes = np.amax(self.data,axis=0) self.mins = np.amin(self.data,axis=0) self.tree = self.__build(np.arange(self.n), self.maxes, self.mins) class node(object): pass class leafnode(node): def __init__(self, idx): self.idx = idx class innernode(node): def __init__(self, split_dim, split, less, greater): self.split_dim = split_dim self.split = split self.less = less self.greater = greater def __build(self, idx, maxes, mins): if len(idx)<=self.leafsize: return KDTree.leafnode(idx) else: data = self.data[idx] #maxes = np.amax(data,axis=0) #mins = np.amin(data,axis=0) d = np.argmax(maxes-mins) maxval = maxes[d] minval = mins[d] if maxval==minval: # all points are identical; warn user? return KDTree.leafnode(idx) data = data[:,d] # sliding midpoint rule; see Maneewongvatana and Mount 1999 # for arguments that this is a good idea. split = (maxval+minval)/2 less_idx = np.nonzero(data<=split)[0] greater_idx = np.nonzero(data>split)[0] if len(less_idx)==0: split = np.amin(data) less_idx = np.nonzero(data<=split)[0] greater_idx = np.nonzero(data>split)[0] if len(greater_idx)==0: split = np.amax(data) less_idx = np.nonzero(data<split)[0] greater_idx = np.nonzero(data>=split)[0] if len(less_idx)==0: # _still_ zero? all must have the same value assert np.all(data==data[0]), "Troublesome data array: %s" % data split = data[0] less_idx = np.arange(len(data)-1) greater_idx = np.array([len(data)-1]) lessmaxes = np.copy(maxes) lessmaxes[d] = split greatermins = np.copy(mins) greatermins[d] = split return KDTree.innernode(d, split, self.__build(idx[less_idx],lessmaxes,mins), self.__build(idx[greater_idx],maxes,greatermins)) def __query(self, x, k=1, eps=0, p=2, distance_upper_bound=np.inf): side_distances = [max(0,x[i]-self.maxes[i],self.mins[i]-x[i]) for i in range(self.k)] # priority queue for chasing nodes # entries are: # minimum distance between the cell and the target # distances between the nearest side of the cell and the target # the head node of the cell q = [(distance_p(np.array(side_distances),0), tuple(side_distances), self.tree)] # priority queue for the nearest neighbors # furthest known neighbor first # entries are (-distance**p, i) neighbors = [] if eps==0: epsfac=1 elif p==np.inf: epsfac = 1/(1+eps) else: epsfac = 1/(1+eps)**p if p!=np.inf and distance_upper_bound!=np.inf: distance_upper_bound = distance_upper_bound**p while q: min_distance, side_distances, node = heappop(q) if isinstance(node, KDTree.leafnode): # brute-force data = self.data[node.idx] a = np.abs(data-x[np.newaxis,:]) if p==np.inf: ds = np.amax(a,axis=1) elif p==1: ds = np.sum(a,axis=1) else: ds = np.sum(a**p,axis=1) for i in range(len(ds)): if ds[i]<distance_upper_bound: if len(neighbors)==k: heappop(neighbors) heappush(neighbors, (-ds[i], node.idx[i])) if len(neighbors)==k: distance_upper_bound = -neighbors[0][0] else: # we don't push cells that are too far onto the queue at all, # but since the distance_upper_bound decreases, we might get # here even if the cell's too far if min_distance>distance_upper_bound*epsfac: # since this is the nearest cell, we're done, bail out break # compute minimum distances to the children and push them on if x[node.split_dim]<node.split: near, far = node.less, node.greater else: near, far = node.greater, node.less # near child is at the same distance as the current node heappush(q,(min_distance, side_distances, near)) # far child is further by an amount depending only # on the split value sd = list(side_distances) if p == np.inf: min_distance = max(min_distance, abs(node.split-x[node.split_dim])) elif p == 1: sd[node.split_dim] = np.abs(node.split-x[node.split_dim]) min_distance = min_distance - side_distances[node.split_dim] + sd[node.split_dim] else: sd[node.split_dim] = np.abs(node.split-x[node.split_dim])**p min_distance = min_distance - side_distances[node.split_dim] + sd[node.split_dim] # far child might be too far, if so, don't bother pushing it if min_distance<=distance_upper_bound*epsfac: heappush(q,(min_distance, tuple(sd), far)) if p==np.inf: return [(-d,i) for (d,i) in neighbors] else: return [((-d)**(1./p),i) for (d,i) in neighbors] def query(self, x, k=1, eps=0, p=2, distance_upper_bound=np.inf): """query the kd-tree for nearest neighbors Parameters: =========== x : array-like, last dimension self.k An array of points to query. k : integer The number of nearest neighbors to return. eps : nonnegative float Return approximate nearest neighbors; the kth returned value is guaranteed to be no further than (1+eps) times the distance to the real kth nearest neighbor. p : float, 1<=p<=infinity Which Minkowski p-norm to use. 1 is the sum-of-absolute-values "Manhattan" distance 2 is the usual Euclidean distance infinity is the maximum-coordinate-difference distance distance_upper_bound : nonnegative float Return only neighbors within this distance. This is used to prune tree searches, so if you are doing a series of nearest-neighbor queries, it may help to supply the distance to the nearest neighbor of the most recent point. Returns: ======== d : array of floats The distances to the nearest neighbors. If x has shape tuple+(self.k,), then d has shape tuple+(k,). Missing neighbors are indicated with infinite distances. i : array of integers The locations of the neighbors in self.data. If x has shape tuple+(self.k,), then i has shape tuple+(k,). Missing neighbors are indicated with self.n+1. """ x = np.asarray(x) if np.shape(x)[-1] != self.k: raise ValueError("x must consist of vectors of length %d but has shape %s" % (self.k, np.shape(x))) if p<1: raise ValueError("Only p-norms with 1<=p<=infinity permitted") retshape = np.shape(x)[:-1] dd = np.empty(retshape+(k,),dtype=np.float) dd.fill(np.inf) ii = np.empty(retshape+(k,),dtype=np.int) ii.fill(self.n+1) for c in np.ndindex(retshape): hits = self.__query(x[c], k=k, p=p, distance_upper_bound=distance_upper_bound) for j in range(len(hits)): dd[c+(j,)], ii[c+(j,)] = hits[j] return dd, ii
# Copyright Anne M. Archibald 2008 # Released under the scipy license from numpy.testing import * import numpy as np import kdtree class CheckSmall(NumpyTestCase): def setUp(self): self.data = np.array([[0,0,0], [0,0,1], [0,1,0], [0,1,1], [1,0,0], [1,0,1], [1,1,0], [1,1,1]]) self.kdtree = kdtree.KDTree(self.data) def test_nearest(self): assert_array_equal( self.kdtree.query((0,0,0.1), 1), ([0.1],[0])) def test_nearest_two(self): assert_array_equal( self.kdtree.query((0,0,0.1), 2), ([0.9,0.1],[1,0])) class CheckSmallNonLeaf(NumpyTestCase): def setUp(self): self.data = np.array([[0,0,0], [0,0,1], [0,1,0], [0,1,1], [1,0,0], [1,0,1], [1,1,0], [1,1,1]]) self.kdtree = kdtree.KDTree(self.data,leafsize=1) def test_nearest(self): assert_array_equal( self.kdtree.query((0,0,0.1), 1), ([0.1],[0])) def test_nearest_two(self): assert_array_equal( self.kdtree.query((0,0,0.1), 2), ([0.9,0.1],[1,0])) class CheckRandom(NumpyTestCase): def setUp(self): self.n = 1000 self.k = 4 self.data = np.random.randn(self.n, self.k) self.kdtree = kdtree.KDTree(self.data) def test_nearest(self): x = np.random.randn(self.k) d, i = self.kdtree.query(x, 1) assert_almost_equal(d**2,np.sum((x-self.data[i])**2)) eps = 1e-8 assert np.all(np.sum((self.data-x[np.newaxis,:])**2,axis=1)>d**2-eps) def test_m_nearest(self): x = np.random.randn(self.k) m = 10 dd, ii = self.kdtree.query(x, m) d = np.amax(dd) i = ii[np.argmax(dd)] assert_almost_equal(d**2,np.sum((x-self.data[i])**2)) eps = 1e-8 assert_equal(np.sum(np.sum((self.data-x[np.newaxis,:])**2,axis=1)<d**2+eps),m) def test_points_near(self): x = np.random.randn(self.k) d = 0.2 dd, ii = self.kdtree.query(x, k=self.kdtree.n, distance_upper_bound=d) eps = 1e-8 hits = 0 for near_d, near_i in zip(dd,ii): if near_d==np.inf: continue hits += 1 assert_almost_equal(near_d**2,np.sum((x-self.data[near_i])**2)) assert near_d<d+eps, "near_d=%g should be less than %g" % (near_d,d) assert_equal(np.sum(np.sum((self.data-x[np.newaxis,:])**2,axis=1)<d**2+eps),hits) def test_points_near_l1(self): x = np.random.randn(self.k) d = 0.2 dd, ii = self.kdtree.query(x, k=self.kdtree.n, p=1, distance_upper_bound=d) eps = 1e-8 hits = 0 for near_d, near_i in zip(dd,ii): if near_d==np.inf: continue hits += 1 assert_almost_equal(near_d,kdtree.distance(x,self.data[near_i],1)) assert near_d<d+eps, "near_d=%g should be less than %g" % (near_d,d) assert_equal(np.sum(kdtree.distance(self.data,x,1)<d+eps),hits) def test_points_near_linf(self): x = np.random.randn(self.k) d = 0.2 dd, ii = self.kdtree.query(x, k=self.kdtree.n, p=np.inf, distance_upper_bound=d) eps = 1e-8 hits = 0 for near_d, near_i in zip(dd,ii): if near_d==np.inf: continue hits += 1 assert_almost_equal(near_d,kdtree.distance(x,self.data[near_i],np.inf)) assert near_d<d+eps, "near_d=%g should be less than %g" % (near_d,d) assert_equal(np.sum(kdtree.distance(self.data,x,np.inf)<d+eps),hits) def test_approx(self): x = np.random.randn(self.k) m = 10 eps = 0.1 d_real, i_real = self.kdtree.query(x, m) d, i = self.kdtree.query(x, m, eps=eps) assert np.all(d<=d_real*(1+eps)) if __name__=='__main__': import unittest unittest.main()
import numpy as np import pylab as plt import kdtree def plot_kdtree(T): maxes = np.amax(T.data,axis=0) mins = np.amin(T.data,axis=0) def plot_node(n,maxes,mins): plt.plot([maxes[0],maxes[0],mins[0],mins[0],maxes[0]], [maxes[1],mins[1],mins[1],maxes[1],maxes[1]], "-",color="blue") if isinstance(n,kdtree.KDTree.leafnode): assert np.all(T.data[n.idx]<=maxes) assert np.all(T.data[n.idx]>=mins) plt.plot(T.data[n.idx,0],T.data[n.idx,1], "o",color="black") else: lessmaxes = np.copy(maxes) lessmaxes[n.split_dim] = n.split plot_node(n.less,lessmaxes,mins) greatermins = np.copy(mins) greatermins[n.split_dim] = n.split plot_node(n.greater,maxes,greatermins) plot_node(T.tree, maxes, mins) plt.xlim(np.amin(mins),np.amax(maxes)) plt.ylim(np.amin(mins),np.amax(maxes)) plt.axis('off') plt.gcf().set_size_inches((5,5)) if __name__=='__main__': s = 2 data = np.concatenate((np.random.randn(100,2), np.random.randn(100,2)+(2*s,2*s), np.random.randn(100,2)+(0,-3*s), np.random.randn(100,2)+(-3*s,-1*s)),axis=0) T = kdtree.KDTree(data,leafsize=1) plot_kdtree(T) plt.savefig('kdtree.png')
_______________________________________________ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion