On Wed, May 2, 2012 at 5:27 PM, Kevin Jacobs <jac...@bioinformed.com> <bioinfor...@gmail.com> wrote: > On Wed, May 2, 2012 at 5:45 PM, Moroney, Catherine M (388D) > <catherine.m.moro...@jpl.nasa.gov> wrote: >> >> Thanks to Perry for some very useful off-list conversation. I realize >> that >> I wasn't being clear at all in my earlier description of the problem so >> here it is >> in a nutshell: >> >> Find the best match in an array t(5000, 7) for a single vector e(7). Now >> scale >> it up so e is (128, 512, 7) and I want to return a (128, 512) array of the >> t-identifiers >> that are the best match for e. "Best match" is defined as the minimum >> Euclidean distance. >> > > It sounds like you want to find the nearest neighbor to a point in a > high-dimensional space. This sounds like a job for a spacial data structure > like a KD-tree. See:
> http://docs.scipy.org/doc/scipy/reference/spatial.html > http://mloss.org/software/view/143/ > http://www.mrzv.org/software/pyANN/ > etc. In general this is a good suggestion - I was going to mention it earlier - but I think for this particular problem it is not better than the "brute force" and argmin() NumPy approach. On my laptop, the KDTree query is about a factor of 7 slower (ignoring the time cost to create the KDTree) In [45]: def foo1(element, targets): distance_squared = ((element-targets)**2).sum(1) min_index = distance_squared.argmin() return sqrt(distance_squared[min_index]), min_index ....: In [46]: def foo2(element, T): return T.query(element) In [47]: element = np.arange(1,8) In [48]: targets = np.random.uniform(0,8,(5000,7)) In [49]: T = scipy.spatial.KDTree(targets) In [50]: %timeit foo1(element, targets) 1000 loops, best of 3: 401 us per loop In [51]: %timeit foo2(element, T) 100 loops, best of 3: 2.92 ms per loop Just to make sure they say the same thing: In [53]: foo1(element, targets) Out[53]: (1.8173671152898632, 570) In [54]: foo2(element, T) Out[54]: (1.8173671152898632, 570) I think KDTree is more optimal for larger searches (more than 5000 elements), and fewer dimensions. For example, with 500,000 elements and 2 dimensions, I get 34 ms for NumPy and 2 ms for the KDtree. Back to the original question, for 400 us per search, even over 128x512 elements that would be 26 seconds total, I think? That might not be too bad. Cheers, Aronne _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion