Re: Build k-NN graph for large dataset

2015-08-29 Thread Maruf Aytekin
Yes you need to use dimensionality reduction and/or locality sensitive hashing to reduce number of pairs to compare. There is also LSH implementation for collection of vectors I have just published here: https://github.com/marufaytekin/lsh-spark. Implementation i based on this paper: http://www.cs.

Re: Build k-NN graph for large dataset

2015-08-26 Thread Jaonary Rabarisoa
Thank you all for these links. I'll check them. On Wed, Aug 26, 2015 at 5:05 PM, Charlie Hack wrote: > +1 to all of the above esp. Dimensionality reduction and locality > sensitive hashing / min hashing. > > There's also an algorithm implemented in MLlib called DIMSUM which was > developed at T

Re: Build k-NN graph for large dataset

2015-08-26 Thread Charlie Hack
+1 to all of the above esp.  Dimensionality reduction and locality sensitive hashing / min hashing.  There's also an algorithm implemented in MLlib called DIMSUM which was developed at Twitter for this purpose. I've been meaning to try it and would be interested to hear about results you get. 

Re: Build k-NN graph for large dataset

2015-08-26 Thread Michael Malak
7, which should become available in the MEAP mid-September.  http://www.manning.com/books/spark-graphx-in-action From: Kristina Rogale Plazonic To: Jaonary Rabarisoa Cc: user Sent: Wednesday, August 26, 2015 7:24 AM Subject: Re: Build k-NN graph for large dataset If you don&#

Re: Build k-NN graph for large dataset

2015-08-26 Thread Kristina Rogale Plazonic
If you don't want to compute all N^2 similarities, you need to implement some kind of blocking first. For example, LSH (locally sensitive hashing). A quick search gave this link to a Spark implementation: http://stackoverflow.com/questions/2771/spark-implementation-for-locality-sensitive-hashi

Re: Build k-NN graph for large dataset

2015-08-26 Thread Robin East
You could try dimensionality reduction (PCA or SVD) first. I would imagine that even if you could successfully compute similarities in the high-dimensional space you would probably run into the curse of dimensionality. > On 26 Aug 2015, at 12:35, Jaonary Rabarisoa wrote: > > Dear all, > > I'm

Build k-NN graph for large dataset

2015-08-26 Thread Jaonary Rabarisoa
Dear all, I'm trying to find an efficient way to build a k-NN graph for a large dataset. Precisely, I have a large set of high dimensional vector (say d >>> 1) and I want to build a graph where those high dimensional points are the vertices and each one is linked to the k-nearest neighbor base