If you only care about near matches and not the full n^2 matrix: +1 to OG's suggestion to use pylucene.
You can use pylucene to generate candidates, and then compute the exact
tf*idf cosine distance on the shortlist. I assume this will be roughly
n log n.

Another option for fast all-pairs is locality-sensitive hashing. (I haven't
read the papers or checked whether that's what they do.) It's not clear what
the accuracy will be, but it will probably be the fastest option.

On Fri, Oct 26, 2012 at 3:31 PM, Philipp Singer <kill...@gmail.com> wrote:
> On 26.10.2012 at 15:35, Olivier Grisel wrote:
>> BTW, in the meantime you could encode your co-occurrences as text
>> identifiers and use either Lucene/Solr in Java using the sunburnt Python
>> client or Whoosh [1] in Python as a way to do efficient sparse lookups
>> in such a sparse matrix to be able to quickly compute the non-zero
>> cosine similarities between all pairs. Solr also has MoreLikeThis
>> queries that can be used to truncate the search to the top most
>> similar samples in the set of samples in the case you have some very
>> frequent non-zero features that would mostly break the sparsity of the
>> cosine similarity matrix. As Trey Grainger says in his talk "Building
>> a real time, solr-powered recommendation engine": "A Lucene index is a
>> multi-dimensional sparse matrix… with very fast and powerful lookup
>> capabilities."
>>
>> [1] http://packages.python.org/Whoosh/quickstart.html
>> [2] http://www.slideshare.net/treygrainger/building-a-real-time-solrpowered-recommendation-engine
>
> Thanks, this looks promising. What exactly do you mean by encoding
> co-occurrences as text identifiers? How would I handle my sparse vectors
> then?
>
> I know the MoreLikeThis functionality, but does it compute exactly cosine
> similarity? The thing is that I need this relatedness measure for my
> studies.
>
> Philipp
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

-- 
Joseph Turian, Ph.D. | President, MetaOptimize
"Optimize Profits. Optimize Engagement."
http://metaoptimize.com
855-ALL-DATA
The web's most active forum for data scientists: http://metaoptimize.com/qa/
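
A rough sketch of the shortlist-then-rescore idea from the reply above, in
plain Python. This is not pylucene: a toy in-memory inverted index stands in
for Lucene, and the function names (tfidf_vectors, build_index, near_matches),
the shared-term candidate score, and the shortlist_size / top_k parameters are
all assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Return one unit-length {term: weight} dict per tokenised document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Plain tf * log(n / df); terms present in every document get weight 0.
        vec = {t: c * math.log(n / df[t]) for t, c in tf.items() if df[t] < n}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def build_index(vectors):
    """Toy inverted index (stand-in for Lucene): term -> list of doc ids."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(vectors):
        for term in vec:
            index[term].append(doc_id)
    return index


def near_matches(query_id, vectors, index, shortlist_size=10, top_k=3):
    """Return up to top_k (doc_id, cosine) pairs for the given document."""
    query = vectors[query_id]
    # Stage 1: cheap candidate generation by counting shared terms.
    overlap = Counter()
    for term in query:
        for doc_id in index.get(term, ()):
            if doc_id != query_id:
                overlap[doc_id] += 1
    shortlist = [doc_id for doc_id, _ in overlap.most_common(shortlist_size)]
    # Stage 2: exact cosine on the shortlist only. The vectors are unit
    # length, so the sparse dot product is the cosine similarity.
    scored = [
        (sum(w * vectors[d].get(t, 0.0) for t, w in query.items()), d)
        for d in shortlist
    ]
    scored.sort(reverse=True)
    return [(d, s) for s, d in scored[:top_k]]


docs = [
    "sparse cosine similarity matrix".split(),
    "cosine similarity of sparse vectors".split(),
    "lucene index lookup".split(),
    "solr lucene index".split(),
]
vectors = tfidf_vectors(docs)
index = build_index(vectors)
print(near_matches(0, vectors, index))
```

Documents that share no term with the query never enter the shortlist, so
their (zero) cosine is never computed. With a real Lucene index the first
stage would be a query against the index rather than a Python loop, and the
candidate ranking would be Lucene's own score instead of a shared-term count.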