On 26.10.2012 15:35, Olivier Grisel wrote:
> BTW, in the meantime you could encode your cooccurrences as text
> identifiers and use either Lucene/Solr in Java (via the sunburnt Python
> client) or Whoosh [1] in Python as a way to do efficient sparse lookups
> in such a sparse matrix, to be able to quickly compute the non-zero
> cosine similarities between all pairs. Solr also has MoreLikeThis
> queries that can be used to truncate the search to the most
> similar samples in the set of samples, in case you have some very
> frequent non-zero features that would mostly break the sparsity of the
> cosine similarity matrix. As Trey Grainger says in his talk "Building
> a real time, solr-powered recommendation engine" [2]: "A Lucene index is a
> multi-dimensional sparse matrix… with very fast and powerful lookup
> capabilities."
>
> [1] http://packages.python.org/Whoosh/quickstart.html
> [2] http://www.slideshare.net/treygrainger/building-a-real-time-solrpowered-recommendation-engine
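
One possible reading of the suggestion quoted above, as a minimal sketch and not necessarily what Olivier had in mind: each cooccurrence pair becomes a single text token (the `"a|b"` encoding here is an assumption), and an inverted index from token to samples stands in for what Lucene or Whoosh would maintain. Only samples that share at least one token are ever compared, so only the non-zero cosine similarities are computed. The function name and the toy data are hypothetical; no Lucene, Solr, or Whoosh API is used.

```python
import math
from collections import defaultdict


def nonzero_cosine_similarities(docs):
    """docs maps a sample id to a dict {feature token: weight}.

    Returns {(id_a, id_b): cosine} only for pairs sharing a feature.
    """
    # Euclidean norm of every sparse vector.
    norms = {
        doc_id: math.sqrt(sum(w * w for w in feats.values()))
        for doc_id, feats in docs.items()
    }

    # Inverted index: feature token -> list of (sample id, weight).
    index = defaultdict(list)
    for doc_id, feats in docs.items():
        for token, w in feats.items():
            if w != 0:  # explicit zeros carry no information
                index[token].append((doc_id, w))

    # Accumulate dot products by walking each posting list once.
    dots = defaultdict(float)
    for postings in index.values():
        for i, (a, wa) in enumerate(postings):
            for b, wb in postings[i + 1:]:
                key = (a, b) if a < b else (b, a)
                dots[key] += wa * wb

    return {
        (a, b): dot / (norms[a] * norms[b]) for (a, b), dot in dots.items()
    }


# Toy data: cooccurrence pairs encoded as "word_a|word_b" tokens (assumed encoding).
docs = {
    "d1": {"cat|dog": 1.0, "cat|fish": 1.0},
    "d2": {"cat|dog": 1.0},
    "d3": {"sun|moon": 2.0},
}
sims = nonzero_cosine_similarities(docs)
```

A very frequent token produces a long posting list and therefore many pairs, which is the sparsity-breaking case the quoted message mentions; truncating to the top matches is one way to limit that cost.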

Thanks, this looks promising. What exactly do you mean by encoding cooccurrences as text identifiers? How would I handle my sparse vectors then? I know the MoreLikeThis functionality, but does it compute exactly the cosine similarity? The thing is that I need this relatedness measure for my studies.

Philipp

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general