2012/10/26 Philipp Singer <kill...@gmail.com>:
> Hey there!
>
> Currently I am working on very large sparse vectors and have to
> calculate similarity between all pairs of them.

How many features? Are they sparse? If so, what sparsity level?

> I have now looked into the available code in scikit-learn and also at
> corresponding literature.
> So I stumbled upon this paper [1] and the corresponding implementation [2].
>
> I was now thinking, if this would be a potential improvement / help for
> scikit-learn for working with very large feature files where it is still
> necessary to calculate the pair-wise similarity of vectors for different
> classifiers or other tasks. So the goal would be to speed this whole
> thing up.
>
> I am by far no expert in this thing, but just wanted to ask you guys
> about your opinion ;)

An efficient way to compute the sparse cosine similarity matrix of a
large dataset (large in both n_samples and n_features) is really lacking
in scikit-learn. I wanted to implement some tools for this while working
on my power iteration clustering pull request some time ago, but never
found the time to do it.

My idea was to use an in-memory inverted index structure, similar to
full-text indexers such as Lucene, but using integer feature indices
rather than string feature names / tokens.

Such a data structure would also be interesting for sklearn.neighbors,
to do efficient k-nearest neighbors multiclass or multilabel
classification on high dimensional sparse data (which we don't address
efficiently with the current BallTree data structure, which is optimal
for fewer than 100 dense features).

> [1] http://www.bayardo.org/ps/www2007.pdf
> [2] http://code.google.com/p/google-all-pairs-similarity-search/

Thanks for the links, added them to my reading list.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
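As an illustration of the inverted index idea described in the reply, here is a minimal pure-Python sketch. It is not code from scikit-learn and it is not the pruning algorithm of the paper in [1]. It only shows the basic principle: posting lists keyed by integer feature index, so that only rows sharing at least one feature are ever compared. The function names `build_inverted_index` and `all_pairs_cosine`, the dict-based row format and the `threshold` parameter are all assumptions made for this example.

```python
import math
from collections import defaultdict


def build_inverted_index(rows):
    """Map each integer feature index to a list of (row_id, weight) postings.

    `rows` is a list of dicts {feature_index: value} (hypothetical format).
    Weights are L2-normalized so that dot products are cosine similarities.
    """
    index = defaultdict(list)
    for row_id, row in enumerate(rows):
        norm = math.sqrt(sum(v * v for v in row.values()))
        if norm == 0.0:
            continue  # an all-zero row has no defined cosine similarity
        for feature, value in row.items():
            index[feature].append((row_id, value / norm))
    return index


def all_pairs_cosine(rows, threshold=0.0):
    """Return {(i, j): similarity} for i < j with similarity > threshold."""
    index = build_inverted_index(rows)
    scores = defaultdict(float)
    for postings in index.values():
        # Postings are stored in increasing row_id order, so i < j below.
        for a in range(len(postings)):
            i, wi = postings[a]
            for b in range(a + 1, len(postings)):
                j, wj = postings[b]
                scores[(i, j)] += wi * wj
    return {pair: s for pair, s in scores.items() if s > threshold}


if __name__ == "__main__":
    # Four sparse rows; row 2 shares no feature with any other row.
    rows = [{0: 1.0, 5: 1.0}, {0: 1.0, 5: 1.0}, {7: 2.0}, {5: 3.0}]
    for pair, sim in sorted(all_pairs_cosine(rows).items()):
        print(pair, round(sim, 4))
```

The cost is driven by the length of the posting lists rather than by n_samples squared, so very frequent features dominate the running time. A real implementation would presumably need pruning or a cap on posting list length to handle them.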
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general