2012/10/26 Philipp Singer <kill...@gmail.com>:
> Hey there!
> Currently I am working on very large sparse vectors and have to
> calculate similarity between all pairs of them.

How many features? Are they sparse? If so, what is the sparsity level?

> I have now looked into the available code in scikit-learn and also at
> corresponding literature.
> So I stumbled upon this paper [1] and the corresponding implementation [2].
> I was now thinking, if this would be a potential improvement / help for
> scikit-learn for working with very large feature files where it is still
> necessary to calculate the pair-wise similarity of vectors for different
> classificators or other tasks. So the goal would be to speed this whole
> thing up.
> I am by far no expert in this thing, but just wanted to ask you guys
> about your opinion ;)

Efficient computation of the sparse cosine similarity matrix of a
large dataset (large in both n_samples and n_features) is really
lacking in scikit-learn. I wanted to implement some tools for this
when working on my power iteration clustering pull request some time
ago, but never found the time to do it.

My idea was to use an in-memory inverted index structure, similar to
full-text indexers such as Lucene, but using integer feature indices
rather than string feature names / tokens.
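
To make the inverted index idea a bit more concrete, here is a rough
pure-Python sketch. It is not existing scikit-learn code and not the
pruned algorithm from [1]. The function names, the dict-per-sample
input format and the threshold parameter are all made up for
illustration. Each integer feature index maps to a posting list of
(sample id, L2-normalized value), and only samples that share at least
one feature ever get compared.

```python
from collections import defaultdict
from math import sqrt


def build_inverted_index(rows):
    """Map each integer feature index to a posting list.

    rows is a list of dicts {feature_index: value} (hypothetical input
    format). Values are L2-normalized per sample so that dot products
    between samples are cosine similarities.
    """
    index = defaultdict(list)
    for row_id, row in enumerate(rows):
        norm = sqrt(sum(v * v for v in row.values()))
        if norm == 0.0:
            continue  # an all-zero sample has no postings
        for feature, value in row.items():
            index[feature].append((row_id, value / norm))
    return index


def all_pairs_cosine(rows, threshold=0.0):
    """Return {(i, j): cosine} for i < j, keeping scores >= threshold."""
    index = build_inverted_index(rows)
    scores = defaultdict(float)
    # Only samples that appear in the same posting list can have a
    # non-zero dot product, so pairs without shared features are skipped.
    for postings in index.values():
        for a in range(len(postings)):
            i, vi = postings[a]
            for b in range(a + 1, len(postings)):
                j, vj = postings[b]
                scores[(i, j)] += vi * vj
    return {pair: s for pair, s in scores.items() if s >= threshold}


if __name__ == "__main__":
    rows = [{0: 1.0, 3: 2.0}, {0: 2.0, 3: 4.0}, {5: 1.0}, {3: 1.0}]
    for pair, score in sorted(all_pairs_cosine(rows, threshold=0.5).items()):
        print(pair, round(score, 3))
```

The cost is driven by the length of the posting lists, so a few very
frequent features can still dominate the runtime. As far as I
understand, that is the kind of case the pruning in [1] is meant to
address.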

Such a data structure would also be interesting for sklearn.neighbors,
to do efficient k-nearest neighbors multiclass or multilabel
classification on high dimensional sparse data (which we don't
address efficiently with the current BallTree data structure, which
is optimal for fewer than 100 dense features).
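
For the k-nearest neighbors use case, a minimal sketch could look like
the following. The InvertedIndexKNN class and its methods are
hypothetical, nothing like this exists in sklearn.neighbors, and it
only covers the multiclass case with a plain majority vote. A query
only touches the posting lists of its own non-zero features, so
training samples that share no feature with it are never scored.

```python
from collections import Counter, defaultdict
from math import sqrt


class InvertedIndexKNN:
    """Hypothetical cosine k-NN classifier backed by an inverted index."""

    def fit(self, rows, labels):
        # rows: list of dicts {feature_index: value}; labels: one per row.
        self.labels = list(labels)
        self.index = defaultdict(list)
        for row_id, row in enumerate(rows):
            norm = sqrt(sum(v * v for v in row.values())) or 1.0
            for feature, value in row.items():
                self.index[feature].append((row_id, value / norm))
        return self

    def kneighbors(self, query, k=3):
        """Return up to k (row_id, cosine) pairs, most similar first."""
        qnorm = sqrt(sum(v * v for v in query.values())) or 1.0
        scores = defaultdict(float)
        for feature, value in query.items():
            # Walk only the posting lists of the query's own features.
            for row_id, weight in self.index.get(feature, ()):
                scores[row_id] += (value / qnorm) * weight
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[:k]

    def predict(self, query, k=3):
        """Majority vote among the neighbors; None if nothing overlaps."""
        votes = Counter(self.labels[r] for r, _ in self.kneighbors(query, k))
        return votes.most_common(1)[0][0] if votes else None


if __name__ == "__main__":
    rows = [{0: 1.0, 1: 1.0}, {0: 1.0}, {5: 1.0, 6: 1.0}, {5: 2.0}]
    knn = InvertedIndexKNN().fit(rows, ["a", "a", "b", "b"])
    print(knn.predict({0: 3.0}, k=2))
    print(knn.predict({5: 1.0, 6: 0.5}, k=2))
```

A multilabel variant would need a different voting step, and a real
implementation would probably store the posting lists in arrays rather
than Python lists.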

> [1] http://www.bayardo.org/ps/www2007.pdf
> [2] http://code.google.com/p/google-all-pairs-similarity-search/

Thanks for the links, added them to my reading list.

http://twitter.com/ogrisel - http://github.com/ogrisel

Scikit-learn-general mailing list
