Re: [Scikit-learn-general] All-pairs-similarity calculation

Olivier Grisel Fri, 26 Oct 2012 06:36:06 -0700

2012/10/26 Philipp Singer <kill...@gmail.com>:
> Am 26.10.2012 14:27, schrieb Olivier Grisel:
>> 2012/10/26 Philipp Singer <kill...@gmail.com>:
>>> Hey there!
>>>
>>> Currently I am working on very large sparse vectors and have to
>>> calculate similarity between all pairs of them.
>> How many features? Are they sparse? If so which sparsity level?
>
> In detail: I have a large co-occurrence matrix with a shape of around
> 3.7Mill x 3.7Mill. Yes, they are sparse, but I can't tell you the exacty
> sparsity level right now, but as it seems they should be very sparse
> because a single element does not have a co-occurrence count to a large
> number of other elements in my case.
>
> The "problem" is that I need cosine similarity in my case, so I also
> can't use the specific suitable implementations of distances available
> in numpy, scipy or scikit-learn, but I just pass over a callable
> function that does the job. (Currently, I am using a complete own
> implementation for this, because it is just impossible to calculate
> all-pairs-similarity for my large data at the moment)
>>
>>> I have now looked into the available code in scikit-learn and also at
>>> corresponding literature.
>>> So I stumbled upon this paper [1] and the corresponding implementation [2].
>>>
>>> I was now thinking, if this would be a potential improvement / help for
>>> scikit-learn for working with very large feature files where it is still
>>> necessary to calculate the pair-wise similarity of vectors for different
>>> classificators or other tasks. So the goal would be to speed this whole
>>> thing up.
>>>
>>> I am by far no expert in this thing, but just wanted to ask you guys
>>> about your opinion ;)
>> Computing the sparse cosine similarity matrix of a large (both
>> n_samples and n_features) is really lacking in scikit-learn and I
>> wanted to implement some tools to do this efficiently when working on
>> my power iteration clustering pool request some time ago but never
>> found the time to do it.
>>
>> My idea was to use an in-memory inverted index structure, similar to
>> fulltext indexer such as lucene but using integer feature indices
>> rather than string feature names / tokens.
>>
>> Such a data structure would also be interesting for the
>> sklearn.neighbors to do efficient k-nearest neighbors multiclass or
>> multilabel classification on high dimensional sparse data (which we
>> don't address efficiently with the current BallTree datastructure that
>> is optimal for less than 100 dense features).
> That would be awesome as I already had the impression that k-nearest
> neighbors works very slow for large data in scikit-learn and that was
> also the link to classification I made above for which this would be
> helpful to.


BTW, in the mean time you could encode your coocurrences as text
identifiers use either Lucene/Solr in Java using the sunburnt python
client or woosh [1] in python as a way to do efficient sparse lookups
in such a sparse matrix to be able to quickly compute the non zero
cosine similarities between all pairs. Solr also as MoreLikeThis
queries that can be used to truncate the search to the top most
similar samples in the set of samples in the case you have some very
frequent non zero features that would mostly break the sparsity of the
cosine similarity matrix.

As Trey Grainger says in his talk "Building a real time, solr-powered
recommendation engine": "A Lucene index is a multi-dimensional sparse
matrix… with very fast and powerful lookup capabilities."

[1] http://packages.python.org/Whoosh/quickstart.html
[2] 
http://www.slideshare.net/treygrainger/building-a-real-time-solrpowered-recommendation-engine

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

------------------------------------------------------------------------------
Everyone hates slow websites. So do we.
Make your web apps faster with AppDynamics
Download AppDynamics Lite for free today:
http://p.sf.net/sfu/appdyn_sfd2d_oct
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Re: [Scikit-learn-general] All-pairs-similarity calculation

Reply via email to