On 26.10.2012 14:27, Olivier Grisel wrote:
> 2012/10/26 Philipp Singer <kill...@gmail.com>:
>> Hey there!
>>
>> Currently I am working with very large sparse vectors and have to
>> calculate the similarity between all pairs of them.
> How many features? Are they sparse? If so which sparsity level?

In detail: I have a large co-occurrence matrix with a shape of roughly 
3.7 million x 3.7 million. Yes, it is sparse. I can't tell you the exact 
sparsity level right now, but it should be very sparse, because in my 
case a single element only has co-occurrence counts with a small number 
of other elements.
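
For reference, checking this is cheap with scipy.sparse, so I can report 
the exact number later. A minimal sketch, using a random stand-in matrix 
since I don't have the real data at hand here:

import scipy.sparse as sp

# stand-in for the real co-occurrence matrix (random example data)
X = sp.rand(1000, 1000, density=0.001, format='csr')

# fraction of explicitly stored entries; 1 - density is the sparsity
density = X.nnz / float(X.shape[0] * X.shape[1])
print("density: %g, sparsity: %g" % (density, 1.0 - density))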

The "problem" is that I need cosine similarity in my case, so I also 
can't use the specific suitable implementations of distances available 
in numpy, scipy or scikit-learn, but I just pass over a callable 
function that does the job. (Currently, I am using a complete own 
implementation for this, because it is just impossible to calculate 
all-pairs-similarity for my large data at the moment)
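
To illustrate what I mean: a callable-free brute-force version would 
L2-normalize the rows and take sparse matrix products block-wise, so the 
(much denser) result never materializes in full. A minimal sketch with 
random stand-in data; the names and block size are purely illustrative:

import scipy.sparse as sp
from sklearn.preprocessing import normalize

# stand-in for the real co-occurrence matrix (random example data)
X = sp.rand(5000, 2000, density=0.001, format='csr')

# after L2-normalizing the rows, the dot product of two rows is
# exactly their cosine similarity
Xn = normalize(X, norm='l2', axis=1)

# compute the similarities block-wise so that only `block` rows of
# the result matrix are held in memory at any one time
block = 1000
for start in range(0, Xn.shape[0], block):
    S = Xn[start:start + block] * Xn.T  # sparse * sparse -> sparse block
    # ... threshold / store the interesting similarities here ...

Even block-wise this stays quadratic in the number of rows, which is why 
the pruning in the paper below [1] looks attractive.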
>
>> I have now looked into the available code in scikit-learn and also at
>> corresponding literature.
>> So I stumbled upon this paper [1] and the corresponding implementation [2].
>>
>> I was now wondering whether this could be a potential improvement /
>> help for scikit-learn when working with very large feature files where
>> it is still necessary to calculate the pairwise similarity of vectors
>> for various classifiers or other tasks. The goal would be to speed
>> this whole thing up.
>>
>> I am by no means an expert on this, but I just wanted to ask for
>> your opinion ;)
> Computing the sparse cosine similarity matrix of a large dataset (both
> in n_samples and n_features) is something that is really lacking in
> scikit-learn. I wanted to implement some tools to do this efficiently
> when working on my power iteration clustering pull request some time
> ago, but never found the time to do it.
>
> My idea was to use an in-memory inverted index structure, similar to a
> full-text indexer such as Lucene, but using integer feature indices
> rather than string feature names / tokens.
>
> Such a data structure would also be interesting for the
> sklearn.neighbors module, to do efficient k-nearest neighbors
> multiclass or multilabel classification on high-dimensional sparse
> data (which we don't address efficiently with the current BallTree
> data structure, which is optimal for fewer than ~100 dense features).
That would be awesome. I already had the impression that k-nearest 
neighbors is very slow for large data in scikit-learn; that is also the 
link to classification I made above, where this would be helpful too.
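
Just to check that I understand the inverted index idea correctly, here 
is a rough pure-Python sketch of how I picture it; all names are mine, 
and a real implementation would presumably be Cython over the CSR 
internals:

from collections import defaultdict

import scipy.sparse as sp
from sklearn.preprocessing import normalize

def build_inverted_index(X):
    """Posting lists keyed by integer feature index.

    Maps each feature j to the (row, value) pairs of the rows with a
    non-zero value for j -- like a Lucene term -> postings list, but
    with integer feature indices instead of string tokens.
    """
    index = defaultdict(list)
    for row in range(X.shape[0]):
        for k in range(X.indptr[row], X.indptr[row + 1]):
            index[X.indices[k]].append((row, X.data[k]))
    return index

def cosine_scores(index, q):
    """Dot products of a query row against every row sharing a feature.

    If all rows (and q) are L2-normalized, the accumulated scores are
    cosine similarities, and rows sharing no feature with q are never
    touched at all.
    """
    scores = defaultdict(float)
    for k in range(q.indptr[0], q.indptr[1]):
        for row, w in index[q.indices[k]]:
            scores[row] += q.data[k] * w
    return scores

# toy usage: L2-normalized random data, queried with its own row 0
X = normalize(sp.rand(100, 50, density=0.05, format='csr'))
index = build_inverted_index(X)
print(sorted(cosine_scores(index, X[0]).items())[:5])

Thresholding inside that inner loop is, I guess, where the pruning 
tricks from the Bayardo paper [1] would come in.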
>
>> [1] http://www.bayardo.org/ps/www2007.pdf
>> [2] http://code.google.com/p/google-all-pairs-similarity-search/
> Thanks for the links, added them to my reading list.
Perfect ;)

Regards,
Philipp