2012/10/26 Philipp Singer <kill...@gmail.com>:
> On 26.10.2012 15:35, Olivier Grisel wrote:
>> BTW, in the meantime you could encode your co-occurrences as text
>> identifiers and use either Lucene/Solr in Java (through the sunburnt
>> Python client) or Whoosh [1] in Python to do efficient sparse lookups
>> in such a sparse matrix, so that you can quickly compute the non-zero
>> cosine similarities between all pairs. Solr also has MoreLikeThis
>> queries that can be used to truncate the search to the most similar
>> samples in the set, in case you have some very frequent non-zero
>> features that would mostly break the sparsity of the cosine
>> similarity matrix. As Trey Grainger says in his talk "Building a real
>> time, solr-powered recommendation engine" [2]: "A Lucene index is a
>> multi-dimensional sparse matrix… with very fast and powerful lookup
>> capabilities."
>>
>> [1] http://packages.python.org/Whoosh/quickstart.html
>> [2] http://www.slideshare.net/treygrainger/building-a-real-time-solrpowered-recommendation-engine
>
> Thanks, this looks promising. What exactly do you mean by encoding
> co-occurrences as text identifiers? How would I handle my sparse vectors
> then?

It's just that the Solr API takes text documents as input rather than
precomputed (integer feature index, float feature value) pairs:
unfortunately you cannot bypass the text feature extraction layer of
Solr (the analyzers). So each non-zero feature of a sparse vector has to
be written out as a text token in a pseudo-document.
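
To make the encoding idea concrete, here is a minimal sketch. The helper
name `encode_as_text` and the `f<index>` token scheme are made up for
illustration and are not part of Solr, sunburnt or Whoosh. It assumes
integer counts and an analyzer that simply splits on whitespace; float
feature values cannot be represented this way.

```python
def encode_as_text(sparse_row, prefix="f"):
    """Turn {feature_index: integer_count} into a pseudo-document.

    Each feature index becomes a token such as "f17", repeated `count`
    times so that the indexer's term frequency matches the count.
    (Hypothetical helper; the token scheme is an assumption.)
    """
    tokens = []
    for index in sorted(sparse_row):
        count = int(sparse_row[index])
        tokens.extend([f"{prefix}{index}"] * count)
    return " ".join(tokens)


# One sparse vector with feature 3 seen twice and feature 17 seen once.
doc = encode_as_text({3: 2, 17: 1})
print(doc)  # f3 f3 f17
```

The resulting string is what you would index as the text field of one
document per sample.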

> I know the MoreLikeThis functionality, but does it compute exactly the
> cosine similarity? The thing is that I need this relatedness measure
> for my studies.

No, it's a truncated approximation (a lower bound), but it keeps many
zeros in your similarity matrix in case you have terms that occur in
every single document.
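
To illustrate why truncation gives a lower bound and a sparser result,
here is a small pure-Python sketch. It is not how MoreLikeThis is
implemented: it just drops features that occur in more than a chosen
fraction of the documents (`max_doc_fraction` is a made-up parameter)
and assumes non-negative feature weights. Norms are computed on the
full vectors, so skipping terms can only lower the similarity.

```python
import math
from collections import defaultdict


def cosine_pairs(rows, max_doc_fraction=1.0):
    """Cosine similarity for all pairs (i, j), i < j, sharing a feature.

    rows: list of {feature: non-negative weight} dicts.
    Features present in more than max_doc_fraction * len(rows) rows are
    skipped (an illustrative truncation rule, not Solr's).
    """
    norms = [math.sqrt(sum(v * v for v in row.values())) for row in rows]

    # Inverted index: feature -> [(row index, weight), ...]
    postings = defaultdict(list)
    for i, row in enumerate(rows):
        for feature, value in row.items():
            if value:  # ignore explicit zeros
                postings[feature].append((i, value))

    limit = max_doc_fraction * len(rows)
    dots = defaultdict(float)
    for plist in postings.values():
        if len(plist) > limit:
            continue  # very frequent feature: skip to preserve sparsity
        for a in range(len(plist)):
            i, vi = plist[a]
            for b in range(a + 1, len(plist)):
                j, vj = plist[b]
                dots[(i, j)] += vi * vj

    return {(i, j): dot / (norms[i] * norms[j])
            for (i, j), dot in dots.items()}


rows = [
    {"a": 1, "common": 1},
    {"a": 1, "common": 1},
    {"b": 1, "common": 1},
]
exact = cosine_pairs(rows)
truncated = cosine_pairs(rows, max_doc_fraction=0.9)
print(len(exact), len(truncated))
```

In this toy data the feature "common" occurs in every row, so the exact
matrix has a non-zero entry for every pair, while the truncated version
only keeps the pair that shares the rarer feature "a".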

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
