Re: [Scikit-learn-general] All-pairs-similarity calculation

Philipp Singer Sat, 27 Oct 2012 15:07:49 -0700

On 27.10.2012 23:43, Joseph Turian wrote:
> If you only care about near matches and not the full n^2 matrix:
>
> +1 to OG's suggestion to use pylucene.
>
> You can use pylucene to generate candidates, and then compute the
> exact tf*idf cosine distance on the shortlist.


Yes exactly. I would only need the most similar matches.

The problem with the Lucene solution is that I do not need tf-idf
weighting. I really need plain cosine similarity on the vectors I
already have.

So, for example, my matrix (one vector per row) looks like this:

[[1 2 5]
 [3 1 0]]

Now I want the cosine similarity between rows one and two or, more
generally, the row most similar to row one under plain cosine
similarity, without any further weighting. As already mentioned, I have
the data in sparse form.
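
For illustration, here is a minimal sketch of that computation on the
example matrix above, assuming SciPy's CSR format for the sparse data.
The helper names row_normalize and most_similar are made up for this
example and are not part of scikit-learn or SciPy. After scaling each
row to unit length, cosine similarity is just a sparse dot product.

```python
import numpy as np
from scipy import sparse

# The example matrix from the message, stored in sparse CSR form.
X = sparse.csr_matrix(np.array([[1, 2, 5],
                                [3, 1, 0]], dtype=float))


def row_normalize(X):
    """Scale every row to unit L2 norm; all-zero rows stay zero."""
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    norms[norms == 0.0] = 1.0  # avoid division by zero
    return (sparse.diags(1.0 / norms) @ X).tocsr()


def most_similar(X, query_row):
    """Return (index, cosine) of the row most similar to `query_row`."""
    Xn = row_normalize(X)
    # One sparse matrix-vector product gives the cosine to every row.
    sims = (Xn @ Xn[query_row].T).toarray().ravel()
    sims[query_row] = -np.inf  # exclude the query row itself
    best = int(np.argmax(sims))
    return best, float(sims[best])


best, sim = most_similar(X, 0)
print(best, round(sim, 4))  # prints: 1 0.2887
```

This still compares the query against every row, so repeating it for
all rows costs on the order of n^2 sparse dot products.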
>
> I assume this will be n log n.
>
> Another option for fast all-pairs is to use locality sensitive
> hashing. (I didn't read the papers or see if that's what they do.)
> It is not clear what the accuracy will be, but it will probably be the
> fastest.
Yeah, some kind of dimensionality reduction is another option, but it
would be very hard for me because I have already run all my previous
experiments on the complete representations. So if I could find a
faster solution for my problem on the full vectors, that would be
awesome.
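
One way the two ideas from the quoted message could be combined,
sketched under assumptions: random-hyperplane hashing is used only to
build a shortlist, and the exact cosine is then computed on the
complete, unreduced vectors. The function names, the number of tables
and the number of bits are all hypothetical choices for this example,
and the shortlist can miss true neighbours, so this is approximate in
recall even though the reported similarities are exact.

```python
import numpy as np
from scipy import sparse


def normalize_rows(X):
    """Scale every row to unit L2 norm; all-zero rows stay zero."""
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    norms[norms == 0.0] = 1.0
    return (sparse.diags(1.0 / norms) @ X).tocsr()


def build_tables(X, n_tables=4, n_bits=8, seed=0):
    """Hash every row by the signs of random projections.

    Rows separated by a small angle tend to get the same sign pattern,
    so they tend to land in the same bucket of at least one table.
    """
    rng = np.random.default_rng(seed)
    tables = []
    for _ in range(n_tables):
        planes = rng.standard_normal((X.shape[1], n_bits))
        bits = np.asarray(X @ planes) > 0
        keys = [row.tobytes() for row in np.packbits(bits, axis=1)]
        buckets = {}
        for i, key in enumerate(keys):
            buckets.setdefault(key, set()).add(i)
        tables.append((keys, buckets))
    return tables


def most_similar_in_shortlist(Xn, tables, query):
    """Exact cosine on the shortlist; returns (index, cosine) or None."""
    shortlist = set()
    for keys, buckets in tables:
        shortlist |= buckets[keys[query]]
    shortlist.discard(query)
    if not shortlist:
        return None  # no candidate shared a bucket with the query
    cand = np.array(sorted(shortlist))
    sims = (Xn[cand] @ Xn[query].T).toarray().ravel()
    best = int(np.argmax(sims))
    return int(cand[best]), float(sims[best])


# Toy data: row 2 points in the same direction as row 0.
X = sparse.csr_matrix(np.array([[1, 2, 5],
                                [3, 1, 0],
                                [2, 4, 10],
                                [0, 0, 7]], dtype=float))
Xn = normalize_rows(X)
tables = build_tables(X)
match = most_similar_in_shortlist(Xn, tables, 0)
```

More bits per table make the buckets smaller and the search faster but
lower the recall; more tables raise the recall again at the cost of
memory.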

Regards,
Philipp
>
> On Fri, Oct 26, 2012 at 3:31 PM, Philipp Singer <kill...@gmail.com> wrote:
>> On 26.10.2012 15:35, Olivier Grisel wrote:
>>> BTW, in the meantime you could encode your co-occurrences as text
>>> identifiers and use either Lucene/Solr in Java (via the sunburnt
>>> Python client) or Whoosh [1] in Python to do efficient sparse lookups
>>> in such a sparse matrix, so that you can quickly compute the non-zero
>>> cosine similarities between all pairs. Solr also has MoreLikeThis
>>> queries that can be used to truncate the search to the most similar
>>> samples in the set, in case you have some very frequent non-zero
>>> features that would mostly break the sparsity of the cosine
>>> similarity matrix. As Trey Grainger says in his talk "Building
>>> a real time, solr-powered recommendation engine" [2]: "A Lucene index
>>> is a multi-dimensional sparse matrix… with very fast and powerful
>>> lookup capabilities."
>>>
>>> [1] http://packages.python.org/Whoosh/quickstart.html
>>> [2] http://www.slideshare.net/treygrainger/building-a-real-time-solrpowered-recommendation-engine
>> Thanks, this looks promising. What exactly do you mean by encoding
>> co-occurrences as text identifiers? How would I handle my sparse
>> vectors then?
>>
>> I know the MoreLikeThis functionality, but does it compute exactly the
>> cosine similarity? The thing is that I need this relatedness measure
>> for my studies.
>>
>> Philipp
>>
>>
>> ------------------------------------------------------------------------------
>> WINDOWS 8 is here.
>> Millions of people.  Your app in 30 days.
>> Visit The Windows 8 Center at Sourceforge for all your go to resources.
>> http://windows8center.sourceforge.net/
>> join-generation-app-and-make-money-coding-fast/
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>


