Re: RowSimilarity ?'s

Ted Dunning Tue, 19 Jul 2011 08:21:27 -0700

On Tue, Jul 19, 2011 at 12:24 AM, Sebastian Schelter <[email protected]> wrote:


> Class 1 would be count based similarity measures like Tanimoto-coefficient
> or LLR that can be easily combined by summing the partial counts.
>
> Class 2 would be measures that only need the cooccurrences between the
> vectors like Pearson-Correlation or Euclidean distance or Cosine if the
> vectors are normalized, it should be possible to find intelligent (yet a bit
> hacky) ways to combine their intermediate data.
>
> Class 3 would be measures that are possibly user-supplied and need the
> "weight" of the input vectors as well as all the cooccurrences.
>

I think that with a bit of algebra the Euclidean and cosine cases can
go into class 1.

Probably Pearson as well.
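
A rough sketch of the algebra in question, in Python rather than as a Mahout job. It assumes each partial holds plain sums over a slice of the dimensions (count, sum of a, sum of b, sum of a^2, sum of b^2, sum of a*b), so partials combine by addition just as the class 1 counts do. The names Partial, partial_from, euclidean, cosine and pearson are made up for illustration and are not part of RowSimilarityJob. Pearson here is taken over all n dimensions, which may differ from a variant restricted to co-occurring entries.

```python
import math
from dataclasses import dataclass


@dataclass
class Partial:
    """Summable statistics for one pair of vectors (hypothetical helper)."""
    n: int = 0
    sum_a: float = 0.0
    sum_b: float = 0.0
    sum_aa: float = 0.0
    sum_bb: float = 0.0
    sum_ab: float = 0.0

    def __add__(self, other):
        # Combining two partials is plain element-wise addition.
        return Partial(
            self.n + other.n,
            self.sum_a + other.sum_a,
            self.sum_b + other.sum_b,
            self.sum_aa + other.sum_aa,
            self.sum_bb + other.sum_bb,
            self.sum_ab + other.sum_ab,
        )


def partial_from(a, b):
    """Build the partial sums for one slice of the two vectors."""
    return Partial(
        len(a),
        sum(a),
        sum(b),
        sum(x * x for x in a),
        sum(y * y for y in b),
        sum(x * y for x, y in zip(a, b)),
    )


def euclidean(p):
    # |a - b|^2 = sum(a^2) + sum(b^2) - 2 * sum(a*b)
    return math.sqrt(max(p.sum_aa + p.sum_bb - 2.0 * p.sum_ab, 0.0))


def cosine(p):
    # cos(a, b) = sum(a*b) / (|a| * |b|)
    return p.sum_ab / math.sqrt(p.sum_aa * p.sum_bb)


def pearson(p):
    # Correlation written purely in terms of the summed statistics.
    cov = p.n * p.sum_ab - p.sum_a * p.sum_b
    var_a = p.n * p.sum_aa - p.sum_a ** 2
    var_b = p.n * p.sum_bb - p.sum_b ** 2
    return cov / math.sqrt(var_a * var_b)


a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 9.0]
# Two slices processed separately, then merged by addition.
combined = partial_from(a[:2], b[:2]) + partial_from(a[2:], b[2:])
```

The point of the sketch is only that none of the three measures needs anything beyond sums that can be accumulated independently and added together at the end.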

> I also remember that we once had someone on the list that used
> RowSimilarityJob for precomputing the similarities between millions of
> documents. Unfortunately I couldn't find the conversation yet. IIRC he
> successfully applied a very aggressive sampling strategy.
>

That could have been me.  I didn't use RowSimilarityJob, but I used to
handle 50 million users and 10-20 million documents using a similar approach
(emit pairs and counts).
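
For illustration, a minimal single-machine sketch of the "emit pairs and counts" idea. The details are assumptions, not a description of the system mentioned above: each user history is reduced to a set of documents, every pair of documents in a history is emitted once, and an optional per-user cap (max_items_per_user, a hypothetical parameter) stands in for whatever sampling strategy was actually used. The function names are made up as well.

```python
import itertools
import random
from collections import Counter


def cooccurrence_counts(histories, max_items_per_user=None, seed=0):
    """Count document pairs and single documents across user histories."""
    rng = random.Random(seed)
    pair_counts = Counter()
    item_counts = Counter()
    for items in histories:
        items = sorted(set(items))
        # Assumed sampling rule: cap very long histories so the number of
        # emitted pairs per user stays bounded.
        if max_items_per_user is not None and len(items) > max_items_per_user:
            items = sorted(rng.sample(items, max_items_per_user))
        item_counts.update(items)
        # "Emit" every unordered pair once; the Counter plays the reducer.
        pair_counts.update(itertools.combinations(items, 2))
    return pair_counts, item_counts


def tanimoto(pair_counts, item_counts, x, y):
    """Tanimoto coefficient computed from the summed counts only."""
    key = (x, y) if x <= y else (y, x)
    both = pair_counts[key]
    return both / (item_counts[x] + item_counts[y] - both)


histories = [["a", "b", "c"], ["a", "b"], ["b", "c"]]
pairs, singles = cooccurrence_counts(histories)
sampled_pairs, _ = cooccurrence_counts(histories, max_items_per_user=2)
```

In a MapReduce setting the loop body would be the mapper and the counters would be the reducer; the cap matters because a history of k documents emits k*(k-1)/2 pairs.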
