On Tue, Jul 19, 2011 at 12:24 AM, Sebastian Schelter <[email protected]> wrote:
> Class 1 would be count based similarity measures like Tanimoto-coefficient
> or LLR that can be easily combined by summing the partial counts.
>
> Class 2 would be measures that only need the cooccurrences between the
> vectors like Pearson-Correlation or Euclidean distance or Cosine if the
> vectors are normalized, it should be possible to find intelligent (yet a bit
> hacky) ways to combine their intermediate data.
>
> Class 3 would be measures that are possibly user-supplied and need the
> "weight" of the input vectors as well as all the cooccurrences.
>

I think that, with a bit of algebra, the Euclidean and cosine cases can go
into class 1. Probably Pearson as well.

> I also remember that we once had someone on the list that used
> RowSimilarityJob for precomputing the similarities between millions of
> documents. Unfortunately I couldn't find the conversation yet. IIRC he
> successfully applied a very aggressive sampling strategy.
>

That could have been me. I didn't use RowSimilarityJob, but I used to handle
50 million users and 10-20 million documents using a similar approach (emit
pairs and counts).
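
On the algebra for the Euclidean and cosine cases, here is a rough sketch of
the idea. It is an illustration only, not the RowSimilarityJob code: the
function names and the dict-based sparse columns are made up for the example.
The point is that once each row's squared norm is known, both measures depend
on the pair only through the dot product, and the dot product is a plain sum
of per-column partial products, so it combines the same way partial counts do.
Pearson would presumably need per-row sums as well, which is not shown here.

```python
import math
from collections import defaultdict

# Hypothetical layout: each column is a dict {row_id: value}, i.e. the
# transposed sparse matrix that a pair-emitting mapper would iterate over.

def partial_products(column):
    """Yield ((i, j), x_i * x_j) for every cooccurring pair i < j in one column."""
    ids = sorted(column)
    for pos, i in enumerate(ids):
        for j in ids[pos + 1:]:
            yield (i, j), column[i] * column[j]

def combine(columns):
    """Sum the partial products per pair, as a combiner or reducer would."""
    dots = defaultdict(float)
    for column in columns:
        for pair, product in partial_products(column):
            dots[pair] += product
    return dots

def squared_norms(columns):
    """Per-row squared norms; these are also plain sums over columns."""
    norms = defaultdict(float)
    for column in columns:
        for i, x in column.items():
            norms[i] += x * x
    return norms

def cosine(dot, sq_norm_a, sq_norm_b):
    return dot / math.sqrt(sq_norm_a * sq_norm_b)

def euclidean(dot, sq_norm_a, sq_norm_b):
    # |a - b|^2 = |a|^2 + |b|^2 - 2 a.b ; clamp tiny negative rounding errors.
    return math.sqrt(max(sq_norm_a + sq_norm_b - 2.0 * dot, 0.0))

if __name__ == "__main__":
    # Rows a = (1, 2, 0) and b = (2, 0, 3), stored column by column.
    columns = [{"a": 1.0, "b": 2.0}, {"a": 2.0}, {"b": 3.0}]
    dots = combine(columns)
    norms = squared_norms(columns)
    d = dots[("a", "b")]
    print(cosine(d, norms["a"], norms["b"]))
    print(euclidean(d, norms["a"], norms["b"]))
```

Pairs that never cooccur are never emitted, so their dot product is
implicitly zero; the Euclidean distance for such a pair still follows from
the two squared norms alone.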
