Thinking out loud here, we might be able to handle all of this in one approach. For instance, in Pig, when a UDF is algebraic, it can be marked with the Algebraic interface and Pig will then generate a combiner for it. For us, we could similarly mark our measures as algebraic (and require the appropriate implementation details) and do a map-side count with a combiner, and otherwise fall back on the current approach.
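Something like the following is roughly what I have in mind; the interface and method names below are made up just to sketch the idea, not existing code:

// Hypothetical marker interface, analogous to Pig's Algebraic: a measure that
// declares it can be computed from partial aggregates which are summed
// map-side and in a combiner. All names here are illustrative only.
public interface AlgebraicSimilarity {

  // contribution of a single cooccurring pair of entries (emitted map-side)
  double aggregate(double valueA, double valueB);

  // combiner/reducer step: merge two partial aggregates; must be associative
  double combine(double partial1, double partial2);

  // finalize from the fully summed aggregate plus the per-vector "weights"
  // (e.g. number of non-zero entries, or the squared norm of the vector)
  double similarity(double aggregate, double weightA, double weightB);
}

// Tanimoto as the class-1 example: the aggregate is just the cooccurrence
// count, combined by summing, finalized as |A n B| / (|A| + |B| - |A n B|).
class TanimotoCoefficient implements AlgebraicSimilarity {
  @Override
  public double aggregate(double valueA, double valueB) {
    return 1.0; // one cooccurrence observed
  }
  @Override
  public double combine(double partial1, double partial2) {
    return partial1 + partial2; // partial counts simply add
  }
  @Override
  public double similarity(double cooccurrences, double nonZeroA, double nonZeroB) {
    return cooccurrences / (nonZeroA + nonZeroB - cooccurrences);
  }
}

A measure that can't provide an associative combine() would simply keep going through the current path.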
Would that work and keep everyone happy?

On Jul 19, 2011, at 11:20 AM, Ted Dunning wrote:

> On Tue, Jul 19, 2011 at 12:24 AM, Sebastian Schelter <[email protected]> wrote:
>
>> Class 1 would be count-based similarity measures like the Tanimoto
>> coefficient or LLR that can easily be combined by summing the partial
>> counts.
>>
>> Class 2 would be measures that only need the cooccurrences between the
>> vectors, like Pearson correlation, Euclidean distance, or Cosine if the
>> vectors are normalized; it should be possible to find intelligent (yet a
>> bit hacky) ways to combine their intermediate data.
>>
>> Class 3 would be measures that are possibly user-supplied and need the
>> "weight" of the input vectors as well as all the cooccurrences.
>
> I think that with a bit of algebra the Euclidean and cosine cases can go
> into class 1.
>
> Probably Pearson as well.
>
>> I also remember that we once had someone on the list who used
>> RowSimilarityJob for precomputing the similarities between millions of
>> documents. Unfortunately I couldn't find the conversation yet. IIRC he
>> successfully applied a very aggressive sampling strategy.
>
> That could have been me. I didn't use RowSimilarityJob, but I used to
> handle 50 million users and 10-20 million documents using a similar
> approach (emit pairs and counts).

--------------------------
Grant Ingersoll
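To make the "bit of algebra" above concrete: cosine, Euclidean distance, and Pearson can each be finalized from a handful of per-pair sums (plus the per-vector norms that are already available as the "weights"), so they would fit the algebraic/combiner scheme sketched earlier. The class and field names below are only illustrative:

// Sketch: everything a combiner has to produce per vector pair is the set of
// partial sums below, and partial sums simply add. The per-vector norms can
// come from the existing weight pass, so they don't break the combiner.
public final class PartialStats {
  double count;   // number of cooccurring entries
  double sumX;    // sum of x_i over cooccurring entries
  double sumY;    // sum of y_i over cooccurring entries
  double sumXY;   // sum of x_i * y_i
  double sumXX;   // sum of x_i^2 over cooccurring entries
  double sumYY;   // sum of y_i^2 over cooccurring entries

  // combiner step: merge two partial aggregates by summing
  void merge(PartialStats other) {
    count += other.count;
    sumX  += other.sumX;
    sumY  += other.sumY;
    sumXY += other.sumXY;
    sumXX += other.sumXX;
    sumYY += other.sumYY;
  }

  // cosine = x.y / (||x|| * ||y||); with normalized vectors it is just sumXY
  double cosine(double normX, double normY) {
    return sumXY / (normX * normY);
  }

  // ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x.y, so the distance only needs
  // sumXY plus the precomputed squared norms
  double euclideanDistance(double normXSquared, double normYSquared) {
    return Math.sqrt(normXSquared + normYSquared - 2.0 * sumXY);
  }

  // Pearson over the cooccurring entries, entirely from the summed statistics
  double pearson() {
    double num = count * sumXY - sumX * sumY;
    double den = Math.sqrt(count * sumXX - sumX * sumX)
               * Math.sqrt(count * sumYY - sumY * sumY);
    return num / den;
  }
}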
