Thinking out loud here, we might be able to handle all of this in one 
approach.  For instance, in Pig, when a UDF is algebraic it can be marked with 
the Algebraic interface and Pig will then generate a Combiner.  For us, we 
could similarly mark our measures as algebraic (and require the appropriate 
implementation details) and do a map-side count plus a Combiner, otherwise 
falling back on the current approach.
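
Roughly, I'm picturing something like the sketch below (the names are invented 
for illustration, not existing Mahout or Pig classes): the measure exposes a 
combinable partial aggregate, the map side emits partials, a Combiner merges 
them, and the reducer finishes the computation.  Tanimoto shows how the 
count-based case falls out of it.

public interface AlgebraicSimilarity<P> {

  /** Map side: build a partial aggregate from one slice of the co-occurring entries. */
  P initial(double[] sliceA, double[] sliceB);

  /** Combiner (and reducer): merge two partials; must be associative and commutative. */
  P combine(P left, P right);

  /** Reducer: turn the fully merged partial into the final similarity value. */
  double terminate(P merged);
}

/** Example: Tanimoto over binary vectors reduces to three counts that simply add up. */
class TanimotoSimilarity implements AlgebraicSimilarity<long[]> {

  @Override
  public long[] initial(double[] sliceA, double[] sliceB) {
    long nonZeroA = 0, nonZeroB = 0, both = 0;
    for (int i = 0; i < sliceA.length; i++) {
      boolean a = sliceA[i] != 0.0;
      boolean b = sliceB[i] != 0.0;
      if (a) nonZeroA++;
      if (b) nonZeroB++;
      if (a && b) both++;
    }
    return new long[] {nonZeroA, nonZeroB, both};
  }

  @Override
  public long[] combine(long[] left, long[] right) {
    return new long[] {left[0] + right[0], left[1] + right[1], left[2] + right[2]};
  }

  @Override
  public double terminate(long[] merged) {
    // |A intersect B| / |A union B| = both / (nonZeroA + nonZeroB - both)
    return (double) merged[2] / (merged[0] + merged[1] - merged[2]);
  }
}

A measure that can't provide an associative combine() just wouldn't implement 
the interface, and we'd fall back on the current approach for it.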

Would that work and keep everyone happy?

On Jul 19, 2011, at 11:20 AM, Ted Dunning wrote:

> On Tue, Jul 19, 2011 at 12:24 AM, Sebastian Schelter <[email protected]> wrote:
> 
>> Class 1 would be count-based similarity measures like the Tanimoto coefficient
>> or LLR, which can easily be combined by summing the partial counts.
>> 
>> Class 2 would be measures that only need the cooccurrences between the
>> vectors, like Pearson correlation, Euclidean distance, or cosine if the
>> vectors are normalized; it should be possible to find intelligent (yet a bit
>> hacky) ways to combine their intermediate data.
>> 
>> Class 3 would be measures that are possibly user-supplied and need the
>> "weight" of the input vectors as well as all the cooccurrences.
>> 
> 
> I think that with a bit of algebra the Euclidean and cosine cases can
> go into class 1.
> 
> Probably Pearson as well.
> 
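
(Presumably along these lines: the Euclidean and cosine cases only need running
sums that combine by plain addition of the partials, just like the class 1 counts:

  ||x - y||^2 = sum(x_i^2) + sum(y_i^2) - 2 * sum(x_i * y_i)
  cos(x, y)   = sum(x_i * y_i) / (sqrt(sum(x_i^2)) * sqrt(sum(y_i^2)))

Pearson additionally needs sum(x_i), sum(y_i) and the count n, which also just
add up.)
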
>> I also remember that we once had someone on the list who used
>> RowSimilarityJob for precomputing the similarities between millions of
>> documents. Unfortunately I haven't found the conversation yet. IIRC he
>> successfully applied a very aggressive sampling strategy.
>> 
> 
> That could have been me.  I didn't use RowSimilarityJob, but I used to
> handle 50 million users and 10-20 million documents using a similar approach
> (emit pairs and counts).

--------------------------
Grant Ingersoll


