[jira] [Commented] (MAHOUT-767) Improve RowSimilarityJob performance for count-based distance measures

Sebastian Schelter (JIRA) Wed, 20 Jul 2011 04:46:30 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068295#comment-13068295
 ]


Sebastian Schelter commented on MAHOUT-767:
-------------------------------------------

I suggest we create a specialized implementation that uses the "stripes" 
pattern from [1]. Since we are generalizing the approach from that paper, we 
would need to emit a pair of vectors for each entry: the first holding the 
partially summed dot products/counts, the second holding the norms. A combiner 
should be able to merge these vectors easily.
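
To make the idea concrete, here is a rough sketch in plain Python rather than 
Hadoop code. Everything in it is an assumption made for illustration: the input 
format (one dict of row id -> value per column), the function names, and in 
particular the reading that the second vector accumulates squared norms. It is 
not the actual RowSimilarityJob code.

```python
from collections import defaultdict


def map_column(column):
    """Emit one (row_id, (dots, norms)) record per non-zero entry of a column.

    column: dict mapping row id -> value (hypothetical input format).
    dots:   stripe of partial dot products with the co-occurring rows j > i.
    norms:  partial squared norm of row i (an assumed reading of "the norms").
    """
    for i, vi in column.items():
        dots = {j: vi * vj for j, vj in column.items() if j > i}
        norms = {i: vi * vi}
        yield i, (dots, norms)


def merge(a, b):
    """Combiner/reducer step: element-wise sum of both stripes."""
    dots = defaultdict(float, a[0])
    for j, v in b[0].items():
        dots[j] += v
    norms = defaultdict(float, a[1])
    for j, v in b[1].items():
        norms[j] += v
    return dict(dots), dict(norms)


def run(columns):
    """Simulate map + combine/reduce in memory, grouping records by row id."""
    merged = {}
    for column in columns:
        for key, stripes in map_column(column):
            merged[key] = merge(merged[key], stripes) if key in merged else stripes
    return merged
```

Because merging is just an element-wise sum, the same function can serve as 
combiner and reducer, which is what makes the stripes cheap to pre-aggregate 
on the map side.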

With this approach, we should be able to cover all existing measures, such as 
cooccurrence count, LLR, Tanimoto, cosine, Euclidean distance, Manhattan 
distance, and maybe even Pearson if someone figures out the math :)
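
As a sanity check on why dot products plus norms are enough for several of 
these, here is a small Python sketch. The function names are made up for 
illustration, and the inputs are assumed to be the fully summed dot product of 
two rows and their squared Euclidean norms. LLR, Manhattan and Pearson are 
left out on purpose.

```python
import math


def cosine(dot, norm_i_sq, norm_j_sq):
    """Cosine similarity from a dot product and two squared norms."""
    return dot / math.sqrt(norm_i_sq * norm_j_sq)


def tanimoto(dot, norm_i_sq, norm_j_sq):
    """Tanimoto coefficient; for 0/1 rows this is |A and B| / |A or B|."""
    return dot / (norm_i_sq + norm_j_sq - dot)


def euclidean_distance(dot, norm_i_sq, norm_j_sq):
    """Uses ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a . b)."""
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(norm_i_sq + norm_j_sq - 2.0 * dot, 0.0))
```

For rows (1, 3) and (2, 1) the dot product is 5 and the squared norms are 10 
and 5, so Tanimoto comes out as 5 / (10 + 5 - 5) = 0.5.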

I think we should give this a shot and maybe drop the old, overly generic 
version completely (we should ask on the user list before dropping it).


[1] Lin: "Scalable Language Processing Algorithms for the Masses: A Case Study 
in Computing Word Co-occurrence Matrices with MapReduce", 
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.156.8326&rep=rep1&type=pdf


> Improve RowSimilarityJob performance for count-based distance measures
> ----------------------------------------------------------------------
>
>                 Key: MAHOUT-767
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-767
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Grant Ingersoll
>             Fix For: 0.6
>
>
> (See 
> http://www.lucidimagination.com/search/document/40c4f124795c6b5/rowsimilarity_s#42ab816c27c6a9e7
>  for background)
> Currently, the RowSimilarityJob defers the calculation of the similarity 
> metric until the reduce phase, while emitting many Cooccurrence objects.  For 
> similarity metrics that are algebraic 
> (http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions) we should be 
> able to do much of the computation during the Mapper part of this phase and 
> also take advantage of a Combiner.  
> We should use a marker interface to know whether a similarity metric is 
> algebraic and then make use of an appropriate Mapper implementation, 
> otherwise we can fall back on our existing implementation.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
