[jira] [Issue Comment Edited] (MAHOUT-767) Improve RowSimilarityJob performance for count-based distance measures

Sebastian Schelter (JIRA) Mon, 25 Jul 2011 11:13:33 -0700

    [ https://issues.apache.org/jira/browse/MAHOUT-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070647#comment-13070647 ]


Sebastian Schelter edited comment on MAHOUT-767 at 7/25/11 6:11 PM:
--------------------------------------------------------------------

Patch with first proof-of-concept code. It introduces AlgebraicRowSimilarityJob.

Instead of emitting n*(n-1)/2 pairs from each inverted index entry, it emits n 
"stripes". Each stripe consists of two vectors: the first holds the partial dot 
products/counts and the second holds the norms of the co-occurring rows. A 
combiner can easily merge these stripes.
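
To illustrate the idea, here is a minimal sketch and not the code from the 
patch. The class and method names are hypothetical, it only models the first 
vector of a stripe (the partial dot products) and leaves out the norms, and it 
assumes the row ids of each inverted index entry are sorted in ascending order 
so that a pair always lands in the stripe of the smaller row id.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not the AlgebraicRowSimilarityJob implementation.
class StripeSketch {

  /** Pairs approach: one record per unordered pair of rows in an entry. */
  static int countPairs(int n) {
    return n * (n - 1) / 2;
  }

  /**
   * Stripes approach: one stripe per row in the entry. The stripe of row i
   * maps every later row j to the partial dot product values[i] * values[j].
   * Assumes rowIds is sorted in ascending order.
   */
  static Map<Integer, Map<Integer, Double>> stripesForEntry(int[] rowIds, double[] values) {
    Map<Integer, Map<Integer, Double>> stripes = new TreeMap<>();
    for (int i = 0; i < rowIds.length; i++) {
      Map<Integer, Double> stripe = new TreeMap<>();
      for (int j = i + 1; j < rowIds.length; j++) {
        stripe.put(rowIds[j], values[i] * values[j]);
      }
      stripes.put(rowIds[i], stripe);
    }
    return stripes;
  }

  /** Combiner-style merge: element-wise sum of two stripes of the same row. */
  static Map<Integer, Double> merge(Map<Integer, Double> a, Map<Integer, Double> b) {
    Map<Integer, Double> merged = new TreeMap<>(a);
    b.forEach((row, partial) -> merged.merge(row, partial, Double::sum));
    return merged;
  }
}
```

Because merging is a plain element-wise sum, it can run in a combiner any 
number of times before the reducer sees the data.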

So we emit fewer objects and hopefully combine many of them, which should 
improve performance.

I attached implementations for LLR, Tanimoto, Cosine and Cooccurrence count. 
Euclidean distance and Pearson correlation are still missing, but we should be 
able to add them later (see AlgebraicVectorSimilarity).
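
As a rough illustration of why such measures fit this scheme, the sketch below 
computes two of them from nothing but an aggregated dot product (or 
cooccurrence count) and per-row norms (or counts). The class and method names 
are hypothetical and do not reflect the AlgebraicVectorSimilarity API in the 
patch. The formulas are the standard definitions of cosine similarity and of 
the Tanimoto coefficient for binary data.

```java
// Hypothetical sketch, not the AlgebraicVectorSimilarity API from the patch.
class AlgebraicSimilaritySketch {

  /** Cosine similarity from a summed dot product and the two Euclidean norms. */
  static double cosine(double dot, double normA, double normB) {
    return dot / (normA * normB);
  }

  /**
   * Tanimoto coefficient for binary rows: the size of the intersection
   * divided by the size of the union, where the union is
   * countA + countB - cooccurrences.
   */
  static double tanimoto(double cooccurrences, double countA, double countB) {
    return cooccurrences / (countA + countB - cooccurrences);
  }
}
```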

The patch has unit tests, but I currently don't have access to a testing 
cluster (this will change in the next few weeks). It would be great if someone 
could verify that this code performs better than the existing approach. Seeing 
some numbers would be awesome.

      was (Author: ssc):
    Patch with first proof-of-concept code. It introduces 
AlgebraicRowSimilarityJob.

Instead of emitting (n*(n-1)) pairs from each inverted index entry it emits n 
"stripes" with each stripe consisting of two vectors with the first one holding 
the partial dot products/counts and the second holding the norms of cooccurred 
rows. These stripes can be easily merged by a combiner.

So we emit less objects and hopefully combine a lot of them which should lead 
to performance increasements.

I attached implementations for LLR, Tanimoto, Cosine and Cooccurrence count. 
Euclidean distance and Pearson-Correlation are still missing but we should be 
able to add them later (see AlgebraicVectorSimilarity)

Patch has unit tests, but as I don't have access to a testing cluster currently 
(this will change in the next weeks), it would be great if someone could verify 
that this code performs better than the existing approach, seeing some numbers 
would be awesome.
  
> Improve RowSimilarityJob performance for count-based distance measures
> ----------------------------------------------------------------------
>
>                 Key: MAHOUT-767
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-767
>             Project: Mahout
>          Issue Type: Improvement
>            Reporter: Grant Ingersoll
>             Fix For: 0.6
>
>         Attachments: MAHOUT-767.patch
>
>
> (See 
> http://www.lucidimagination.com/search/document/40c4f124795c6b5/rowsimilarity_s#42ab816c27c6a9e7
>  for background)
> Currently, the RowSimilarityJob defers the calculation of the similarity 
> metric until the reduce phase, while emitting many Cooccurrence objects.  For 
> similarity metrics that are algebraic 
> (http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions) we should be 
> able to do much of the computation during the Mapper part of this phase and 
> also take advantage of a Combiner.  
> We should use a marker interface to know whether a similarity metric is 
> algebraic and then make use of an appropriate Mapper implementation, 
> otherwise we can fall back on our existing implementation.
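
The marker-interface idea from the description could look roughly like the 
following sketch. All type names and the returned labels are hypothetical and 
only stand in for whatever similarity and mapper classes the job actually uses.

```java
// Hypothetical sketch of dispatching on a marker interface.
class MarkerInterfaceSketch {

  /** Stand-in for the job's similarity measure type. */
  interface Similarity {}

  /** Marker interface: no methods, only signals that the measure is algebraic. */
  interface Algebraic {}

  static class AlgebraicMeasure implements Similarity, Algebraic {}

  static class OtherMeasure implements Similarity {}

  /** Picks the optimized mapper for algebraic measures, else the existing one. */
  static String chooseMapper(Similarity similarity) {
    return similarity instanceof Algebraic ? "algebraic mapper" : "existing mapper";
  }
}
```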

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
