[ https://issues.apache.org/jira/browse/MAHOUT-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13068295#comment-13068295 ]
Sebastian Schelter commented on MAHOUT-767:
-------------------------------------------
I suggest we create a specialized implementation that uses the "stripes"
pattern from [1]. Generalizing the approach from that paper, we'd need to
emit a pair of vectors for each entry: the first holding the partially summed
dot products/counts, the second holding the norms. A combiner should be able
to merge these vectors easily.
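
To make the idea concrete, here is a minimal in-memory sketch in Python (not
Mahout code). It assumes the input arrives column by column as a dict of
row id -> value, and that the "norms" vector carries partial squared norms;
both are my assumptions, and the names map_column, merge_stripes and run_job
are hypothetical.

```python
def map_column(column):
    """Mapper sketch: 'column' maps row id -> value for one matrix column.

    For each row, emit a pair of sparse vectors (as dicts):
    partial dot products with later rows, and a partial squared norm.
    """
    rows = sorted(column)
    for i, a in enumerate(rows):
        # Only pair a with rows b > a, so each unordered pair is emitted once.
        dots = {b: column[a] * column[b] for b in rows[i + 1:]}
        norms = {a: column[a] ** 2}
        yield a, (dots, norms)


def merge_stripes(left, right):
    """Combiner/reducer sketch: element-wise sum of two (dots, norms) pairs."""
    merged = []
    for left_vec, right_vec in zip(left, right):
        out = dict(left_vec)
        for key, value in right_vec.items():
            out[key] = out.get(key, 0.0) + value
        merged.append(out)
    return tuple(merged)


def run_job(columns):
    """Simulate map + combine + reduce in a single process."""
    stripes = {}
    for column in columns:
        for row, pair in map_column(column):
            if row in stripes:
                stripes[row] = merge_stripes(stripes[row], pair)
            else:
                stripes[row] = pair
    return stripes
```

Because merging is just an element-wise sum, it can be applied in any order,
which is what would let the same function serve as both combiner and reducer.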
With this approach, we should be able to cover all currently existing measures,
such as cooccurrence count, LLR, Tanimoto, cosine, Euclidean distance, Manhattan
distance, and maybe even Pearson if someone figures out the math :)
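
As an illustration for three of these measures, the final value for a pair of
rows needs only the summed dot product and the two squared norms. The sketch
below uses the standard textbook identities; the function names are
hypothetical, and the exact formulas Mahout uses (for example, how a distance
is turned into a similarity) may differ.

```python
import math


def cosine(dot, norm_a_sq, norm_b_sq):
    """Cosine similarity from a dot product and two squared norms."""
    return dot / math.sqrt(norm_a_sq * norm_b_sq)


def tanimoto(dot, norm_a_sq, norm_b_sq):
    """Tanimoto coefficient: dot / (|a|^2 + |b|^2 - dot)."""
    return dot / (norm_a_sq + norm_b_sq - dot)


def euclidean_distance(dot, norm_a_sq, norm_b_sq):
    """Euclidean distance via |a - b|^2 = |a|^2 + |b|^2 - 2 * dot."""
    # Clamp at zero to guard against tiny negative values from rounding.
    return math.sqrt(max(norm_a_sq + norm_b_sq - 2.0 * dot, 0.0))
```

For a = (1, 2) and b = (2, 1), the inputs would be dot = 4 and both squared
norms = 5.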
I think we should give this a shot and maybe drop the old, overly generic
version completely (we should ask on the user list before dropping it).
[1] Lin: "Scalable Language Processing Algorithms for the Masses: A Case Study
in Computing Word Co-occurrence Matrices with MapReduce",
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.156.8326&rep=rep1&type=pdf
> Improve RowSimilarityJob performance for count-based distance measures
> ----------------------------------------------------------------------
>
> Key: MAHOUT-767
> URL: https://issues.apache.org/jira/browse/MAHOUT-767
> Project: Mahout
> Issue Type: Improvement
> Reporter: Grant Ingersoll
> Fix For: 0.6
>
>
> (See
> http://www.lucidimagination.com/search/document/40c4f124795c6b5/rowsimilarity_s#42ab816c27c6a9e7
> for background)
> Currently, the RowSimilarityJob defers the calculation of the similarity
> metric until the reduce phase, while emitting many Cooccurrence objects. For
> similarity metrics that are algebraic
> (http://pig.apache.org/docs/r0.8.1/udf.html#Aggregate+Functions) we should be
> able to do much of the computation during the Mapper part of this phase and
> also take advantage of a Combiner.
> We should use a marker interface to know whether a similarity metric is
> algebraic and then make use of an appropriate Mapper implementation,
> otherwise we can fall back on our existing implementation.