[ https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882963#action_12882963 ]
Sean Owen commented on MAHOUT-418:
----------------------------------
I looked at it briefly and it seems OK. It's mostly moving code and
generalizing. Am I right that all the original functionality you had is still
there, just refactored? If so, I'm OK to commit. My last question is whether
the methods in TasteHadoopUtils can be used elsewhere; there are several other
classes that do similar things. We don't need a new patch to address that; we
can do it later. I will run the tests and commit if they pass, as this seems
well-enough considered to go in.
> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
> Key: MAHOUT-418
> URL: https://issues.apache.org/jira/browse/MAHOUT-418
> Project: Mahout
> Issue Type: New Feature
> Components: Math
> Reporter: Sebastian Schelter
> Attachments: MAHOUT-418-2.patch, MAHOUT-418-3.patch, MAHOUT-418.patch
>
>
> In response to the wish from MAHOUT-362 and the latest discussion on the
> mailing list started by Kris Jack about computing a document similarity
> matrix, I tried to generalize the approach we're already using to compute the
> item-item-similarities for collaborative filtering.
> The job in the patch computes the pairwise similarity of the rows of a matrix
> in a distributed manner. It uses a SequenceFile<IntWritable,VectorWritable>
> as input and outputs such a file too. Custom similarity implementations can
> be supplied; I've already implemented Tanimoto and cosine for demo and
> testing purposes. The algorithm is based on the one presented here:
> http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by
> running it with a reasonably large input. I'm also worried that it might
> buffer too much data in certain steps.
> If you decide to include it in Mahout, some more effort and decisions (like
> more tests, more similarity measures, integration with DistributedRowMatrix)
> would be needed, I guess.
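
For readers unfamiliar with the idea, the following is a minimal in-memory
sketch of "pairwise row similarity with a pluggable measure". It is not the
code from the patch and not the distributed algorithm from the linked paper:
it compares every pair of rows directly, which only works for small inputs.
The class RowSimilaritySketch, the Similarity interface and the method names
are hypothetical and chosen for illustration only; rows are assumed to be
sparse vectors stored as column-index-to-value maps, and Tanimoto is computed
here on the sets of non-zero columns.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RowSimilaritySketch {

    /** Hypothetical pluggable measure; NOT Mahout's actual interface. */
    interface Similarity {
        double similarity(Map<Integer, Double> a, Map<Integer, Double> b);
    }

    /** Dot product of two sparse rows, iterating over the smaller one. */
    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        Map<Integer, Double> small = a.size() <= b.size() ? a : b;
        Map<Integer, Double> large = (small == a) ? b : a;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    /** Cosine of the angle between two rows; 0 if either row is all zeros. */
    static final Similarity COSINE = (a, b) -> {
        double denom = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
        return denom == 0.0 ? 0.0 : dot(a, b) / denom;
    };

    /** Tanimoto (Jaccard) coefficient on the sets of non-zero columns. */
    static final Similarity TANIMOTO = (a, b) -> {
        int common = 0;
        for (Integer column : a.keySet()) {
            if (b.containsKey(column)) {
                common++;
            }
        }
        int union = a.size() + b.size() - common;
        return union == 0 ? 0.0 : (double) common / union;
    };

    /** Brute-force comparison of all row pairs (i < j); result is keyed by "i,j". */
    static Map<String, Double> pairwise(List<Map<Integer, Double>> rows, Similarity measure) {
        Map<String, Double> result = new HashMap<>();
        for (int i = 0; i < rows.size(); i++) {
            for (int j = i + 1; j < rows.size(); j++) {
                result.put(i + "," + j, measure.similarity(rows.get(i), rows.get(j)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map<Integer, Double>> rows = new ArrayList<>();
        rows.add(Map.of(0, 1.0, 1, 1.0));
        rows.add(Map.of(1, 1.0, 2, 1.0));
        rows.add(Map.of(0, 2.0, 1, 2.0));
        System.out.println("cosine:   " + pairwise(rows, COSINE));
        System.out.println("tanimoto: " + pairwise(rows, TANIMOTO));
    }
}
```

The brute-force loop is quadratic in the number of rows, which is the cost a
distributed job is meant to avoid; the sketch only illustrates how a custom
similarity measure can be swapped in.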