[ 
https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881141#action_12881141
 ] 

Sebastian Schelter commented on MAHOUT-418:
-------------------------------------------

I think we should take the time to find a unified solution. You are 
right that maintaining two implementations of the same algorithm is not 
desirable, and from my side there's no hurry to commit this.

I propose the following decisions/assumptions for 
o.a.m.math.hadoop.similarity.RowSimilarityJob

 * similarity functions used must be symmetric 
 * the whole matrix is written to disk
 * the similarity of a row to itself is always computed (making this job a 
use-case-agnostic mathematical operation)
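
To make these assumptions concrete, here is a minimal in-memory sketch 
(not the MapReduce job from the patch). The class and method names are 
hypothetical, and cosine is only a stand-in for whatever symmetric 
similarity function gets plugged in. Because the function is symmetric, 
only the upper triangle is computed and then mirrored; the diagonal is 
included, and the whole matrix is returned.

```java
import java.util.Arrays;

// Hypothetical illustration only: plain arrays instead of Hadoop and Mahout vectors.
class RowSimilaritySketch {

    /** Cosine similarity of two dense rows; symmetric in its arguments. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            normA += a[k] * a[k];
            normB += b[k] * b[k];
        }
        // Assumption made for this sketch: an all-zero row is similar to nothing.
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the full n x n similarity matrix, diagonal included. */
    static double[][] rowSimilarities(double[][] rows) {
        int n = rows.length;
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            // j starts at i, so the similarity of a row to itself is computed too.
            for (int j = i; j < n; j++) {
                double s = cosine(rows[i], rows[j]);
                sim[i][j] = s;
                sim[j][i] = s; // valid only because the function is symmetric
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        double[][] rows = {{1, 0, 1}, {1, 0, 1}, {0, 1, 0}};
        for (double[] row : rowSimilarities(rows)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

With an asymmetric function the mirroring step would be wrong and every 
ordered pair would have to be computed, which is why symmetry is listed 
as a requirement.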

Then o.a.m.cf.taste.hadoop.similarity.SimilarityJob could be reduced to the 
following steps:

 * map the IDs to ints
 * create an item-user-matrix from the preferences
 * run o.a.m.math.hadoop.similarity.RowSimilarityJob
 * pick out the pairs of similar items from the resulting matrix (ignoring the 
similarity of an item to itself)
 * map the ints back to the IDs
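
As a rough single-machine sketch of the steps above (all names are 
hypothetical, none of this is the Taste or Hadoop API): IDs are indexed 
in order of first appearance, cosine stands in for the row similarity 
job, and two choices are assumptions of mine rather than part of the 
proposal: each unordered pair is emitted once, and pairs with 
similarity 0 are dropped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the proposed wrapper job.
class ItemSimilarityPipelineSketch {

    static final class SimilarPair {
        final long itemA, itemB;
        final double similarity;
        SimilarPair(long itemA, long itemB, double similarity) {
            this.itemA = itemA; this.itemB = itemB; this.similarity = similarity;
        }
    }

    /** Step 1: map an ID to an int index, assigned in order of first appearance. */
    static int indexOf(Map<Long, Integer> index, List<Long> reverse, long id) {
        Integer idx = index.get(id);
        if (idx == null) {
            idx = reverse.size();
            index.put(id, idx);
            reverse.add(id);
        }
        return idx;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k]; normA += a[k] * a[k]; normB += b[k] * b[k];
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Preference p is the triple (userIDs[p], itemIDs[p], values[p]). */
    static List<SimilarPair> similarItems(long[] userIDs, long[] itemIDs, double[] values) {
        Map<Long, Integer> userIndex = new HashMap<>(), itemIndex = new HashMap<>();
        List<Long> users = new ArrayList<>(), items = new ArrayList<>();
        int[] u = new int[values.length], it = new int[values.length];
        for (int p = 0; p < values.length; p++) {
            u[p] = indexOf(userIndex, users, userIDs[p]);
            it[p] = indexOf(itemIndex, items, itemIDs[p]);
        }
        // Step 2: item-user matrix, one row per item.
        double[][] matrix = new double[items.size()][users.size()];
        for (int p = 0; p < values.length; p++) {
            matrix[it[p]][u[p]] = values[p];
        }
        // Step 3: full row-similarity matrix, diagonal included (stand-in for the job).
        int n = matrix.length;
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                sim[i][j] = sim[j][i] = cosine(matrix[i], matrix[j]);
            }
        }
        // Steps 4 and 5: skip the diagonal, keep similar pairs, map ints back to IDs.
        List<SimilarPair> pairs = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (sim[i][j] > 0) {
                    pairs.add(new SimilarPair(items.get(i), items.get(j), sim[i][j]));
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        long[] userIDs = {1L, 2L, 1L, 2L, 1L, 3L};
        long[] itemIDs = {10L, 10L, 20L, 20L, 30L, 40L};
        double[] values = {1, 1, 1, 1, 1, 1};
        for (SimilarPair p : similarItems(userIDs, itemIDs, values)) {
            System.out.println(p.itemA + " ~ " + p.itemB + " : " + p.similarity);
        }
    }
}
```

Only step 3 would need to be distributed; the ID mapping and the pair 
extraction are thin layers around it, which is the point of the 
proposed split.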

Do you think this is a viable way to go here?

> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
>                 Key: MAHOUT-418
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-418-2.patch, MAHOUT-418.patch
>
>
> In response to the wish from MAHOUT-362 and the latest discussion on the 
> mailing list started by Kris Jack about computing a document similarity 
> matrix, I tried to generalize the approach we're already using to compute the 
> item-item-similarities for collaborative filtering.
> The job in the patch computes the pairwise similarity of the rows of a matrix 
> in a distributed manner. It uses a SequenceFile<IntWritable,VectorWritable> 
> as input and outputs such a file too. Custom similarity implementations can 
> be supplied; I've already implemented Tanimoto and cosine for demo and 
> testing purposes. The algorithm is based on the one presented here: 
> http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by 
> running it with a reasonably large input. I'm also worried that it might 
> buffer too much data in certain steps.
> If you decide to include it in Mahout, some more effort and decisions (like 
> more tests, more similarity measures, integration with DistributedRowMatrix) 
> would be needed, I guess.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
