[ 
https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880757#action_12880757
 ] 

Sebastian Schelter commented on MAHOUT-418:
-------------------------------------------

Attached a new patch including equals() and hashCode() for WeightedRowPair; 
thank you for pointing that out.
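
For readers without the patch at hand, here is a minimal sketch of what a 
consistent equals()/hashCode() pair can look like on such a class. It is not 
the code from the patch: the class name RowPairSketch and its four fields are 
assumptions made for illustration only.

```java
import java.util.Objects;

// Hypothetical stand-in for WeightedRowPair; the field names are assumptions.
final class RowPairSketch {
  private final int rowA;
  private final int rowB;
  private final double weightA;
  private final double weightB;

  RowPairSketch(int rowA, int rowB, double weightA, double weightB) {
    this.rowA = rowA;
    this.rowB = rowB;
    this.weightA = weightA;
    this.weightB = weightB;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof RowPairSketch)) {
      return false;
    }
    RowPairSketch that = (RowPairSketch) other;
    // Double.compare treats NaN as equal to NaN, keeping equals() reflexive.
    return rowA == that.rowA
        && rowB == that.rowB
        && Double.compare(weightA, that.weightA) == 0
        && Double.compare(weightB, that.weightB) == 0;
  }

  @Override
  public int hashCode() {
    // Hash exactly the fields compared in equals(), so equal objects
    // always produce equal hash codes.
    return Objects.hash(rowA, rowB, weightA, weightB);
  }
}
```

The point of the sketch is only that both methods are derived from the same 
set of fields.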

I'm not sure whether the code in o.a.m.cf.taste.hadoop.similarity should be 
removed yet. Although it implements the same algorithm as this patch, the two 
differ in some details. By merging them we would lose some optimizations in 
the cf-specific implementation, but I agree with Sean that it is desirable to 
have the cf code use standard matrix operations.

Differences between the two implementations:

 * vectors use ints as indices, preferences use longs as IDs, so those IDs 
would need to be mapped to ints and back (I think the distributed recommender 
job is already doing that, so that shouldn't be a big problem)

 * o.a.m.math.hadoop.similarity.RowSimilarityJob writes the whole result matrix 
to disk (although it should be symmetric and no information would be lost if 
only half of it was written) because we need the whole matrix to be available 
for following operations and integration into DistributedRowMatrix

 * o.a.m.cf.taste.hadoop.similarity.SimilarityJob automatically assumes the 
similarity of an item to itself as NaN (and doesn't compute it) whereas a 
similarity matrix created by RowSimilarityJob actively computes and includes 
these values (because it's a mathematical operation and should be agnostic of 
the fact that its main use case is collaborative filtering)
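
To illustrate the first point, here is a minimal sketch of mapping long IDs 
to int indices and back. It is an assumption for illustration, not 
necessarily what the distributed recommender job does: the class 
IdIndexSketch, the hash used in idToIndex, and the in-memory reverse map are 
all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; not part of the patch or of Mahout's API.
final class IdIndexSketch {
  // Reverse lookup from int index back to the original long ID.
  private final Map<Integer, Long> indexToId = new HashMap<>();

  // Folds the 64-bit ID into a non-negative 31-bit index.
  // Different IDs can collide, so callers must handle that case.
  static int idToIndex(long id) {
    return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32));
  }

  // Records the mapping and fails loudly if two IDs share an index.
  int register(long id) {
    int index = idToIndex(id);
    Long previous = indexToId.putIfAbsent(index, id);
    if (previous != null && previous != id) {
      throw new IllegalStateException(
          "IDs " + previous + " and " + id + " collide on index " + index);
    }
    return index;
  }

  // Maps an index produced by register() back to its long ID.
  long toId(int index) {
    Long id = indexToId.get(index);
    if (id == null) {
      throw new IllegalArgumentException("Unknown index " + index);
    }
    return id;
  }
}
```

Because the mapping is lossy, the reverse map is what makes the round trip 
possible; a real job would have to decide how to deal with collisions.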

A possible solution for the cf use case (one that would allow merging the 
implementations) would be to have RowSimilarityJob do the computation and 
then, in a second M/R run, pick out only the matrix entries we're interested 
in.

If you want it that way, I can implement that.
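
A minimal in-memory sketch of that post-filtering idea follows. It is not an 
M/R job, and it assumes that the entries of interest are the off-diagonal 
ones, with each symmetric pair kept once; the class SimilarityFilterSketch 
and its Entry holder are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; a real version would run as a second M/R pass.
final class SimilarityFilterSketch {

  // One retained matrix entry.
  static final class Entry {
    final int row;
    final int column;
    final double similarity;

    Entry(int row, int column, double similarity) {
      this.row = row;
      this.column = column;
      this.similarity = similarity;
    }
  }

  // Keeps only entries above the diagonal: self-similarities are dropped,
  // and each symmetric pair (i, j) / (j, i) is emitted once.
  static List<Entry> offDiagonalUpperTriangle(double[][] similarities) {
    List<Entry> kept = new ArrayList<>();
    for (int row = 0; row < similarities.length; row++) {
      for (int column = row + 1; column < similarities[row].length; column++) {
        double value = similarities[row][column];
        if (!Double.isNaN(value)) {
          kept.add(new Entry(row, column, value));
        }
      }
    }
    return kept;
  }
}
```

For a full symmetric n x n matrix this keeps n * (n - 1) / 2 entries.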

> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
>                 Key: MAHOUT-418
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-418-2.patch, MAHOUT-418.patch
>
>
> In response to the wish from MAHOUT-362 and the latest discussion on the 
> mailing list started by Kris Jack about computing a document similarity 
> matrix, I tried to generalize the approach we're already using to compute the 
> item-item-similarities for collaborative filtering.
> The job in the patch computes the pairwise similarity of the rows of a matrix 
> in a distributed manner; it uses a SequenceFile<IntWritable,VectorWritable> 
> as input and outputs such a file too. Custom similarity implementations can 
> be supplied, I've already implemented tanimoto and cosine for demo and 
> testing purposes. The algorithm is based on the one presented here: 
> http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by 
> running it with a reasonably large input, I'm also worried that it might 
> buffer too much data in certain steps.
> If you decide to include it in Mahout, some more effort and decisions (like 
> more tests, more similarity measures, integration with DistributedRowMatrix) 
> would be needed, I guess.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
