[jira] Commented: (MAHOUT-418) Computing the pairwise similarities of the rows of a matrix

Sean Owen (JIRA) Mon, 21 Jun 2010 10:23:53 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12880902#action_12880902
 ]


Sean Owen commented on MAHOUT-418:
----------------------------------

Let's see how far apart these two implementations are. It would be great to 
spend some time unifying them a bit, now, if there is no hurry to get a second 
implementation in.

Yes, the recommender-specific job needs an additional phase at the start and 
end, to map from longs to ints and back. It does do this. This can remain. But 
once data is converted into vectors, the general code you are creating should 
be able to take over?

Both implementations can write the whole matrix, and take the same approach to 
self-similarity. That is, I think you are welcome to make them both assume the 
same thing. Just compute and store everything for good measure.

If those are the only differences, it really seems like they are doing the same 
thing and this can be a move of code rather than a copy. I think you should feel 
free to go this way, even if it requires changes to other code. I can help 
adjust other code if it means some assumptions have changed.

That way you are not burdened with maintaining two implementations. I think 
that makes MAHOUT-420 easier.

What do you think, are you keen to commit this, or open to pushing towards one 
implementation?


> Computing the pairwise similarities of the rows of a matrix
> -----------------------------------------------------------
>
>                 Key: MAHOUT-418
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Sebastian Schelter
>         Attachments: MAHOUT-418-2.patch, MAHOUT-418.patch
>
>
> In response to the wish from MAHOUT-362 and the latest discussion on the 
> mailing list started by Kris Jack about computing a document similarity 
> matrix, I tried to generalize the approach we're already using to compute the 
> item-item-similarities for collaborative filtering.
> The job in the patch computes the pairwise similarity of the rows of a matrix 
> in a distributed manner. It uses a SequenceFile<IntWritable,VectorWritable> 
> as input and outputs such a file too. Custom similarity implementations can 
> be supplied; I've already implemented Tanimoto and cosine for demo and 
> testing purposes. The algorithm is based on the one presented here: 
> http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
> I'd be glad if someone could verify the applicability of this approach by 
> running it with a reasonably large input. I'm also worried that it might 
> buffer too much data in certain steps.
> If you decide to include it in Mahout, some more effort and decisions (like 
> more tests, more similarity measures, integration with DistributedRowMatrix) 
> would need to be made, I guess.
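
For illustration only, here is a minimal in-memory Java sketch of the column-wise idea behind the paper linked in the description: collect the non-zero cells of each column, accumulate one partial product for every pair of rows that share a column, then normalize the sums to get cosine similarities. This is not the MapReduce job from the patch. The class name PairwiseRowSimilaritySketch, the Cell helper, and the sparse-row representation (a map from column index to value) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, single-machine sketch; not the code from the MAHOUT-418 patch.
class PairwiseRowSimilaritySketch {

  /** One non-zero cell of a column: the row it came from and its value. */
  static final class Cell {
    final int row;
    final double value;

    Cell(int row, double value) {
      this.row = row;
      this.value = value;
    }
  }

  /** Returns the full n x n cosine similarity matrix, self-similarities included. */
  static double[][] cosineSimilarities(List<Map<Integer, Double>> rows) {
    int n = rows.size();

    // Step 1: regroup the cells by column and compute each row's Euclidean norm.
    Map<Integer, List<Cell>> columns = new HashMap<>();
    double[] norms = new double[n];
    for (int r = 0; r < n; r++) {
      for (Map.Entry<Integer, Double> cell : rows.get(r).entrySet()) {
        double v = cell.getValue();
        columns.computeIfAbsent(cell.getKey(), k -> new ArrayList<>()).add(new Cell(r, v));
        norms[r] += v * v;
      }
      norms[r] = Math.sqrt(norms[r]);
    }

    // Step 2: each pair of rows sharing a column contributes one partial product.
    double[][] dots = new double[n][n];
    for (List<Cell> column : columns.values()) {
      for (Cell a : column) {
        for (Cell b : column) {
          dots[a.row][b.row] += a.value * b.value;
        }
      }
    }

    // Step 3: normalize the dot products; all-zero rows keep similarity 0.
    double[][] sims = new double[n][n];
    for (int i = 0; i < n; i++) {
      for (int j = 0; j < n; j++) {
        if (norms[i] > 0 && norms[j] > 0) {
          sims[i][j] = dots[i][j] / (norms[i] * norms[j]);
        }
      }
    }
    return sims;
  }

  public static void main(String[] args) {
    List<Map<Integer, Double>> rows = new ArrayList<>();
    rows.add(Map.of(0, 1.0, 1, 1.0));
    rows.add(Map.of(0, 2.0, 1, 2.0));
    rows.add(Map.of(2, 5.0));
    for (double[] row : cosineSimilarities(rows)) {
      System.out.println(java.util.Arrays.toString(row));
    }
  }
}
```

The nested loop in step 2 is where work grows with the square of a column's length, so a very dense column is the likely place for the buffering concern raised in the description to show up.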

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
