Hi Kris,

Glad to hear that the code works and is useful to you!

-sebastian

On 22.06.2010 13:33, Kris Jack (JIRA) wrote:
> [
> https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881174#action_12881174
> ]
>
> Kris Jack commented on MAHOUT-418:
> ----------------------------------
>
> Hi Sebastian,
>
> I ran your latest patch on a set of 10,000,000+ documents and managed to get
> results produced in less than 24 hours using 10 mappers and 10 reducers.
> My documents have been made very sparse, eliminating any terms with a global
> frequency > 0.001% of all term frequencies. I haven't been able to analyse
> the results systematically yet but they look good at a glimpse. I think that
> this is a very valuable contribution to mahout, well done!
>
> Thanks,
> Kris
>
>
>> Computing the pairwise similarities of the rows of a matrix
>> -----------------------------------------------------------
>>
>> Key: MAHOUT-418
>> URL: https://issues.apache.org/jira/browse/MAHOUT-418
>> Project: Mahout
>> Issue Type: New Feature
>> Components: Math
>> Reporter: Sebastian Schelter
>> Attachments: MAHOUT-418-2.patch, MAHOUT-418.patch
>>
>>
>> In response to the wish from MAHOUT-362 and the latest discussion on the
>> mailing list started by Kris Jack about computing a document similarity
>> matrix, I tried to generalize the approach we're already using to compute
>> the item-item-similarities for collaborative filtering.
>> The job in the patch computes the pairwise similarity of the rows of a
>> matrix in a distributed manner. It uses a
>> SequenceFile<IntWritable,VectorWritable> as input and outputs such a file
>> too. Custom similarity implementations can be supplied; I've already
>> implemented tanimoto and cosine for demo and testing purposes. The algorithm
>> is based on the one presented here:
>> http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
>> I'd be glad if someone could verify the applicability of this approach by
>> running it with a reasonably large input. I'm also worried that it might
>> buffer too much data in certain steps.
>> If you decide to include it in mahout, some more efforts and decisions (like
>> more tests, more similarity measures, integration with DistributedRowMatrix)
>> would need to be made, I guess.
>>
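
For readers following the thread, the general idea in the issue description (pairwise row similarities computed by going through the columns, in the spirit of the inverted-index approach from the linked paper) can be sketched in memory roughly as below. This is only an illustrative Python sketch under my own assumptions, not the code from the patch: the function name pairwise_cosine and the dict-of-dicts input format are hypothetical, and the real job runs as distributed map/reduce steps over SequenceFiles rather than in a single process.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def pairwise_cosine(rows):
    """Cosine similarity for every pair of rows that share a non-zero column.

    rows: dict mapping row id -> dict mapping column id -> value (sparse rows).
    Returns a dict mapping (i, j) with i < j -> cosine similarity.
    Hypothetical helper for illustration; not part of Mahout.
    """
    # Step 1: "transpose" the matrix into column -> [(row id, value)].
    columns = defaultdict(list)
    for row_id, vector in rows.items():
        for col_id, value in vector.items():
            if value != 0:
                columns[col_id].append((row_id, value))

    # Step 2: every pair of rows co-occurring in a column contributes
    # one partial product to the dot product of that pair.
    dots = defaultdict(float)
    for entries in columns.values():
        for (i, a), (j, b) in combinations(sorted(entries), 2):
            dots[(i, j)] += a * b

    # Step 3: normalise the accumulated dot products by the row norms.
    norms = {r: sqrt(sum(v * v for v in vec.values())) for r, vec in rows.items()}
    return {(i, j): dot / (norms[i] * norms[j]) for (i, j), dot in dots.items()}


rows = {0: {0: 1.0, 1: 1.0}, 1: {0: 1.0}, 2: {2: 3.0}}
sims = pairwise_cosine(rows)
```

Rows that share no column never appear in the result, which is what makes the approach attractive for very sparse input. The per-column pair enumeration in step 2 is also where a very frequent column would produce a lot of intermediate data, which is presumably why pruning frequent terms helps so much.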
