Hi Kris,

Glad to hear that the code works and is useful to you!

-sebastian

On 22.06.2010 13:33, Kris Jack (JIRA) wrote:
>     [ 
> https://issues.apache.org/jira/browse/MAHOUT-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881174#action_12881174
>  ] 
>
> Kris Jack commented on MAHOUT-418:
> ----------------------------------
>
> Hi Sebastian,
>
> I ran your latest patch on a set of 10,000,000+ documents and got results in
> less than 24 hours using 10 mappers and 10 reducers. I made my documents very
> sparse by eliminating any terms with a global frequency > 0.001% of all term
> frequencies. I haven't been able to analyse the results systematically yet,
> but they look good at a glance. I think that this is a very valuable
> contribution to Mahout, well done!
>
> Thanks,
> Kris
>
>   
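The sparsification Kris describes can be sketched on a single machine as below. This is only one reading of "global frequency > 0.001% of all term frequencies" (a term's share of all term occurrences in the collection); the function name `prune_frequent_terms` and the parameter `max_share` are hypothetical and not part of the patch or of Mahout.

```python
from collections import Counter


def prune_frequent_terms(docs, max_share=0.00001):
    """Drop terms whose share of all term occurrences exceeds max_share.

    docs is a list of dicts mapping term -> count within that document.
    The default max_share of 0.00001 corresponds to 0.001%.
    """
    # Sum each term's count over the whole collection.
    totals = Counter()
    for doc in docs:
        totals.update(doc)
    grand_total = sum(totals.values())

    # Terms that are too common carry little signal and make rows dense.
    too_common = {t for t, c in totals.items() if c / grand_total > max_share}

    # Rebuild every document without the overly common terms.
    return [{t: c for t, c in doc.items() if t not in too_common} for doc in docs]
```

Removing the most frequent terms shortens the longest per-term lists, which is what keeps the number of candidate row pairs manageable in a pairwise similarity job.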
>> Computing the pairwise similarities of the rows of a matrix
>> -----------------------------------------------------------
>>
>>                 Key: MAHOUT-418
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-418
>>             Project: Mahout
>>          Issue Type: New Feature
>>          Components: Math
>>            Reporter: Sebastian Schelter
>>         Attachments: MAHOUT-418-2.patch, MAHOUT-418.patch
>>
>>
>> In response to the wish from MAHOUT-362 and the latest discussion on the 
>> mailing list started by Kris Jack about computing a document similarity 
>> matrix, I tried to generalize the approach we're already using to compute 
>> the item-item similarities for collaborative filtering.
>> The job in the patch computes the pairwise similarity of the rows of a 
>> matrix in a distributed manner. It uses a 
>> SequenceFile<IntWritable,VectorWritable> as input and outputs such a file 
>> too. Custom similarity implementations can be supplied; I've already 
>> implemented Tanimoto and cosine for demo and testing purposes. The algorithm 
>> is based on the one presented here: 
>> http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf
>> I'd be glad if someone could verify the applicability of this approach by 
>> running it with a reasonably large input. I'm also worried that it might 
>> buffer too much data in certain steps.
>> If you decide to include it in Mahout, some more effort and decisions (like 
>> more tests, more similarity measures, integration with DistributedRowMatrix) 
>> would be needed, I guess.
>>     
>   
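For readers who want a feel for the row-similarity computation described in the issue, here is a minimal single-machine sketch in Python. It follows the general inverted-index idea of the linked paper (invert the matrix by column, then sum partial products for every pair of rows that share a column) and uses cosine as the measure. It is not the code from the patch; the function name `pairwise_cosine` and the dict-based matrix layout are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def pairwise_cosine(rows):
    """Cosine similarity for every pair of rows that share a column.

    rows maps an integer row id to a dict of column id -> non-zero weight.
    """
    # "Map" phase: invert the matrix so each column lists the rows using it.
    columns = defaultdict(list)
    for row_id, vector in rows.items():
        for col_id, weight in vector.items():
            columns[col_id].append((row_id, weight))

    # "Reduce" phase: each pair of rows in a column's list contributes one
    # partial product to that pair's dot product.
    dots = defaultdict(float)
    for postings in columns.values():
        for (a, wa), (b, wb) in combinations(sorted(postings), 2):
            dots[(a, b)] += wa * wb

    # Normalise the dot products by the row lengths to get cosine values.
    norms = {r: sqrt(sum(w * w for w in v.values())) for r, v in rows.items()}
    return {(a, b): dot / (norms[a] * norms[b]) for (a, b), dot in dots.items()}
```

A column shared by n rows produces n * (n - 1) / 2 pairs, so a few very common columns dominate the work. That is presumably where the buffering concern comes from, and why a sparse input helps.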
