[ https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849598#action_12849598 ]
Ankur commented on MAHOUT-344:
------------------------------

I appreciate your interest in this. I'd suggest that we pick a dataset to try this on and then make changes as required.

Some ideas:
1. Cluster a p2p dataset such as http://warsteiner.db.cs.cmu.edu/db-site/Datasets/graphData/eDoneky-p2p/ to find nodes that are close to each other.
2. Cluster similar items/songs in http://www.iua.upf.es/~ocelma/MusicRecommendationDataset/index.html for recommendations.

Things still missing from the implementation:
1. An option to plug in more hash functions for users to experiment with.
2. Code for evaluating cluster goodness (precision/recall tests?).
3. Unit tests, for completeness.

Maybe other Mahout folks can take a quick look at the patch and suggest more ideas.

> Minhash based clustering
> ------------------------
>
>                 Key: MAHOUT-344
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-344
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.3
>            Reporter: Ankur
>            Assignee: Ankur
>         Attachments: MAHOUT-344-v1.patch
>
>
> Minhash clustering performs probabilistic dimension reduction of
> high-dimensional data. The essence of the technique is to hash each item
> using multiple independent hash functions, so that similar items are more
> likely to collide than dissimilar ones. Multiple such hash tables can then
> be constructed to answer near-neighbor queries efficiently.
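
To make the description above concrete, here is a minimal Python sketch of the general minhash idea. It is not the code in MAHOUT-344-v1.patch. Every name in it (`minhash_signature`, `estimate_jaccard`, `build_tables`, `candidates`) is hypothetical, the seeded BLAKE2b hash only stands in for "multiple independent hash functions", and splitting the signature into bands to get several hash tables is one common way to do it, assumed here rather than taken from the patch.

```python
import hashlib
from collections import defaultdict


def _hash(seed, token):
    """Stable 64-bit hash of a token; each seed acts as one hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=20):
    """For each hash function, keep the minimum hash value over the item set."""
    return tuple(min(_hash(seed, x) for x in items) for seed in range(num_hashes))


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def build_tables(signatures, bands, rows):
    """Build one hash table per band, keyed by that band's slice of the signature.

    Assumption: every signature has at least bands * rows entries.
    """
    tables = [defaultdict(set) for _ in range(bands)]
    for name, sig in signatures.items():
        if len(sig) < bands * rows:
            raise ValueError("signature shorter than bands * rows")
        for b in range(bands):
            tables[b][sig[b * rows:(b + 1) * rows]].add(name)
    return tables


def candidates(tables, sig, rows):
    """Names that share a bucket with `sig` in at least one table."""
    found = set()
    for b, table in enumerate(tables):
        found |= table.get(sig[b * rows:(b + 1) * rows], set())
    return found


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    users = {
        "u1": {"song_a", "song_b", "song_c", "song_d"},
        "u2": {"song_a", "song_b", "song_c", "song_e"},
        "u3": {"song_x", "song_y", "song_z"},
    }
    sigs = {name: minhash_signature(items) for name, items in users.items()}
    tables = build_tables(sigs, bands=10, rows=2)
    print(estimate_jaccard(sigs["u1"], sigs["u2"]))
    print(candidates(tables, sigs["u1"], rows=2))
```

With more rows per band, fewer dissimilar items land in the same bucket; with more bands, fewer similar items are missed. That trade-off is probably what an "option of more hash functions" would let users explore.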