[ 
https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849598#action_12849598
 ] 

Ankur commented on MAHOUT-344:
------------------------------

Appreciate your interest in this. I'd suggest that we pick a dataset to try 
this on and then make changes as required. 
Some ideas:
    1.  Clustering a p2p dataset like this one - 
http://warsteiner.db.cs.cmu.edu/db-site/Datasets/graphData/eDoneky-p2p/ to find 
nodes that are close to each other. 
    2.  Clustering similar items/songs in this - 
http://www.iua.upf.es/~ocelma/MusicRecommendationDataset/index.html for 
recommendations.

As for what is missing from the implementation:
   1. An option for more hash functions for users to experiment with. 
   2. Code for cluster goodness evaluation (Precision/Recall tests?)
   3. Unit tests for completeness.

Maybe other Mahout folks can take a quick look at the patch and suggest more 
ideas.

> Minhash based clustering 
> -------------------------
>
>                 Key: MAHOUT-344
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-344
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.3
>            Reporter: Ankur
>            Assignee: Ankur
>         Attachments: MAHOUT-344-v1.patch
>
>
> Minhash clustering performs probabilistic dimension reduction of high 
> dimensional data. The essence of the technique is to hash each item using 
> multiple independent hash functions such that similar items are more likely 
> to collide than dissimilar ones. Multiple such hash tables can then be 
> constructed to answer near-neighbor queries efficiently.
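
To make the description above concrete, here is a minimal sketch of the idea in Python. It is not taken from MAHOUT-344-v1.patch: the linear hash family, the prime modulus, and all function names (make_hash_functions, minhash_signature, cluster_keys, estimate_similarity) are assumptions chosen for illustration. Items are assumed to be non-empty sets of non-negative integer feature ids smaller than the modulus.

```python
import random

# Assumed modulus: the Mersenne prime 2^31 - 1. Feature ids must be below it.
PRIME = 2_147_483_647


def make_hash_functions(num_hashes, seed=42):
    """Draw (a, b) pairs for hash functions of the form h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(features, hash_functions):
    """For each hash function, keep the minimum hash value over the item's features."""
    return tuple(min((a * x + b) % PRIME for x in features)
                 for a, b in hash_functions)


def cluster_keys(signature, keys_per_group):
    """Split the signature into groups; each group acts as a key into one hash table.

    Two items land in the same bucket of table i when their i-th groups match.
    """
    return [(start // keys_per_group, signature[start:start + keys_per_group])
            for start in range(0, len(signature), keys_per_group)]


def estimate_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    hashes = make_hash_functions(20)
    user_a = minhash_signature({1, 2, 3, 4, 5}, hashes)
    user_b = minhash_signature({1, 2, 3, 4, 6}, hashes)
    print("estimated similarity:", estimate_similarity(user_a, user_b))
    print("shared buckets:",
          set(cluster_keys(user_a, 2)) & set(cluster_keys(user_b, 2)))
```

Fewer values per group makes collisions between loosely similar items more likely, while more values per group makes buckets stricter; the number of hash functions and the group size are the main knobs to experiment with.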

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
