[ https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851756#action_12851756 ]
Ankur commented on MAHOUT-344:
------------------------------

Drew, thanks for pitching in as I've been running super busy with some crap :-)

@Cristi That's right, but it's totally unnecessary, as each of the mappers can do its own initialization of the hash functions. They will end up with the same hash functions if they all use the same seed for java.util.Random(). So the distributed cache can be removed altogether with that change. The code will be shorter and simpler.

What is the min-cluster size you are using? How many hash functions? How many hashes are grouped together?

We will need some tests to show how good the clusters are. As a start we can compute a simple metric, like the average similarity of items within a cluster, aggregated over all clusters.

> Minhash based clustering
> -------------------------
>
>                 Key: MAHOUT-344
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-344
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.3
>            Reporter: Ankur
>            Assignee: Ankur
>         Attachments: MAHOUT-344-v1.patch
>
>
> Minhash clustering performs probabilistic dimension reduction of high
> dimensional data. The essence of the technique is to hash each item using
> multiple independent hash functions such that the probability of collision of
> similar items is higher. Multiple such hash tables can then be constructed
> to answer near neighbor type of queries efficiently.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
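
To illustrate the two suggestions in the comment above (seeding the hash functions identically in every mapper, and scoring clusters by average within-cluster similarity), here is a minimal sketch. It is not the code from MAHOUT-344-v1.patch. The class name SeededMinHashSketch, the linear hash form h(x) = (a*x + b) mod P, the choice of prime, and the use of Jaccard similarity as the similarity measure are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Illustrative sketch only; names and hash form are hypothetical. */
final class SeededMinHashSketch {

    // Assumed modulus for the linear hash h(x) = (a*x + b) mod PRIME (2^31 - 1).
    private static final long PRIME = 2147483647L;

    /**
     * Builds numFunctions (a, b) coefficient pairs from a seed. Every caller
     * that passes the same seed gets the same pairs, so each mapper could call
     * this in its setup step instead of reading them from a distributed cache.
     */
    static long[][] createHashFunctions(int numFunctions, long seed) {
        Random random = new Random(seed);
        long[][] coefficients = new long[numFunctions][2];
        for (int i = 0; i < numFunctions; i++) {
            coefficients[i][0] = 1 + random.nextInt(Integer.MAX_VALUE - 1); // a in [1, PRIME - 1]
            coefficients[i][1] = random.nextInt(Integer.MAX_VALUE);         // b in [0, PRIME - 1]
        }
        return coefficients;
    }

    /** Minhash signature: for each hash function, the minimum hash over the item's features. */
    static long[] minHashSignature(Set<Integer> item, long[][] functions) {
        long[] signature = new long[functions.length];
        Arrays.fill(signature, Long.MAX_VALUE);
        for (int feature : item) {
            for (int i = 0; i < functions.length; i++) {
                long h = Math.floorMod(functions[i][0] * feature + functions[i][1], PRIME);
                signature[i] = Math.min(signature[i], h);
            }
        }
        return signature;
    }

    /** Jaccard similarity |A ∩ B| / |A ∪ B|; assumed here as the similarity measure. */
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    /**
     * Mean pairwise similarity inside each cluster, averaged over all clusters
     * that contain at least two items. Returns 0.0 if there are none.
     */
    static double averageIntraClusterSimilarity(List<List<Set<Integer>>> clusters) {
        double total = 0.0;
        int counted = 0;
        for (List<Set<Integer>> cluster : clusters) {
            int n = cluster.size();
            if (n < 2) {
                continue; // a singleton has no pairs to compare
            }
            double sum = 0.0;
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    sum += jaccard(cluster.get(i), cluster.get(j));
                }
            }
            total += sum / (n * (n - 1) / 2.0);
            counted++;
        }
        return counted == 0 ? 0.0 : total / counted;
    }
}
```

In a MapReduce job, the seed and the number of hash functions would presumably be passed through the job configuration so that all mappers agree on them. The pairwise loop is quadratic in cluster size, so for large clusters a sampled estimate may be more practical.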