[jira] Commented: (MAHOUT-344) Minhash based clustering

Ankur (JIRA) Wed, 31 Mar 2010 00:30:52 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851756#action_12851756
 ]


Ankur commented on MAHOUT-344:
------------------------------

Drew, thanks for pitching in as I've been running super busy with some crap :-)

@Cristi
That's right, but it's totally unnecessary, as each of the mappers can do its 
own initialization of the hash functions. They will all get the same hash 
functions if they use the same seed for java.util.Random. So the distributed 
cache can be removed altogether with that change. The code will be shorter and 
simpler.
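
A rough sketch of the idea, not the code from the patch: the class name, the 
linear hash family and the prime modulus below are all assumptions made for 
illustration. The only point is that a fixed seed makes every mapper derive 
identical hash functions locally, so nothing has to be shipped around.

```java
import java.util.Random;

// Hypothetical helper, not part of MAHOUT-344-v1.patch.
final class SeededHashFunctions {
    // Modulus for the hash family (2^31 - 1); an illustrative choice.
    static final long PRIME = 2147483647L;

    /** Linear hash h(x) = (a * x + b) mod PRIME. */
    static final class LinearHash {
        final long a;
        final long b;

        LinearHash(long a, long b) {
            this.a = a;
            this.b = b;
        }

        long hash(int x) {
            // a < 2^31 and |x| <= 2^31, so a * x + b fits in a long.
            return Math.floorMod(a * x + b, PRIME);
        }
    }

    /**
     * Builds numFunctions hash functions from a seed. Calling this with the
     * same arguments in every mapper yields the same functions, because
     * java.util.Random is deterministic for a given seed.
     */
    static LinearHash[] create(int numFunctions, long seed) {
        Random random = new Random(seed);
        LinearHash[] functions = new LinearHash[numFunctions];
        for (int i = 0; i < numFunctions; i++) {
            long a = 1 + random.nextInt(Integer.MAX_VALUE - 1); // 1 .. PRIME - 1
            long b = random.nextInt(Integer.MAX_VALUE);         // 0 .. PRIME - 1
            functions[i] = new LinearHash(a, b);
        }
        return functions;
    }
}
```

Each mapper would call something like SeededHashFunctions.create(numHashFunctions, seed) 
in its setup step, with the seed and count read from the job configuration.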

What is the min-cluster size you are using? How many hash functions? How 
many hashes are grouped together? 
We will need some tests to show how good the clusters are. As a start we can 
compute a simple metric like the average similarity of items within a cluster, 
aggregated over all clusters.
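
One possible way to compute such a metric, as a sketch only: it assumes items 
are sets of integer feature ids, uses Jaccard similarity (the measure minhash 
approximates), and averages the per-cluster mean pairwise similarity over all 
clusters that have at least two items. The class and method names are made up 
for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for evaluating cluster quality.
final class ClusterQuality {

    /** Jaccard similarity |A n B| / |A u B|; two empty sets count as identical. */
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int union = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / union;
    }

    /** Mean Jaccard similarity over all unordered pairs of items in one cluster. */
    static double averagePairwiseSimilarity(List<Set<Integer>> cluster) {
        double sum = 0.0;
        int pairs = 0;
        for (int i = 0; i < cluster.size(); i++) {
            for (int j = i + 1; j < cluster.size(); j++) {
                sum += jaccard(cluster.get(i), cluster.get(j));
                pairs++;
            }
        }
        return pairs == 0 ? 0.0 : sum / pairs;
    }

    /**
     * Mean of the per-cluster averages. Clusters with fewer than two items
     * have no pairs and are skipped; returns 0.0 if no cluster qualifies.
     */
    static double averageOverClusters(List<List<Set<Integer>>> clusters) {
        double sum = 0.0;
        int counted = 0;
        for (List<Set<Integer>> cluster : clusters) {
            if (cluster.size() >= 2) {
                sum += averagePairwiseSimilarity(cluster);
                counted++;
            }
        }
        return counted == 0 ? 0.0 : sum / counted;
    }
}
```

The pairwise loop is quadratic in the cluster size, so for large clusters one 
would probably sample pairs instead of enumerating all of them.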


> Minhash based clustering 
> -------------------------
>
>                 Key: MAHOUT-344
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-344
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.3
>            Reporter: Ankur
>            Assignee: Ankur
>         Attachments: MAHOUT-344-v1.patch
>
>
> Minhash clustering performs probabilistic dimension reduction of high 
> dimensional data. The essence of the technique is to hash each item using 
> multiple independent hash functions such that the probability of collision of 
> similar items is higher. Multiple such hash tables can then be constructed  
> to answer near neighbor type of queries efficiently.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
