[jira] Commented: (MAHOUT-344) Minhash based clustering

Ankur (JIRA) Thu, 30 Sep 2010 07:37:59 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916491#action_12916491
 ]


Ankur commented on MAHOUT-344:
------------------------------

Having the seed as fixed parameters does not give us a family of hash 
functions, it will be the same hash functions repeated K times. What we need is 
a family of hash functions F where each function  h(s) is defined as follows:-
h(s) = index of first element in a random permutation of all possible values 
that is also present in set s
Then probability that two sets agree on a hash value is equal to their Jaccard 
coefficient i.e.
P[h(s1) = h(s2) ] = (s1 intersect s2) / (s1 union s2)

Having multiple different hash functions is an implementation trick that 
simulates the random permutation of all possible values. Having said that, 
there is nothing that prevents our seed generator to generate duplicate seeds, 
essentially making 2 or more hash functions identical. Also I think it will 
make a difference if we just accept prime seeds.  What do you think ?
Let me make the change of discarding duplicate seeds and accepting only primes 
to see what happens.

BTW I completed the javadoc comments and also added support for 
http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-1K.tar.gz dataset 
which I think is more dense.
I will post a new patch shortly.      

> Minhash based clustering 
> -------------------------
>
>                 Key: MAHOUT-344
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-344
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.3
>            Reporter: Ankur
>            Assignee: Ankur
>             Fix For: 0.4
>
>         Attachments: MAHOUT-344-v1.patch, MAHOUT-344-v2.patch, 
> MAHOUT-344-v3.patch, MAHOUT-344-v4.patch, MAHOUT-344-v5.patch, 
> MAHOUT-344-v6.patch
>
>
> Minhash clustering performs probabilistic dimension reduction of high 
> dimensional data. The essence of the technique is to hash each item using 
> multiple independent hash functions such that the probability of collision of 
> similar items is higher. Multiple such hash tables can then be constructed  
> to answer near neighbor type of queries efficiently.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-344) Minhash based clustering

Reply via email to