[ 
https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated MAHOUT-344:
-------------------------

    Attachment: MAHOUT-344-v4.patch

Finally some action from my side ;-)

1. HashFunction is now an interface with a single method, hash().
2. The implementations of the different hash functions have moved to a 
HashFactory, which also provides a factory method for fetching hash functions 
of a requested type (linear, polynomial, murmur).
3. Minhash mapper/reducer code cleaned up quite a bit.
4. Added options for minimum vector size and hashType.
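
For readers without the patch at hand, the shape described in items 1 and 2 
might look roughly like the sketch below. This is not the code in 
MAHOUT-344-v4.patch: the hash(byte[]) signature, the HashType enum, the 
createHashFunctions name and the seed parameter are all assumptions made for 
illustration, and the murmur variant is left out to keep it short.

```java
import java.util.Random;

// Hypothetical sketch only; the real signatures are in the attached patch.
interface HashFunction {
    // Assumed signature: the comment above only says the method is hash().
    int hash(byte[] bytes);
}

final class HashFactory {
    // Assumed enum; murmur is omitted from this sketch.
    enum HashType { LINEAR, POLYNOMIAL }

    // 2^31 - 1, a Mersenne prime, used as the modulus.
    private static final long PRIME = Integer.MAX_VALUE;

    private HashFactory() { }

    // Returns `count` hash functions of the requested type, seeded for repeatability.
    static HashFunction[] createHashFunctions(HashType type, int count, long seed) {
        Random rng = new Random(seed);
        HashFunction[] functions = new HashFunction[count];
        for (int i = 0; i < count; i++) {
            final long a = 1 + rng.nextInt(Integer.MAX_VALUE - 1); // non-zero multiplier
            final long b = rng.nextInt(Integer.MAX_VALUE);
            final long c = rng.nextInt(Integer.MAX_VALUE);
            switch (type) {
                case LINEAR:
                    // h(x) = (a*x + b) mod p
                    functions[i] = bytes -> (int) ((a * fold(bytes) + b) % PRIME);
                    break;
                case POLYNOMIAL:
                    // h(x) = (a*x^2 + b*x + c) mod p, reduced stepwise to avoid overflow
                    functions[i] = bytes -> {
                        long x = fold(bytes);
                        long square = (x * x) % PRIME;
                        return (int) (((a * square) % PRIME + (b * x) % PRIME + c) % PRIME);
                    };
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported hash type: " + type);
            }
        }
        return functions;
    }

    // Folds the input bytes into a single value in [0, PRIME).
    private static long fold(byte[] bytes) {
        long value = 0;
        for (byte x : bytes) {
            value = (value * 31 + (x & 0xff)) % PRIME;
        }
        return value;
    }
}
```

A caller would then ask for, say, 
HashFactory.createHashFunctions(HashFactory.HashType.LINEAR, 10, 42L) and 
apply each returned function to the same input.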

Pending tasks
1. Fix the unit test case.
2. Fix the example code over the Last.fm dataset.
3. Add Javadoc documentation.

I hope to complete the above tasks by EOD tomorrow and submit a new patch.


> Minhash based clustering 
> -------------------------
>
>                 Key: MAHOUT-344
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-344
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.3
>            Reporter: Ankur
>            Assignee: Ankur
>             Fix For: 0.4
>
>         Attachments: MAHOUT-344-v1.patch, MAHOUT-344-v2.patch, 
> MAHOUT-344-v3.patch, MAHOUT-344-v4.patch
>
>
> Minhash clustering performs probabilistic dimension reduction of 
> high-dimensional data. The essence of the technique is to hash each item 
> using multiple independent hash functions such that similar items are more 
> likely to collide. Multiple such hash tables can then be constructed to 
> answer near-neighbor queries efficiently.
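
The idea in the description above can be illustrated with a small sketch. It 
is not the Mahout mapper/reducer code: the MinHashSketch class, its method 
names and the use of simple linear hashes (a*x + b) mod p are assumptions 
chosen for brevity. Each item set is reduced to a short signature of per-hash 
minimums, and the fraction of matching signature positions gives a rough 
estimate of the Jaccard similarity of the two sets.

```java
import java.util.Random;
import java.util.Set;

// Hypothetical illustration of minhash signatures; not the patch's implementation.
final class MinHashSketch {
    // 2^31 - 1, a Mersenne prime, used as the modulus.
    private static final long PRIME = Integer.MAX_VALUE;

    private MinHashSketch() { }

    // Builds a signature: for each of numHashes hash functions, keep the minimum hash value.
    // Both sets being compared must use the same numHashes and seed.
    static int[] signature(Set<Integer> items, int numHashes, long seed) {
        if (items.isEmpty()) {
            throw new IllegalArgumentException("items must not be empty");
        }
        Random rng = new Random(seed);
        int[] sig = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            long a = 1 + rng.nextInt(Integer.MAX_VALUE - 1); // non-zero multiplier
            long b = rng.nextInt(Integer.MAX_VALUE);
            long min = Long.MAX_VALUE;
            for (int item : items) {
                long h = (a * Math.floorMod((long) item, PRIME) + b) % PRIME;
                min = Math.min(min, h);
            }
            sig[i] = (int) min;
        }
        return sig;
    }

    // Fraction of positions where the two signatures agree.
    static double estimateSimilarity(int[] first, int[] second) {
        if (first.length != second.length || first.length == 0) {
            throw new IllegalArgumentException("signatures must be non-empty and of equal length");
        }
        int matches = 0;
        for (int i = 0; i < first.length; i++) {
            if (first[i] == second[i]) {
                matches++;
            }
        }
        return (double) matches / first.length;
    }
}
```

Grouping signature positions into bands and using each band as a key in its 
own hash table is one common way to build the multiple hash tables the 
description mentions; that step is not shown here.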

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
