[jira] Commented: (MAHOUT-344) Minhash based clustering

Ankur (JIRA) Wed, 22 Sep 2010 07:05:01 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913560#action_12913560
 ]


Ankur commented on MAHOUT-344:
------------------------------

Hi Ted,
             There is a bit of work left on this, specifically:

1. Adding MurmurHash as an option among the available hash functions. It looks 
like I can use the check-in from MAHOUT-503.
2. Completing the unit test cases that demonstrate correctness.
3. Example code completion and cleanup. After a bit of thought, I feel that the 
average item similarity across all clusters might not be a good criterion for a 
probabilistic clustering technique like this. Instead, I am planning to write a 
unit test to calculate precision as follows:
             - Set a minimum similarity threshold for intra-cluster items, e.g. 
0.4.
             - For each cluster in a randomly selected subset of all 
clusters, run a pairwise similarity test to count the true positives (TP) that 
pass the threshold.
             - Precision = (TP/Total-items-in-clusters)
Do you see any problems? Any other suggestions?
4. Documentation and cleanup.
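
To make the precision check proposed in item 3 concrete, here is a minimal 
sketch in Python. It is not the patch's test code. The function names, the 
use of Jaccard similarity, and the rule for what counts as a true positive 
are all assumptions: an item is counted as a TP when it is at least 0.4 
similar to at least one other item in the same cluster, so that TP and the 
denominator are both item counts.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two feature sets (two empty sets count as identical)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_precision(clusters, threshold=0.4, sample_size=None, seed=0):
    """Estimate precision over a random subset of clusters.

    clusters: list of clusters, each a list of items, each item a set of features.
    Assumption: an item is a true positive if its similarity to at least one
    other item in the same cluster is >= threshold.
    """
    rng = random.Random(seed)
    if sample_size is not None and sample_size < len(clusters):
        sampled = rng.sample(clusters, sample_size)
    else:
        sampled = list(clusters)

    true_positives = 0
    total_items = 0
    for cluster in sampled:
        total_items += len(cluster)
        passing = set()  # indices of items with at least one similar neighbour
        for i, j in combinations(range(len(cluster)), 2):
            if jaccard(cluster[i], cluster[j]) >= threshold:
                passing.add(i)
                passing.add(j)
        true_positives += len(passing)

    # Precision = TP / total items in the sampled clusters
    return true_positives / total_items if total_items else 0.0


if __name__ == "__main__":
    example = [
        [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}],  # third item matches nothing
        [{1, 2}, {1, 2}],
    ]
    print(cluster_precision(example))
```

Under this reading, singleton clusters contribute items to the denominator 
but no true positives. A pair-based variant (passing pairs divided by total 
pairs) would be an equally reasonable interpretation of the steps above.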

I should be able to provide an updated patch by the end of this week, and with 
one more round of review and changes this should be good to go in by the end of 
next week. If that timeline sounds acceptable for 0.4 then we're good; otherwise 
we'll have to push this one out to 0.5.

> Minhash based clustering 
> -------------------------
>
>                 Key: MAHOUT-344
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-344
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.3
>            Reporter: Ankur
>            Assignee: Ankur
>             Fix For: 0.4
>
>         Attachments: MAHOUT-344-v1.patch, MAHOUT-344-v2.patch, 
> MAHOUT-344-v3.patch
>
>
> Minhash clustering performs probabilistic dimension reduction of 
> high-dimensional data. The essence of the technique is to hash each item using 
> multiple independent hash functions such that similar items are more likely to 
> collide than dissimilar ones. Multiple such hash tables can then be constructed 
> to answer near-neighbor queries efficiently.
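
The idea in the description quoted above can be illustrated with a small 
Python sketch. It is not the code in the attached patches. It assumes items 
are non-empty sets of non-negative integer feature ids smaller than the 
prime below, and it uses simple linear hash functions of the form 
(a*x + b) mod p in place of whichever hash functions the patch supports. 
All names here are hypothetical.

```python
import random

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1, used as the hash modulus


def make_hash_functions(num_hashes, seed=42):
    """Draw (a, b) coefficient pairs for hashes h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(features, hash_functions):
    """Signature = the minimum hash value of the item's features, per hash function."""
    return tuple(min((a * x + b) % PRIME for x in features)
                 for a, b in hash_functions)


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


if __name__ == "__main__":
    hashes = make_hash_functions(100)
    doc_a = {1, 2, 3, 4, 5, 6, 7, 8}
    doc_b = {1, 2, 3, 4, 5, 6, 7, 9}   # shares 7 of 9 distinct features with doc_a
    doc_c = {20, 21, 22, 23}           # shares nothing with doc_a
    sig_a, sig_b, sig_c = (minhash_signature(d, hashes)
                           for d in (doc_a, doc_b, doc_c))
    print(estimated_similarity(sig_a, sig_b))  # noisy estimate of 7/9
    print(estimated_similarity(sig_a, sig_c))
```

Items whose signatures agree on a chosen group of positions land in the same 
hash bucket, which is what makes similar items collide more often than 
dissimilar ones.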

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
