[jira] Issue Comment Edited: (MAHOUT-579) group Id should be included in clusterId for MinHash clustering

Ankur (JIRA) Thu, 13 Jan 2011 00:26:16 -0800

    [ 
https://issues.apache.org/jira/browse/MAHOUT-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981169#action_12981169
 ]


Ankur edited comment on MAHOUT-579 at 1/13/11 3:23 AM:
-------------------------------------------------------

By collision I mean two items agreeing on one or more hash signatures from 
their  minhash set. The order of hash functions is irrelevant. 

If I understand correctly, what you are suggesting is to suffix each computed 
Minhash value with the hash-function number.  In the hash function, we do a 
(val % p).  This is chosen in the current implementation to be a large prime 
approaching 2^31.  As long as the total number of unique features does NOT 
cross this limit, I don't think we'll see the problem you are suggesting.  

However, it will be good if you could run the MInhash code with & without this 
change on a real dataset and produce some empirical comparison of results.

      was (Author: ankur):
    By collision I mean two items agreeing on one or more hash signatures from 
their  minhash set. The order of hash functions is irrelevant. 

If I understand correctly, what you are suggesting is to suffix each computed 
Minhash value with the hash-function number.  In the hash function, we do a 
(val % p).  This is chosen in the current implementation to be a large prime 
approaching 2^31.  As long as the total number of unique features does cross 
this limit, I don't think we'll see the problem you are suggesting.  

However, it will be good if you could run the MInhash code with & without this 
change on a real dataset and produce some empirical comparison of results.
  
> group Id should be included in clusterId for MinHash clustering
> ---------------------------------------------------------------
>
>                 Key: MAHOUT-579
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-579
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Forest Tan
>            Assignee: Ankur
>             Fix For: 0.5
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Current implementation of MinHash clustering use N groups of hash value as 
> clusterid, e.g., 10003226-1109023
> And the code(MinHashMapper.java) is as following:
> for (int i = 0; i < this.numHashFunctions; i += this.keyGroups)
>         {
>             StringBuilder clusterIdBuilder = new StringBuilder();
>             for (int j = 0; (j < this.keyGroups) && (i + j < 
> this.numHashFunctions); j++)
>             {
>                 clusterIdBuilder.append(this.minHashValues[(i + 
> j)]).append('-');
>             }
>             String clusterId = clusterIdBuilder.toString();
>             clusterId = clusterId.substring(0, clusterId.lastIndexOf('-'));
>             Text cluster = new Text(clusterId);
>             Writable point;
>             if (this.debugOutput)
>                 point = new VectorWritable(featureVector.clone());
>             else
>             {
>                 point = new Text(item.toString());
>             }
>             context.write(cluster, point);
>         }
> For example, when KEY_GROUPS=1, NUM_HASH_FUNCTIONS=2, and minhash result is:
> userid, minhash1, minhash2
> A, 100, 200
> B, 200, 100
> the clustering result will be:
> clusterid, userid
> 100, A
> 200, A
> 200, B
> 100, B
> And user A, B will be in the same cluster 100 and 200. 
> However, the first and the second hash functions are different, so, it 
> doesn't mean the two users are similar even if minhash1 of A equals to 
> minhash2 of B.
> The fix is easy, just change the line
> clusterId = clusterId.substring(0, clusterId.lastIndexOf('-'));
> to
> clusterId = clusterId + i;
> After the fix, the clustering result will be:
> clusterid, userid
> 100-0, A
> 200-1, A
> 200-0, B
> 100-1, B

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (MAHOUT-579) group Id should be included in clusterId for MinHash clustering

Reply via email to