group Id should be included in clusterId for MinHash clustering
---------------------------------------------------------------

                 Key: MAHOUT-579
                 URL: https://issues.apache.org/jira/browse/MAHOUT-579
             Project: Mahout
          Issue Type: Bug
          Components: Clustering
    Affects Versions: 0.4
            Reporter: Forest Tan
             Fix For: 0.4


The current implementation of MinHash clustering uses N groups of hash values
as the clusterId, e.g., 10003226-1109023.

The code (MinHashMapper.java) is as follows:

for (int i = 0; i < this.numHashFunctions; i += this.keyGroups)
{
    StringBuilder clusterIdBuilder = new StringBuilder();
    for (int j = 0; (j < this.keyGroups) && (i + j < this.numHashFunctions); j++)
    {
        clusterIdBuilder.append(this.minHashValues[(i + j)]).append('-');
    }
    String clusterId = clusterIdBuilder.toString();
    clusterId = clusterId.substring(0, clusterId.lastIndexOf('-'));
    Text cluster = new Text(clusterId);
    Writable point;
    if (this.debugOutput)
    {
        point = new VectorWritable(featureVector.clone());
    }
    else
    {
        point = new Text(item.toString());
    }
    context.write(cluster, point);
}

For example, with KEY_GROUPS=1 and NUM_HASH_FUNCTIONS=2, suppose the minhash result is:
userid, minhash1, minhash2
A, 100, 200
B, 200, 100

the clustering result will be:
clusterid, userid
100, A
200, A
200, B
100, B

Users A and B therefore end up together in both cluster 100 and cluster 200.
However, the first and second hash functions are different, so minhash1 of A
being equal to minhash2 of B does not mean the two users are similar.
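
The collision can be reproduced outside Hadoop with a small stand-alone sketch. It mirrors the cluster id construction from the loop above in plain Java. The class and method names (MinHashCollisionDemo, clusterIds, cluster) are hypothetical and not part of Mahout, and the in-memory grouping only stands in for the shuffle/reduce step.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class, not part of Mahout.
class MinHashCollisionDemo {

    // Builds cluster ids the way the reported loop does: join the hash values
    // of each key group with '-' and drop the trailing '-'.
    static List<String> clusterIds(int[] minHashValues, int keyGroups) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < minHashValues.length; i += keyGroups) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < keyGroups && i + j < minHashValues.length; j++) {
                sb.append(minHashValues[i + j]).append('-');
            }
            String id = sb.toString();
            ids.add(id.substring(0, id.lastIndexOf('-')));
        }
        return ids;
    }

    // Groups users by cluster id (an in-memory stand-in for the reduce step).
    static Map<String, List<String>> cluster(Map<String, int[]> users, int keyGroups) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        for (Map.Entry<String, int[]> e : users.entrySet()) {
            for (String id : clusterIds(e.getValue(), keyGroups)) {
                clusters.computeIfAbsent(id, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        Map<String, int[]> users = new LinkedHashMap<>();
        users.put("A", new int[] {100, 200});
        users.put("B", new int[] {200, 100});
        // Prints {100=[A, B], 200=[A, B]}: A and B are merged although their
        // matching values come from different hash functions.
        System.out.println(cluster(users, 1));
    }
}
```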

The fix is easy. Change the line
clusterId = clusterId.substring(0, clusterId.lastIndexOf('-'));
to
clusterId = clusterId + i;
so that the trailing '-' is kept and followed by the group's starting index i.

After the fix, the clustering result will be:
clusterid, userid
100-0, A
200-1, A
200-0, B
100-1, B
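
As a quick check of the proposed change, the following is a minimal stand-alone sketch in plain Java, with no Hadoop types. The names MinHashGroupIdDemo, clusterIds and shareCluster are hypothetical. It appends the loop index i to each cluster id and tests whether two users still share a cluster.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class, not part of Mahout.
class MinHashGroupIdDemo {

    // Same loop shape as in the report, with the proposed change applied:
    // the trailing '-' is kept and the group's starting index i is appended.
    static List<String> clusterIds(int[] minHashValues, int keyGroups) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < minHashValues.length; i += keyGroups) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < keyGroups && i + j < minHashValues.length; j++) {
                sb.append(minHashValues[i + j]).append('-');
            }
            ids.add(sb.toString() + i);
        }
        return ids;
    }

    // True if the two users have at least one cluster id in common.
    static boolean shareCluster(int[] a, int[] b, int keyGroups) {
        Set<String> common = new HashSet<>(clusterIds(a, keyGroups));
        common.retainAll(clusterIds(b, keyGroups));
        return !common.isEmpty();
    }

    public static void main(String[] args) {
        int[] a = {100, 200};
        int[] b = {200, 100};
        System.out.println(clusterIds(a, 1));      // [100-0, 200-1]
        System.out.println(clusterIds(b, 1));      // [200-0, 100-1]
        System.out.println(shareCluster(a, b, 1)); // false
    }
}
```

Note that when KEY_GROUPS is greater than 1, i is the index of the first hash function in the group (0, 2, 4, ... for KEY_GROUPS=2) rather than a consecutive group number. It is still unique per group, which is all the cluster id needs.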

