group Id should be included in clusterId for MinHash clustering
---------------------------------------------------------------
Key: MAHOUT-579
URL: https://issues.apache.org/jira/browse/MAHOUT-579
Project: Mahout
Issue Type: Bug
Components: Clustering
Affects Versions: 0.4
Reporter: Forest Tan
Fix For: 0.4
Current implementation of MinHash clustering use N groups of hash value as
clusterid, e.g., 10003226-1109023
And the code(MinHashMapper.java) is as following:
for (int i = 0; i < this.numHashFunctions; i += this.keyGroups)
{
StringBuilder clusterIdBuilder = new StringBuilder();
for (int j = 0; (j < this.keyGroups) && (i + j <
this.numHashFunctions); j++)
{
clusterIdBuilder.append(this.minHashValues[(i +
j)]).append('-');
}
String clusterId = clusterIdBuilder.toString();
clusterId = clusterId.substring(0, clusterId.lastIndexOf('-'));
Text cluster = new Text(clusterId);
Writable point;
if (this.debugOutput)
point = new VectorWritable(featureVector.clone());
else
{
point = new Text(item.toString());
}
context.write(cluster, point);
}
For example, when KEY_GROUPS=1, NUM_HASH_FUNCTIONS=2, and minhash result is:
userid, minhash1, minhash2
A, 100, 200
B, 200, 100
the clustering result will be:
clusterid, userid
100, A
200, A
200, B
100, B
And user A, B will be in the same cluster 100 and 200.
However, the first and the second hash functions are different, so, it doesn't
mean the two users are similar even if minhash1 of A equals to minhash2 of B.
The fix is easy, just change the line
clusterId = clusterId.substring(0, clusterId.lastIndexOf('-'));
to
clusterId = clusterId + i;
After the fix, the clustering result will be:
clusterid, userid
100-0, A
200-1, A
200-0, B
100-1, B
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.