[ https://issues.apache.org/jira/browse/MAHOUT-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981073#action_12981073 ]
Forest Tan commented on MAHOUT-579:
-----------------------------------
Actually, there is no collision in my example: under either hash function (HF1
or HF2), the minhash values are different.
What I mean is that it is unreasonable to treat hash values from different hash
functions as if they came from the same one. It is better to record which hash
function(s) the minhash values were calculated from.
> group Id should be included in clusterId for MinHash clustering
> ---------------------------------------------------------------
>
> Key: MAHOUT-579
> URL: https://issues.apache.org/jira/browse/MAHOUT-579
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.4
> Reporter: Forest Tan
> Assignee: Ankur
> Fix For: 0.5
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> The current implementation of MinHash clustering uses N groups of hash values
> as the cluster ID, e.g., 10003226-1109023.
> The code (MinHashMapper.java) is as follows:
> for (int i = 0; i < this.numHashFunctions; i += this.keyGroups)
> {
>     StringBuilder clusterIdBuilder = new StringBuilder();
>     for (int j = 0; (j < this.keyGroups) && (i + j < this.numHashFunctions); j++)
>     {
>         clusterIdBuilder.append(this.minHashValues[(i + j)]).append('-');
>     }
>     String clusterId = clusterIdBuilder.toString();
>     clusterId = clusterId.substring(0, clusterId.lastIndexOf('-'));
>     Text cluster = new Text(clusterId);
>     Writable point;
>     if (this.debugOutput)
>         point = new VectorWritable(featureVector.clone());
>     else
>     {
>         point = new Text(item.toString());
>     }
>     context.write(cluster, point);
> }
> For example, when KEY_GROUPS=1, NUM_HASH_FUNCTIONS=2, and the minhash results are:
> userid, minhash1, minhash2
> A, 100, 200
> B, 200, 100
> the clustering result will be:
> clusterid, userid
> 100, A
> 200, A
> 200, B
> 100, B
> So users A and B will both be placed in clusters 100 and 200.
> However, the first and second hash functions are different, so the fact that
> minhash1 of A equals minhash2 of B does not mean the two users are similar.
> The fix is easy: change the line
> clusterId = clusterId.substring(0, clusterId.lastIndexOf('-'));
> to
> clusterId = clusterId + i;
> After the fix, the clustering result will be:
> clusterid, userid
> 100-0, A
> 200-1, A
> 200-0, B
> 100-1, B
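
The before/after behaviour described in the issue can be reproduced outside Hadoop with a small standalone sketch. The class name ClusterIdSketch, the method clusterIds, and the flag appendGroupIndex are hypothetical names chosen for illustration and are not part of Mahout. The loop structure mirrors the quoted MinHashMapper code, but the Text/Writable output is replaced by a plain list of strings.

```java
import java.util.ArrayList;
import java.util.List;

public class ClusterIdSketch {

    // Builds one cluster ID per group of keyGroups consecutive minhash values.
    // appendGroupIndex=false mimics the current behaviour (trailing '-' dropped);
    // appendGroupIndex=true mimics the proposed fix (trailing '-' kept, i appended).
    static List<String> clusterIds(int[] minHashValues, int keyGroups, boolean appendGroupIndex) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < minHashValues.length; i += keyGroups) {
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < keyGroups && i + j < minHashValues.length; j++) {
                sb.append(minHashValues[i + j]).append('-');
            }
            if (appendGroupIndex) {
                sb.append(i);                  // i is the index of the group's first hash function
            } else {
                sb.setLength(sb.length() - 1); // remove the trailing '-'
            }
            ids.add(sb.toString());
        }
        return ids;
    }

    public static void main(String[] args) {
        int[] userA = {100, 200}; // minhash1, minhash2 of user A from the issue
        int[] userB = {200, 100}; // minhash1, minhash2 of user B from the issue

        System.out.println("current A: " + clusterIds(userA, 1, false));
        System.out.println("current B: " + clusterIds(userB, 1, false));
        System.out.println("fixed   A: " + clusterIds(userA, 1, true));
        System.out.println("fixed   B: " + clusterIds(userB, 1, true));
    }
}
```

With the current scheme both users emit the keys 100 and 200, so they share two clusters. With the group index appended, A emits 100-0 and 200-1 while B emits 200-0 and 100-1, so no key is shared. Note that the appended value is the loop variable i (the index of the first hash function in the group), which is still unique per group when KEY_GROUPS is greater than 1.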