[
https://issues.apache.org/jira/browse/MAHOUT-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12981169#action_12981169
]
Ankur edited comment on MAHOUT-579 at 1/13/11 3:23 AM:
-------------------------------------------------------
By collision I mean two items agreeing on one or more hash signatures from
their minhash set. The order of hash functions is irrelevant.
If I understand correctly, what you are suggesting is to suffix each computed
Minhash value with the hash-function number. In the hash function, we do a
(val % p). This is chosen in the current implementation to be a large prime
approaching 2^31. As long as the total number of unique features does NOT
cross this limit, I don't think we'll see the problem you are suggesting.
However, it will be good if you could run the MInhash code with & without this
change on a real dataset and produce some empirical comparison of results.
was (Author: ankur):
By collision I mean two items agreeing on one or more hash signatures from
their minhash set. The order of hash functions is irrelevant.
If I understand correctly, what you are suggesting is to suffix each computed
Minhash value with the hash-function number. In the hash function, we do a
(val % p). This is chosen in the current implementation to be a large prime
approaching 2^31. As long as the total number of unique features does cross
this limit, I don't think we'll see the problem you are suggesting.
However, it will be good if you could run the MInhash code with & without this
change on a real dataset and produce some empirical comparison of results.
> group Id should be included in clusterId for MinHash clustering
> ---------------------------------------------------------------
>
> Key: MAHOUT-579
> URL: https://issues.apache.org/jira/browse/MAHOUT-579
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.4
> Reporter: Forest Tan
> Assignee: Ankur
> Fix For: 0.5
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Current implementation of MinHash clustering use N groups of hash value as
> clusterid, e.g., 10003226-1109023
> And the code(MinHashMapper.java) is as following:
> for (int i = 0; i < this.numHashFunctions; i += this.keyGroups)
> {
> StringBuilder clusterIdBuilder = new StringBuilder();
> for (int j = 0; (j < this.keyGroups) && (i + j <
> this.numHashFunctions); j++)
> {
> clusterIdBuilder.append(this.minHashValues[(i +
> j)]).append('-');
> }
> String clusterId = clusterIdBuilder.toString();
> clusterId = clusterId.substring(0, clusterId.lastIndexOf('-'));
> Text cluster = new Text(clusterId);
> Writable point;
> if (this.debugOutput)
> point = new VectorWritable(featureVector.clone());
> else
> {
> point = new Text(item.toString());
> }
> context.write(cluster, point);
> }
> For example, when KEY_GROUPS=1, NUM_HASH_FUNCTIONS=2, and minhash result is:
> userid, minhash1, minhash2
> A, 100, 200
> B, 200, 100
> the clustering result will be:
> clusterid, userid
> 100, A
> 200, A
> 200, B
> 100, B
> And user A, B will be in the same cluster 100 and 200.
> However, the first and the second hash functions are different, so, it
> doesn't mean the two users are similar even if minhash1 of A equals to
> minhash2 of B.
> The fix is easy, just change the line
> clusterId = clusterId.substring(0, clusterId.lastIndexOf('-'));
> to
> clusterId = clusterId + i;
> After the fix, the clustering result will be:
> clusterid, userid
> 100-0, A
> 200-1, A
> 200-0, B
> 100-1, B
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.