[
https://issues.apache.org/jira/browse/MAHOUT-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916475#action_12916475
]
Sean Owen commented on MAHOUT-344:
----------------------------------
I see. So really we might as well take the first 6 integers generated by "new
Random(11)" and stick them in as the fixed parameters for these hash functions.
If we do that, shouldn't we make sure to pick good values rather than leave it
to the RNG? For example, it's bad if "seedA" in linear hash is 0. I'm sure it
isn't. But if both seeds are even, that's not so great, I think?
That is, what happens if I just stick in some arbitrary primes here?
Would that remove the need to reduce modulo a large prime at the end?
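To make the two concerns concrete, here is a minimal sketch. It assumes the linear hash has a multiply-and-add shape followed by a reduction modulo a large prime; the comment does not show the actual code, so the class name, the "seedB" parameter, the starting value 31, and the choice of 2^31 - 1 as the prime are all illustrative assumptions, not the patch's implementation.

```java
final class LinearHashSketch {
    // Illustrative modulus only: 2^31 - 1 is a (Mersenne) prime.
    static final long PRIME = 2147483647L;

    // Raw multiply-and-add accumulation, before any modulo reduction.
    // long arithmetic wraps modulo 2^64 on overflow.
    static long raw(byte[] bytes, int seedA, int seedB) {
        long h = 31;
        for (byte b : bytes) {
            h = h * ((long) seedA * b) + seedB;
        }
        return h;
    }

    // Reduce into [0, PRIME) so the result fits in a non-negative int.
    static int hash(byte[] bytes, int seedA, int seedB) {
        return (int) Math.floorMod(raw(bytes, seedA, seedB), PRIME);
    }

    public static void main(String[] args) {
        byte[] x = {1, 2, 3};
        byte[] y = {9, 8, 7, 6};
        // seedA == 0 wipes out the input: both calls print the same value.
        System.out.println(hash(x, 0, 7) + " " + hash(y, 0, 7));
        // Both seeds even: the low bit of the raw value is always 0.
        System.out.println((raw(x, 4, 6) & 1) + " " + (raw(y, 4, 6) & 1));
    }
}
```

In this sketch a zero "seedA" makes every non-empty input hash to the same value, and two even seeds leave the raw value's low bit stuck at 0. The final reduction modulo an odd prime mixes that stuck bit back into the result, which is one reason the modulo step may still be worth keeping even with hand-picked seeds.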
(Also, does it matter that 'byteVal' values can be negative? It doesn't really
seem so, from the math, but I stopped to wonder at it for a moment.)
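On the sign question, a small sketch of the Java behavior involved: bytes are signed, so any byte at or above 0x80 reads as a negative number, and the % operator keeps the sign of the dividend. The helper names here are hypothetical and only illustrate two common ways to keep things non-negative; whether the hash code in the patch needs either is an open question.

```java
final class SignedByteSketch {
    // Interpret a Java byte as an unsigned value in 0..255.
    static int unsigned(byte b) {
        return b & 0xFF;
    }

    // Remainder that always lands in [0, m) for positive m,
    // unlike %, which can return a negative result.
    static long nonNegativeMod(long value, long m) {
        return Math.floorMod(value, m);
    }

    public static void main(String[] args) {
        byte b = (byte) 0xE9;                        // stored as a negative byte
        System.out.println(b + " vs " + unsigned(b));
        System.out.println((-7L % 5L) + " vs " + nonNegativeMod(-7L, 5L));
    }
}
```

So negative byte values do not break the arithmetic, but they can make an intermediate value negative, and then a plain % yields a negative remainder unless something like Math.abs or Math.floorMod is applied afterwards.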
> Minhash based clustering
> -------------------------
>
> Key: MAHOUT-344
> URL: https://issues.apache.org/jira/browse/MAHOUT-344
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.3
> Reporter: Ankur
> Assignee: Ankur
> Fix For: 0.4
>
> Attachments: MAHOUT-344-v1.patch, MAHOUT-344-v2.patch,
> MAHOUT-344-v3.patch, MAHOUT-344-v4.patch, MAHOUT-344-v5.patch,
> MAHOUT-344-v6.patch
>
>
> Minhash clustering performs probabilistic dimension reduction of high
> dimensional data. The essence of the technique is to hash each item using
> multiple independent hash functions such that the probability of collision of
> similar items is higher. Multiple such hash tables can then be constructed
> to answer near neighbor type of queries efficiently.