GitHub user Yunni opened a pull request:
https://github.com/apache/spark/pull/15800
[SPARK-18334] MinHash should use binary hash distance
## What changes were proposed in this pull request?
MinHash currently uses the same `hashDistance` function as
RandomProjection. This does not make sense for MinHash, because the Jaccard
distance between two sets is unrelated to the absolute difference between
their hash bucket indices.
This bug could affect the accuracy of multi-probing NN search for MinHash.
The MinHash hash distance should simply be binary, since there is no notion
of distance between buckets.
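To illustrate the difference, here is a minimal, self-contained sketch. It is not the code in this patch: the object name `HashDistanceSketch`, both method names, and the use of plain `Array[Double]` instead of Spark's `Vector` are assumptions made for illustration, and the absolute-difference variant is only an approximation of the shared behavior described above. Both methods assume non-empty arrays of equal length.

```scala
object HashDistanceSketch {
  // Absolute-difference distance: the smallest gap between corresponding
  // hash values. This is only meaningful when nearby bucket indices imply
  // nearby inputs, as with random-projection hashes.
  def absoluteHashDistance(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => math.abs(a - b) }.min

  // Binary distance: 0.0 if at least one pair of corresponding hash values
  // matches, otherwise 1.0. How far apart two mismatched values are is ignored.
  def binaryHashDistance(x: Array[Double], y: Array[Double]): Double =
    if (x.zip(y).exists { case (a, b) => a == b }) 0.0 else 1.0
}
```

With hash values `Array(3.0, 17.0)` and `Array(4.0, 90.0)`, the absolute variant reports a distance of 1.0 and would rank this pair ahead of one whose values differ by 247, even though neither pair shares a MinHash bucket. The binary variant reports 1.0 for both pairs and 0.0 only when some hash value actually matches.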
## How was this patch tested?
The change that introduced this bug also added an incorrect unit test, which this PR fixes.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Yunni/spark SPARK-18334-yunn-minhash-bug
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15800.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15800
----
commit 559c09904538012b70bcb3493b8bc287dd855b2d
Author: Yun Ni <[email protected]>
Date: 2016-11-07T21:30:32Z
[SPARK-18334] MinHash should use binary hash distance
----