GitHub user sethah commented on the issue:
https://github.com/apache/spark/pull/15800
I agree with @jkbradley's suggested approach. One key point here (for
MinHash):
If a query point vector q hashes to some MinHash vector, say [5.0, 22.0, 13.0],
the best candidates will be the ones that hash to that same vector - I think we
all agree. Now, if we wish to search for other candidates that are similar to q
but do not hash to exactly that hash vector, we should not think of searching
"nearby" buckets. A vector x1 which hashes to [5.0, 23.0, 13.0] _is no closer_
to q than a vector x2 which hashes to [5.0, 739.0, 13.0], though both are more
likely to be near neighbors of q than something which has zero bucket collisions.
Each individual hash value gives only a binary signal (it either matches the
query's value at that position or it does not), but looking at the entire
vector we can use the total number of matching positions as an aggregate
measure of closeness.
This is my understanding, and I think Joseph's suggestions are correct,
though I did not follow the second half of @Yunni's post...
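
To make the collision-count point concrete, here is a minimal sketch in plain Python. It is not the Spark implementation; the helper name `collision_count` and the vector `x3` are made up for illustration, and the other vectors are the ones from the example above.

```python
def collision_count(h1, h2):
    """Count the positions at which two MinHash vectors hold the same value.

    Hypothetical helper for illustration only; not part of Spark's API.
    """
    if len(h1) != len(h2):
        raise ValueError("hash vectors must have the same length")
    # Only exact equality matters: 23.0 is no "closer" to 22.0 than 739.0 is.
    return sum(1 for a, b in zip(h1, h2) if a == b)


q = [5.0, 22.0, 13.0]    # hash vector of the query point
x1 = [5.0, 23.0, 13.0]   # differs from q in the middle position
x2 = [5.0, 739.0, 13.0]  # also differs from q only in the middle position
x3 = [8.0, 41.0, 2.0]    # assumed example with no matching positions

for name, h in [("x1", x1), ("x2", x2), ("x3", x3)]:
    print(name, collision_count(q, h))
```

Under this measure x1 and x2 both score 2 and x3 scores 0, so x1 and x2 rank equally as candidates even though 23.0 is numerically closer to 22.0 than 739.0 is.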