Github user sethah commented on the issue:
https://github.com/apache/spark/pull/15148
Thanks for your clarifications. I still don't see where the algorithm used
in this patch comes from. Here is my summary of how the approach here is
different than the approach found on wikipedia and several papers, please let
me know if you don't agree with this:
**Approach found on
[Wikipedia](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search)
and [here](http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf) and
[here](https://people.csail.mit.edu/indyk/p117-andoni.pdf)**
1. Select `d` Gaussian unit vectors `g1, ..., gd`
2. construct a hash function `h(x) = (floor((g1 dot x) / w), ..., floor((gd
dot x) / w))` (`h(x)` is a `d` dimensional vector)
3. construct `L` hash functions (i.e. `L d-dimensional vectors`)
4. for every point `x_i` in the input, compute `L` hash functions `h1(x_i),
..., hL(x_i)`
5. for the query point `q`, compute the `L` hash functions `h1(q), ...,
hL(q)`
6. For each point in the input
For each hash function `l = 1 to L`:
`if (h_l(x_i) == h_l(q)) select as candidate`
7. compute true distances for each candidate
8. return the candidate if its true distance is less than some threshold `R`
**Approach in this PR**
1. Select `d` Gaussian unit vectors `g1, ..., gd`
2. construct a hash function `h(x) = (floor((g1 dot x) / w), ..., floor((gd
dot x) / w))` (`h(x)` is a `d` dimensional vector)
3. for every point in the input, compute a single hash function `h(x_i)`
4. compute a single hash function for the query point `h(q)`
5. for every point in the input, compute the minimum element of the
absolute distance between the hashes, i.e. `min(h(x_i) - h(q))`
6. take the subset of the `k` smallest "hash distances" from 5.
7. compute the true distances for the subset and return the subset sorted
on this column
Looking forward to hearing your thoughts, thanks!
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]