Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/15148
  
    Thanks for your clarifications. I still don't see where the algorithm used 
in this patch comes from. Here is my summary of how the approach here is 
different than the approach found on wikipedia and several papers, please let 
me know if you don't agree with this:
    
    **Approach found on 
[Wikipedia](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search)
 and [here](http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf) and 
[here](https://people.csail.mit.edu/indyk/p117-andoni.pdf)**
    
    1. Select `d` Gaussian unit vectors `g1, ..., gd`
    2. construct a hash function `h(x) = (floor((g1 dot x) / w), ..., floor((gd 
dot x) / w))` (`h(x)` is a `d` dimensional vector)
    3. construct `L` hash functions (i.e. `L d-dimensional vectors`)
    4. for every point `x_i` in the input, compute `L` hash functions `h1(x_i), 
..., hL(x_i)`
    5. for the query point `q`, compute the `L` hash functions `h1(q), ..., 
hL(q)`
    6. For each point in the input
        
        For each hash function `l = 1 to L`:
            
        `if (h_l(x_i) == h_l(q)) select as candidate`
    7. compute true distances for each candidate
    8. return the candidate if its true distance is less than some threshold `R`
    
    **Approach in this PR**
    
    1. Select `d` Gaussian unit vectors `g1, ..., gd`
    2. construct a hash function `h(x) = (floor((g1 dot x) / w), ..., floor((gd 
dot x) / w))` (`h(x)` is a `d` dimensional vector)
    3. for every point in the input, compute a single hash function `h(x_i)`
    4. compute a single hash function for the query point `h(q)`
    5. for every point in the input, compute the minimum element of the 
absolute distance between the hashes, i.e. `min(h(x_i) - h(q))`
    6. take the subset of the `k` smallest "hash distances" from 5.
    7. compute the true distances for the subset and return the subset sorted 
on this column
    
    Looking forward to hearing your thoughts, thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to