Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/15148
  
    Ok, I'm looking more closely at this algorithm versus the literature. I 
agree that there is a lot of inconsistent terminology which is probably leading 
to some of the confusion here. 
    
    Most or all of the LSH algorithms in the literature describe a process 
which applies a composition of AND and OR amplification. @karlhigley This is 
what the package spark-neighbors does as well, correct? AND amplification is 
applied by generating hash functions `g(x) = (h1(x), h2(x), ..., hd(x))` which 
are concatenations of several of the vanilla locality sensitive hashing 
functions. These algorithms only compare `g(x) == g(y)` for near-neighbor 
candidacy. Still, they then apply OR amplification by using `L` of these 
hashing functions and accepting a point as a candidate if any of the `g_i for i 
= 1 to L` hash functions fall into the same bucket as the query point. 
    
    In this patch we only apply OR amplification by generating a single `g(x) = 
(h1(x), h2(x), ..., hd(x))` and we consider candidates if any of the `h_i for i 
= 1 to d` match. For a `(p1, p2)` sensitive hashing family, this OR 
amplification transforms it into a `(1 - (1 - p1)^d, 1 - (1 - p2)^d)` family, 
where p1 is a "good" collision and p2 is a "bad" collision. Consider a (0.8, 
0.2) hash family where we apply OR amplification with a dimension `d = 10`. We 
will transform this into a `(0.99999989, 0.893)` sensitive family. Basically, 
we amplify the good and bad collisions. If instead we implement the composition 
of **AND then OR** amplification as in the literature, we transform a `(0.8, 
0.2)` sensitive family into a `(.8785, .0064)`. In this way, we amplify the 
"good" collision and dampen the "bad" collision probabilities. If this is 
correct, then I think the current implementation will end up selecting most of 
the points as candidates and may impact the runtime performance. [This r
 eference](http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf) sums it up 
nicely IMO. 
    
    I will look into testing this out more concretely.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to