Github user sethah commented on the issue:
https://github.com/apache/spark/pull/15148
Ok, I'm looking more closely at this algorithm versus the literature. I
agree that there is a lot of inconsistent terminology which is probably leading
to some of the confusion here.
Most or all of the LSH algorithms in the literature describe a process
which applies a composition of AND and OR amplification. @karlhigley This is
what the package spark-neighbors does as well, correct? AND amplification is
applied by generating hash functions `g(x) = (h1(x), h2(x), ..., hd(x))` which
are concatenations of several of the vanilla locality sensitive hashing
functions. These algorithms only compare `g(x) == g(y)` for near-neighbor
candidacy. Still, they then apply OR amplification by using `L` of these
hashing functions and accepting a point as a candidate if any of the `g_i for i
= 1 to L` hash functions fall into the same bucket as the query point.
In this patch we only apply OR amplification by generating a single `g(x) =
(h1(x), h2(x), ..., hd(x))` and we consider candidates if any of the `h_i for i
= 1 to d` match. For a `(p1, p2)` sensitive hashing family, this OR
amplification transforms it into a `(1 - (1 - p1)^d, 1 - (1 - p2)^d)` family,
where p1 is a "good" collision and p2 is a "bad" collision. Consider a (0.8,
0.2) hash family where we apply OR amplification with a dimension `d = 10`. We
will transform this into a `(0.99999989, 0.893)` sensitive family. Basically,
we amplify the good and bad collisions. If instead we implement the composition
of **AND then OR** amplification as in the literature, we transform a `(0.8,
0.2)` sensitive family into a `(.8785, .0064)`. In this way, we amplify the
"good" collision and dampen the "bad" collision probabilities. If this is
correct, then I think the current implementation will end up selecting most of
the points as candidates and may impact the runtime performance. [This r
eference](http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf) sums it up
nicely IMO.
I will look into testing this out more concretely.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]