Github user sethah commented on the issue:
https://github.com/apache/spark/pull/15148
So I'll try to summarize AND/OR amplification and how I think it fits into the current API. LSH relies on a single hash function `h(x)` that is (R, cR, p1, p2)-sensitive, which just means it satisfies the collision-probability properties LSH needs. In the 2-stable case, `h(x) = floor((x dot r) / w)`, which maps `Vector[Double] => Int`. Here p1 and p2 are the "good" and "bad" collision probabilities, respectively.

To decrease the probability of a bad collision we can use AND-amplification: create a new, compound hash function `g(x) = [h1(x), h2(x), ..., hd(x)]`, where each `h_i(x)` uses a different random vector `r`. Now we only consider two vectors x and y to collide if `g(x) == g(y)` (i.e. standard vector equality). This decreases both collision probabilities to `(p1^d, p2^d)`. For a hypothetical (0.8, 0.2)-sensitive family this gives roughly `(0.41, 0.0016)` for d = 4, making the false-positive rate very low, but also meaning we miss a lot of good candidates.

To mitigate this we can further apply OR-amplification: instead of a single compound hash function g(x), generate `L` compound functions
````
g1(x) = [h11(x), ..., h1d(x)]
g2(x) = [h21(x), ..., h2d(x)]
...
gL(x) = [hL1(x), ..., hLd(x)]
````
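To make this concrete, here's a minimal, self-contained sketch in plain Scala (not Spark's actual implementation; the object name `AmplifiedLSH`, the method names, and all parameter values are just for illustration) of storing an `L x d` table of random unit vectors and declaring a collision whenever any one of the L compound hashes matches exactly:

```scala
import scala.util.Random

object AmplifiedLSH {
  val w   = 4.0  // bucket width (assumed)
  val dim = 3    // input dimensionality (assumed)
  val L   = 4    // number of compound hash functions (OR-amplification)
  val d   = 4    // hashes per compound function (AND-amplification)

  val rng = new Random(42)

  // L x d matrix of random unit vectors, as described above.
  val randVectors: Array[Array[Array[Double]]] =
    Array.fill(L, d) {
      val v = Array.fill(dim)(rng.nextGaussian())
      val norm = math.sqrt(v.map(x => x * x).sum)
      v.map(_ / norm)
    }

  // 2-stable hash: h(x) = floor((x dot r) / w)
  def h(x: Array[Double], r: Array[Double]): Int =
    math.floor(x.zip(r).map { case (a, b) => a * b }.sum / w).toInt

  // g_i(x) = [h_i1(x), ..., h_id(x)] for i = 1..L
  def hashFunctions(x: Array[Double]): Array[Array[Int]] =
    randVectors.map(row => row.map(r => h(x, r)))

  // Candidate pair if ANY compound hash matches exactly (OR over L, AND over d).
  def collides(x: Array[Double], y: Array[Double]): Boolean =
    hashFunctions(x).zip(hashFunctions(y)).exists {
      case (gx, gy) => gx.sameElements(gy)
    }
}
```

This is also the shape the model state would take if we extend the API: an `L x d` matrix of random vectors rather than a flat d-dimensional array.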
Two vectors are then candidates if they collide under any of the L compound functions, which turns the original probabilities into `(1 - (1 - p1^d)^L, 1 - (1 - p2^d)^L)`; in our example `(0.8, 0.2) => (0.8785, 0.006)` for d = 4, L = 4.
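As a sanity check on those numbers (my own worked arithmetic, not anything from the PR), the amplified probability `1 - (1 - p^d)^L` can be computed directly:

```scala
// Amplified collision probability: AND over d hashes, OR over L compound
// functions, checked against the (0.8, 0.2) example with d = 4, L = 4.
def amplify(p: Double, d: Int, l: Int): Double =
  1.0 - math.pow(1.0 - math.pow(p, d), l)

val p1Amplified = amplify(0.8, 4, 4) // ≈ 0.8785
val p2Amplified = amplify(0.2, 4, 4) // ≈ 0.0064
```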
The current implementation is always equivalent to the `L = 1` case, and `outputDim` corresponds to `d`. My concern with the RandomProjection API right now is that if we extend it to support arbitrary `L`, then our models no longer store just a d-dimensional array of random vectors, but rather an `L x d` matrix of random vectors. And we would have `hashFunctions` instead of `hashFunction` (though this is still private). One question I have: why do we expose `randUnitVectors` at all? I feel it leaves us more room for future changes if we do not expose it, especially considering the points above. There may be some reason to expose it that I haven't thought of, though. What do we think about making it private?
I very much like the idea of changing `outputDim` to something related to OR-amplification. I think MinHash is implemented properly right now, but the `hashDistance` measure doesn't make sense, as already discussed. For the moment I'd like to focus on making sure we don't corner ourselves with the API, since internal algorithm details and documentation can always be changed later.