Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/15148
  
    So I'll try to summarize AND/OR amplification and how I think it fits
    into the current API. LSH relies on a single hash function `h(x)` that is
    (R, cR, p1, p2)-sensitive, which just means it satisfies the properties
    needed for LSH. In the 2-stable case, `h(x) = floor((x dot r) / w)`, which
    maps `Vector[Double] => Int`. Here p1 and p2 are the "good" and "bad"
    collision probabilities, respectively. To decrease the probability of a bad
    collision we can use AND-amplification: build a new, compound hash function
    `g(x) = [h1(x), h2(x), ..., hd(x)]`, where each `h_i(x)` uses a different
    random vector `r`. Now we only consider two vectors x and y to collide if
    `g(x) == g(y)` (i.e. standard vector equality). This drives both collision
    probabilities down to `(p1^d, p2^d)`; for a hypothetical
    (0.8, 0.2)-sensitive family with d = 4, that is `(0.41, 0.0016)`. The
    false-positive rate becomes very low, but we also miss a lot of good
    candidates. To mitigate this we can further apply OR-amplification by
    generating not one compound hash function g(x) but `L` compound functions
    
    ````
    g1(x) = [h11(x), ..., h1d(x)]
    g2(x) = [h21(x), ..., h2d(x)]
    ...
    gL(x) = [hL1(x), ..., hLd(x)]
    ````
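
    To make the amplified structure concrete, here is a minimal Scala sketch
    of the base hash and the AND/OR-amplified compound functions. The names
    and the plain-`Array` representation are mine for illustration only; this
    is not the PR's code:

    ````scala
    import scala.util.Random

    // Minimal sketch of 2-stable hashing with AND/OR amplification.
    object AmplifiedLsh {

      def dot(x: Array[Double], r: Array[Double]): Double =
        x.zip(r).map { case (a, b) => a * b }.sum

      // Base hash: h(x) = floor((x dot r) / w)
      def h(r: Array[Double], w: Double)(x: Array[Double]): Int =
        math.floor(dot(x, r) / w).toInt

      // AND-amplification: g(x) = [h1(x), ..., hd(x)]; a collision
      // requires the whole signature to match
      def g(rs: Array[Array[Double]], w: Double)(x: Array[Double]): Seq[Int] =
        rs.map(r => h(r, w)(x)).toSeq

      // OR-amplification: x and y are candidates if gi(x) == gi(y) for ANY i
      def isCandidatePair(gs: Seq[Array[Double] => Seq[Int]],
                          x: Array[Double], y: Array[Double]): Boolean =
        gs.exists(gi => gi(x) == gi(y))

      def main(args: Array[String]): Unit = {
        val rng = new Random(42)
        val dim = 3
        val d   = 4   // hashes per compound function (AND)
        val L   = 4   // number of compound functions (OR)
        val w   = 1.0
        // L x d matrix of random vectors, one row per compound function
        val gs = Seq.fill(L) {
          val rs = Array.fill(d)(Array.fill(dim)(rng.nextGaussian()))
          g(rs, w) _
        }
        val x = Array(1.0, 0.0, 0.5)
        val y = Array(1.0, 0.1, 0.5)
        println(isCandidatePair(gs, x, y))
      }
    }
    ````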
    
    Then the collision probabilities become `(1 - (1 - p1^d)^L,
    1 - (1 - p2^d)^L)`; in our example `(0.8, 0.2) => (0.8785, 0.006)` for
    L = 4, d = 4.
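
    The arithmetic checks out in a couple of lines (hypothetical helper, not
    part of the PR):

    ````scala
    // Amplified collision probability: p -> 1 - (1 - p^d)^L
    def amplified(p: Double, d: Int, L: Int): Double =
      1.0 - math.pow(1.0 - math.pow(p, d), L)

    amplified(0.8, 4, 4)  // 0.8785...
    amplified(0.2, 4, 4)  // 0.0063..., the ~0.006 above
    ````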
    
    The current implementation is always equivalent to the `L = 1` case, and
    `outputDim` corresponds to `d`. My concern with the RandomProjection API
    right now is that if we extend it to arbitrary `L`, the model no longer
    stores just a d-dimensional array of random vectors but an `L x d` matrix
    of random vectors, and we would have `hashFunctions` instead of
    `hashFunction` (though this is still private). One question I have is - why
    do we expose `randUnitVectors` at all? I feel not exposing it leaves us
    more room for changes in the future, especially considering the points I
    just made. There may be some reason to expose it that I haven't thought of,
    though. What do we think about making it private?
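
    For illustration, the shape change I'm worried about would look roughly
    like this (class and field names hypothetical, not the actual model code):

    ````scala
    import org.apache.spark.ml.linalg.Vector

    // Today (effectively L = 1): the model stores d random vectors,
    // which is what `randUnitVectors` currently exposes.
    class RandomProjectionModel(val randUnitVectors: Array[Vector])

    // With arbitrary L: an L x d matrix of random vectors, and the
    // (private) hashFunction becomes a family of L compound functions.
    class AmplifiedRandomProjectionModel(
        val randUnitVectors: Array[Array[Vector]])
    ````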
    
    I like the idea of changing `outputDim` to something related to
    OR-amplification a lot. I think MinHash is done properly right now, but the
    `hashDistance` measure doesn't make sense, as already discussed. Right now
    I'd like to focus on making sure we don't corner ourselves with the API,
    since internal algorithm details and documentation can always be changed
    later.

