GitHub user jkbradley commented on the issue:
https://github.com/apache/spark/pull/15148
It sounds like discussions are converging, but I want to confirm a few
things and make a few additions.
### Amplification
Is this agreed?
* Approx neighbors and similarity are doing OR-amplification when comparing
hash values, as described in the [Wikipedia
article](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Stable_distributions).
This is computing an amplified hash function *implicitly*.
* transform() is not doing amplification. It outputs the value of a
collection of hash functions, rather than aggregating them to do amplification.
* This is my main question: Is amplification ever done explicitly, and
when would you ever need that?
Adding combined AND and OR amplification in the future sounds good to me.
My main question right now is whether we need to adjust the API before the 2.1
release. I don't see a need to, but please comment if you see an issue with
the current API.
* One possibility: We could rename outputDim to something specific to
OR-amplification.
Terminology: For LSH, "dimensionality" = "number of hash functions" and is
relevant only for amplification. Do you agree? I have yet to see a hash
function used for LSH whose output is not a discrete set.
### Random Projection
I agree this should be renamed to something like "PStableHashing." My
apologies for not doing enough background research to disambiguate.
### MinHash
I think this is implemented correctly, according to the reference given in
the linked Wikipedia article.
* [This
reference](https://github.com/apache/spark/blob/8f0ea011a7294679ec4275b2fef349ef45b6eb81/mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala#L36)
to perfect hash functions may be misleading. I'd prefer to remove it.
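For reference, the standard MinHash construction the Wikipedia article describes uses random affine functions modulo a prime rather than perfect hash functions; a minimal sketch (constants and names are illustrative, not necessarily what the PR implements):

```scala
// Sketch of one standard MinHash family: h_{a,b}(S) = min_{e in S} ((a*e + b) mod p),
// where S is the set of non-zero indices of a binary feature vector.
object MinHashSketch {
  val prime = 2038074743L  // a large prime (illustrative choice)

  def minHash(a: Long, b: Long)(setIndices: Seq[Int]): Long =
    setIndices.map(e => (a * e + b) % prime).min
}
```

The family is indexed by randomly drawn `(a, b)`; no perfect hashing is required for the Jaccard-similarity guarantee to hold.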
### hashDistance
Rethinking this, I am unsure about what function we should use. Currently,
hashDistance is only used by approxNearestNeighbors. Since
approxNearestNeighbors sorts by hashDistance, using a soft measure might be
better than what we currently have:
* MinHash
* Currently: Uses OR-amplification for single probing, and something odd
for multiple probing
* Best option for approxNearestNeighbors: [this Wikipedia
section](https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions),
which is equivalent to OR-amplification when using single probing. I.e.,
replace [this line of
code](https://github.com/apache/spark/blob/8f0ea011a7294679ec4275b2fef349ef45b6eb81/mllib/src/main/scala/org/apache/spark/ml/feature/MinHash.scala#L79)
with: ```x.toDense.values.zip(y.toDense.values).count(pair => pair._1 ==
pair._2).toDouble / x.size```
* RandomProjection
* Currently: Uses OR-amplification for single probing, and something
reasonable for multiple probing
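As a standalone sketch of the MinHash proposal above (plain arrays instead of ml.linalg Vectors, names hypothetical): the many-hash-functions variant estimates Jaccard similarity as the fraction of hash functions on which two signatures agree.

```scala
// Sketch of the MinHash "variant with many hash functions" estimator:
// the fraction of hash functions on which the two signatures agree.
object HashDistanceSketch {
  def estimatedJaccard(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).count { case (a, b) => a == b }.toDouble / x.length
}
```

Since approxNearestNeighbors sorts ascending by hashDistance, one would presumably use `1.0 - estimatedJaccard(x, y)` as the soft distance, so closer pairs sort first.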
@Yunni What is the best resource you have for single vs multiple probing?
I'm wondering now if they are uncommon terms and should be renamed.