Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/15800
@jkbradley There are two reasons why I don't think averaging indicators is a good
`hashDistance` for the current implementation.
(1) Single-probe NN performance relies on OR-amplification; changing to
averaging indicators would increase the false negative rate and hurt the
accuracy of the results.
(2) Amplification is a construction method for any LSH (see Section 3.6.3 of
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf). I think it's a good
abstraction to treat the current implementation as OR-amplification and then
move to an AND/OR compound.
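To make the distinction concrete, here is a minimal sketch (my own illustration, not Spark code) of how the two amplification schemes from MMDS §3.6.3 change the collision probability, assuming a base hash family with collision probability p:

```scala
// Hypothetical helper for illustration only; names are my own.
object AmplificationSketch {
  // OR-amplification over r hash functions:
  // a candidate pair collides if ANY of the r hashes match.
  def orProb(p: Double, r: Int): Double =
    1.0 - math.pow(1.0 - p, r)

  // AND/OR compound (b bands of r rows each):
  // a pair collides if ANY band has ALL r rows matching.
  def andOrProb(p: Double, r: Int, b: Int): Double =
    1.0 - math.pow(1.0 - math.pow(p, r), b)
}
```

OR-amplification alone raises the collision probability for every pair (fewer false negatives, more false positives); the AND step inside each band then suppresses low-similarity pairs, which is why moving to the AND/OR compound is the natural next step.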
If we go with `Array[Vector]` as our output type, I think we need to
change `hashDistance(x: Array[Vector], y: Array[Vector])` to the following:
(1) `ScalarRandomProjectionLSH`: the minimum Euclidean distance between
corresponding hash vectors
(2) `MinHashLSH`: the minimum, over corresponding hash vectors, of the
averaged indicators
The current implementation is the special case where each Vector has size 1; in
other words, the minimum over hash values of whether the corresponding values
are equal.
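As a rough sketch of the two proposed `hashDistance` variants (my own illustration, not the PR's code; I use `Array[Array[Double]]` in place of `Array[Vector]` to keep the example self-contained):

```scala
// Hypothetical sketch of the two hashDistance variants described above.
object HashDistanceSketch {
  // ScalarRandomProjectionLSH: minimum Euclidean distance between
  // corresponding hash vectors.
  def euclideanHashDistance(x: Array[Array[Double]], y: Array[Array[Double]]): Double =
    x.zip(y).map { case (a, b) =>
      math.sqrt(a.zip(b).map { case (u, v) => (u - v) * (u - v) }.sum)
    }.min

  // MinHashLSH: minimum over corresponding hash vectors of the
  // averaged indicators (fraction of non-matching entries).
  def minHashDistance(x: Array[Array[Double]], y: Array[Array[Double]]): Double =
    x.zip(y).map { case (a, b) =>
      a.zip(b).count { case (u, v) => u != v }.toDouble / a.length
    }.min
}
```

With vectors of size 1 both reduce to the current behavior: the minimum over hash values of a 0/1 indicator of equality.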