Github user karlhigley commented on the issue:
https://github.com/apache/spark/pull/15148
@sethah: I think you're right that there's a discrepancy here, and I'm
embarrassed that I didn't see it when I first reviewed the PR. On a reread of
the source and your comment above, it looks like the LSH models in this PR use
a single hash function to compute a single hash table, which doesn't match my
understanding of OR-amplification. For OR-amplification, multiple hash
functions would be applied to compute multiple hash tables, and points placed
in the same bucket in any hash table would be considered candidate neighbors.
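To make the distinction concrete, here's a rough Python sketch of OR-amplification as I understand it (purely illustrative, not the Spark code): one table per hash function, and a pair becomes candidate neighbors if it collides in *any* table.

```python
from collections import defaultdict

def or_amplified_candidates(points, hash_fns):
    """OR-amplification sketch: build one hash table per hash function;
    two points are candidate neighbors if they share a bucket in ANY table."""
    tables = []
    for h in hash_fns:
        table = defaultdict(set)
        for i, p in enumerate(points):
            table[h(p)].add(i)
        tables.append(table)
    candidates = set()
    for table in tables:
        for bucket in table.values():
            for i in bucket:
                for j in bucket:
                    if i < j:
                        candidates.add((i, j))
    return candidates
```

The point is that adding hash functions here only ever *adds* candidate pairs (raising recall at the cost of more false positives), which is the opposite of what a single higher-dimensional hash value does.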
From the
[comments](https://github.com/apache/spark/pull/15148/files#diff-e3391977ca23d69ff7201c8bdcd88437R36),
it looks like the discrepancy might be due to some confusion between the
number of hash functions applied and the dimensionality of the hash functions.
This is a subtle point that I was confused about too, and it took me quite a
while to work it out because different authors use the term "hash function" to
refer to different things at different levels of abstraction. In one sense (at
a lower level), a random projection is made up of many component hash
functions, but in another sense (at a higher level) a random projection
represents a single hash function for the purposes of OR-amplification.
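To illustrate the two levels (again just a sketch, with made-up parameter names, in the style of p-stable LSH rather than the actual Spark implementation): each row of the projection matrix below is a "hash function" in the low-level sense, but the k components together form a single bucket key, i.e. one hash function in the OR-amplification sense.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, w = 5, 3, 4.0  # input dim, hash dimensionality, bucket width (illustrative values)

# Each row of A (with its offset in b) is a component hash function
# in the low-level sense; entries are drawn from a 2-stable (Gaussian) distribution.
A = rng.normal(size=(k, d))
b = rng.uniform(0, w, size=k)

def hash_value(x):
    # The k components are concatenated into ONE bucket key, so this whole
    # thing is a single hash function for OR-amplification purposes.
    return tuple(np.floor((A @ x + b) / w).astype(int))
```

Increasing k here makes buckets *smaller* (AND-like behavior, fewer collisions), whereas OR-amplification would mean several independent `hash_value`-style functions, each with its own table.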
Given that the PR has already been merged, I concur that the best way
forward is to adjust the comments and documentation. That probably involves
changing the references to OR-amplification to simply refer to the
dimensionality of the hash function.
On the other issue you mentioned regarding mismatches between what's
implemented and the linked documents, I think some of that confusion also stems
from inconsistent terminology in the source material. LSH based on p-stable
distributions (for Euclidean distance) does involve random projections,
although the authors don't directly say so in the paper. There's a somewhat
similar LSH method for cosine distance that's sometimes referred to as "sign
random projection" (though the authors of the paper don't use that term
either). Sign random projection is what the "Random Projection" section of the
Wikipedia page is referring to; what's implemented here looks like LSH based on
p-stable distributions. Maybe one way to clarify would be to name the models
after the distance measures they're intended to approximate, and provide
explanations of the methods they use in the comments?
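For comparison, sign random projection is just the sign of a dot product with a random hyperplane, which is why it approximates cosine (angular) distance rather than Euclidean. A minimal sketch (illustrative only, not Spark code):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
r = rng.normal(size=d)  # one random hyperplane normal

def sign_rp(x):
    # "Sign random projection" (a.k.a. SimHash): one bit per hyperplane.
    # The collision probability depends only on the angle between points,
    # so it approximates cosine distance, not Euclidean.
    return int(np.dot(r, x) >= 0)
```

Note the hash is scale-invariant (scaling x doesn't change the bit), which is exactly the property you want for cosine distance and exactly what the p-stable scheme above doesn't have.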