Github user karlhigley commented on the issue:
https://github.com/apache/spark/pull/15148
@sethah: I think you're right that there's a discrepancy here, and I'm
embarrassed that I didn't see it when I first reviewed the PR. On a reread of
the source and your comment above, it looks like the LSH models in this PR use
a single hash function to compute a single hash table, which doesn't match my
understanding of OR-amplification. For OR-amplification, multiple hash
functions would be applied to compute multiple hash tables, and points placed
in the same bucket in any hash table would be considered candidate neighbors.
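To make the distinction concrete, here's a rough Python sketch of OR-amplification as I understand it (purely illustrative, not the Spark code): one table per hash function, and a pair becomes candidate neighbors if it collides in *any* table.

```python
from collections import defaultdict

def or_amplified_candidates(points, hash_fns):
    """OR-amplification sketch: build one hash table per hash function;
    two points are candidate neighbors if they share a bucket in ANY table."""
    tables = []
    for h in hash_fns:
        table = defaultdict(set)
        for i, p in enumerate(points):
            table[h(p)].add(i)
        tables.append(table)
    candidates = set()
    for table in tables:
        for bucket in table.values():
            for i in bucket:
                for j in bucket:
                    if i < j:
                        candidates.add((i, j))
    return candidates
```

The point is that adding hash functions here only ever *adds* candidate pairs (raising recall at the cost of more false positives), which is the opposite of what a single higher-dimensional hash value does.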
From the
[comments](https://github.com/apache/spark/pull/15148/files#diff-e3391977ca23d69ff7201c8bdcd88437R36),
it looks like the discrepancy might be due to some confusion between the
number of hash functions applied and the dimensionality of the hash functions.
This is a subtle point that I was confused about too, and it took me quite a
while to work it out because different authors use the term "hash function" to
refer to different things at different levels of abstraction. In one sense (at
a lower level), a random projection is made up of many component hash
functions, but in another sense (at a higher level) a random projection
represents a single hash function for the purposes of OR-amplification.
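To illustrate the two levels (again just a sketch, with made-up parameter names, in the style of p-stable LSH rather than the actual Spark implementation): each row of the projection matrix below is a "hash function" in the low-level sense, but the k components together form a single bucket key, i.e. one hash function in the OR-amplification sense.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, w = 5, 3, 4.0  # input dim, hash dimensionality, bucket width (illustrative values)

# Each row of A (with its offset in b) is a component hash function
# in the low-level sense; entries are drawn from a 2-stable (Gaussian) distribution.
A = rng.normal(size=(k, d))
b = rng.uniform(0, w, size=k)

def hash_value(x):
    # The k components are concatenated into ONE bucket key, so this whole
    # thing is a single hash function for OR-amplification purposes.
    return tuple(np.floor((A @ x + b) / w).astype(int))
```

Increasing k here makes buckets *smaller* (AND-like behavior, fewer collisions), whereas OR-amplification would mean several independent `hash_value`-style functions, each with its own table.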
Given that the PR has already been merged, I concur that the best way
forward is to adjust the comments and documentation. That probably involves
changing the references to OR-amplification to simply refer to the
dimensionality of the hash function.
On the other issue you mentioned regarding mismatches between what's
implemented and the linked documents, I think some of that confusion also stems
from inconsistent terminology in the source material. LSH based on p-stable
distributions (for Euclidean distance) does involve random projections,
although the authors don't directly say so in the paper. There's a somewhat
similar LSH method for cosine distance that's sometimes referred to as "sign
random projection" (though the authors of the paper don't use that term
either). Sign random projection is what the "Random Projection" section of the
Wikipedia page is referring to; what's implemented here looks like LSH based on
p-stable distributions. Maybe one way to clarify would be to name the models
after the distance measures they're intended to approximate, and provide
explanations of the methods they use in the comments?
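For comparison, sign random projection is just the sign of a dot product with a random hyperplane, which is why it approximates cosine (angular) distance rather than Euclidean. A minimal sketch (illustrative only, not Spark code):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
r = rng.normal(size=d)  # one random hyperplane normal

def sign_rp(x):
    # "Sign random projection" (a.k.a. SimHash): one bit per hyperplane.
    # The collision probability depends only on the angle between points,
    # so it approximates cosine distance, not Euclidean.
    return int(np.dot(r, x) >= 0)
```

Note the hash is scale-invariant (scaling x doesn't change the bit), which is exactly the property you want for cosine distance and exactly what the p-stable scheme above doesn't have.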