Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/15800
@jkbradley There are two reasons why I don't think averaging indicators is a good
`hashDistance` for the current implementation.
(1) Single-probe NN performance relies on OR-amplification; changing to
averaging indicators would increase the false negative rate and hurt the
accuracy of the results.
(2) Amplification is a construction method for any LSH (see Section 3.6.3 of
http://infolab.stanford.edu/~ullman/mmds/ch3.pdf). I think it's a good
abstraction to treat the current implementation as OR-amplification and then
move to an AND/OR compound.
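To make the distinction concrete, here is a minimal sketch (my own illustration, not Spark code) of how the two amplification schemes from MMDS §3.6.3 change the collision probability, assuming a base hash family with collision probability p:

```scala
// Hypothetical helper for illustration only; names are my own.
object AmplificationSketch {
  // OR-amplification over r hash functions:
  // a candidate pair collides if ANY of the r hashes match.
  def orProb(p: Double, r: Int): Double =
    1.0 - math.pow(1.0 - p, r)

  // AND/OR compound (b bands of r rows each):
  // a pair collides if ANY band has ALL r rows matching.
  def andOrProb(p: Double, r: Int, b: Int): Double =
    1.0 - math.pow(1.0 - math.pow(p, r), b)
}
```

OR-amplification alone raises the collision probability for every pair (fewer false negatives, more false positives); the AND step inside each band then suppresses low-similarity pairs, which is why moving to the AND/OR compound is the natural next step.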
If we go with `Array[Vector]` as our output type, I think we need to
change `hashDistance(x: Array[Vector], y: Array[Vector])` to the following:
(1) `ScalarRandomProjectionLSH`: the minimum Euclidean distance between
corresponding hash vectors
(2) `MinHashLSH`: the minimum, over corresponding hash vectors, of the
averaged indicators
The current implementation is the special case where each Vector has size 1; in
other words, the minimum over hash values of whether the corresponding values
are equal.
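As a rough sketch of the two proposed `hashDistance` variants (my own illustration, not the PR's code; I use `Array[Array[Double]]` in place of `Array[Vector]` to keep the example self-contained):

```scala
// Hypothetical sketch of the two hashDistance variants described above.
object HashDistanceSketch {
  // ScalarRandomProjectionLSH: minimum Euclidean distance between
  // corresponding hash vectors.
  def euclideanHashDistance(x: Array[Array[Double]], y: Array[Array[Double]]): Double =
    x.zip(y).map { case (a, b) =>
      math.sqrt(a.zip(b).map { case (u, v) => (u - v) * (u - v) }.sum)
    }.min

  // MinHashLSH: minimum over corresponding hash vectors of the
  // averaged indicators (fraction of non-matching entries).
  def minHashDistance(x: Array[Array[Double]], y: Array[Array[Double]]): Double =
    x.zip(y).map { case (a, b) =>
      a.zip(b).count { case (u, v) => u != v }.toDouble / a.length
    }.min
}
```

With vectors of size 1 both reduce to the current behavior: the minimum over hash values of a 0/1 indicator of equality.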