[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

jkbradley Thu, 10 Nov 2016 13:44:52 -0800

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/15800
  
    > Yes, I mean we average K 0/1 indicators for each Vector, and get L 
averaged indicators. Do you agree with this part?
    
    I'm afraid I still disagree.  There's a fundamental difference between how 
we are computing nearest neighbors and how the LSH literature computes nearest 
neighbors:
    * LSH: This literature is all about data structures where you want to look 
in a set of buckets and nowhere else.  There is no sorting.
    * Us: We are sorting the entire dataset based on a distance estimate.
    
    Say we take LxK indicators and can either:
    * (a) average groups of K to get L distance estimates.  Then select the top 
M points for each estimate, i.e. L*M points total.
    * (b) average all to get 1 distance estimate.  Then select the top L*M 
points for each estimate.
    
    With this setup, we do the same amount of computation and work with the 
same amount of selected points.
    
    The question is then: Which of (a) or (b) will give higher precision and 
recall?
    
    I'll just give an intuitive argument.
    * One way to look at it is that (a) will contain many duplicates in the L 
sets of points, so (b) is more likely to have higher precision and recall.
    * Another is:
      * We are effectively trying to lay out all keys on a number line, where 
we are placing them at our estimate of the Jaccard similarity.  We then set a 
threshold and pick all keys to the right of that threshold.
      * Method (b) is putting each key at the estimate computed from L*K 
indicators.
      * Method (a) is putting each key at the estimate computed as max (over L 
estimates) of the estimate computed from K indicators.
      * Said this way, it is clear that (b) gives a better estimate.
      * Assuming we pick L*M candidate keys for each method, then (a) will have 
to use a much higher threshold, making it have a much higher false negative 
rate.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

Reply via email to