Github user Yunni commented on the issue:

https://github.com/apache/spark/pull/15800

> One way to look at it is that (a) will contain many duplicates in the L sets of points, so (b) is more likely to have higher precision and recall.

I think this might be where we are not on the same page. I consider the output of (a)/(b) to be our "probing sequence" (or "probing buckets"), and in the next step we pick and return the k nearest keys from those buckets. Do you agree with this part?

If you agree, then I claim that more duplicates (it is really redundancy rather than duplication) give us a better chance of finding the correct k nearest neighbors, because they enlarge our search range. If you disagree, then I think we are not discussing the same NN search implementation (one that differs from the current implementation), and I would like to know how you would return the k nearest neighbors after (b).
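To make the strategy I have in mind concrete, here is a minimal, hypothetical sketch (not Spark's actual API; the hash functions, table layout, and names are all illustrative assumptions): union the query's L buckets into one candidate set, the "probing sequence", then rank the candidates by true distance and return the k nearest.

```python
# Hypothetical sketch of the k-NN step described above, NOT Spark's
# implementation: union the L probed buckets, then return the k nearest
# candidates by exact distance. Redundancy across tables widens the
# search range, which is the point argued in the comment.
import random
from collections import defaultdict

random.seed(42)  # fixed seed so the illustration is reproducible

def make_hash_fn(dim, w=1.0):
    """One random-projection (p-stable) hash: floor((a . x + b) / w)."""
    a = [random.gauss(0, 1) for _ in range(dim)]
    b = random.uniform(0, w)
    return lambda x: int((sum(ai * xi for ai, xi in zip(a, x)) + b) // w)

def build_tables(points, hash_fns):
    """Build L hash tables mapping bucket id -> list of point indices."""
    tables = [defaultdict(list) for _ in hash_fns]
    for idx, p in enumerate(points):
        for table, h in zip(tables, hash_fns):
            table[h(p)].append(idx)
    return tables

def approx_knn(query, points, tables, hash_fns, k):
    # Union of the query's L buckets: duplicates across tables collapse
    # in the set, but probing L tables enlarges the candidate pool.
    candidates = set()
    for table, h in zip(tables, hash_fns):
        candidates.update(table[h(query)])
    # Rank the candidates by exact squared distance; keep the k nearest.
    dist = lambda i: sum((a - b) ** 2 for a, b in zip(points[i], query))
    return sorted(candidates, key=dist)[:k]
```

Under this reading, more redundancy in step (a)/(b) only grows the `candidates` set that the final top-k selection scans, so it can improve recall at the cost of scanning more points.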