[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

sethah Thu, 10 Nov 2016 08:39:05 -0800

Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/15800
  
    I agree with @jkbradley's suggested approach. One key point here (for 
MinHash):
    
    If a query point vector q hashes to some MinHash vector [5.0, 22.0, 13.0], 
the best candidates will be ones that hash to that same vector - I think we all 
agree. Now, if we wish to search for other candidates that are similar to q but 
do not hash to exactly that vector, we should not think of searching 
"nearby" buckets. A vector x1 which hashes to [5.0, 23.0, 13.0] _is no closer_ 
than a vector x2 which hashes to [5.0, 739.0, 13.0], though both are more 
likely to be near-neighbors than something which has zero bucket collisions. 
Each individual hash value gives only a binary signal (it either matches or it 
does not), but looking at the entire vector we can use the total number of 
individual collisions as an aggregate measure of closeness. 
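
    To make the point above concrete, here is a minimal Python sketch of a 
collision-count comparison. It is not Spark's API: the names `collision_count` 
and `binary_hash_distance` are hypothetical, normalizing by the vector length 
is my own assumption, and x3 is a made-up vector with zero collisions.

```python
def collision_count(h1, h2):
    """Count the positions at which two MinHash vectors hold the same value."""
    if len(h1) != len(h2) or len(h1) == 0:
        raise ValueError("hash vectors must be non-empty and of equal length")
    return sum(1 for a, b in zip(h1, h2) if a == b)


def binary_hash_distance(h1, h2):
    """Fraction of positions that differ: 0.0 = all collide, 1.0 = none do.

    Only equality of each entry matters; how far apart two unequal
    hash values are numerically is ignored on purpose.
    """
    return 1.0 - collision_count(h1, h2) / len(h1)


q = [5.0, 22.0, 13.0]     # query hash vector from the comment
x1 = [5.0, 23.0, 13.0]    # differs from q in one position, by 1
x2 = [5.0, 739.0, 13.0]   # differs from q in the same position, by 717
x3 = [8.0, 41.0, 2.0]     # hypothetical vector with zero collisions

d1 = binary_hash_distance(q, x1)
d2 = binary_hash_distance(q, x2)
d3 = binary_hash_distance(q, x3)
```

    With this measure x1 and x2 are equally distant from q (two collisions 
each), and both are closer than x3, which is the ordering described above.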
    
    This is my understanding, and I think Joseph's suggestions are correct. 
Though I did not follow the second half of @Yunni's post...



