Github user Yunni commented on the issue:

    https://github.com/apache/spark/pull/15800
  
    Hi @jkbradley,
    I agree with your claim on estimating Jaccard similarity, but looks like 
your `L` and `k` are having the same effect on the performance. Consider a case 
when we want to trade running time (#rows to sort) for more accurate k nearest 
neighbors. In this case,
    - Using the average of the indicators as probing sequence can minimize the 
rows to sort, but may have more false positive rate.
    - False negatives always have more negative impacts on accuracy than false 
positives.
    
    For MinHash, we can increase `L`(dimension of OR-amplification) and do the 
following:
    (1) For the searching key, we can check all `L` buckets that contains the 
key
    (2) If there are not enough buckets, we all buckets that are 1 steps away 
from buckets in (1)
    (3) If still not enough, search buckets that 2 steps away from buckets in 
(2)
    ......
    In each step, we are searching more than 1 buckets, which gives us more 
chance to include the exact k-NN in our search range.
    
    In general, I agree with your rough argument, but I want to add the 
following:
    - An efficient estimator of similarity is not the best probing sequence. A 
probing sequence with larger search range can help us get higher accuracy.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to