Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/15800
Hi @jkbradley,
I agree with your claim on estimating Jaccard similarity, but looks like
your `L` and `k` are having the same effect on the performance. Consider a case
when we want to trade running time (#rows to sort) for more accurate k nearest
neighbors. In this case,
- Using the average of the indicators as probing sequence can minimize the
rows to sort, but may have more false positive rate.
- False negatives always have more negative impacts on accuracy than false
positives.
For MinHash, we can increase `L`(dimension of OR-amplification) and do the
following:
(1) For the searching key, we can check all `L` buckets that contains the
key
(2) If there are not enough buckets, we all buckets that are 1 steps away
from buckets in (1)
(3) If still not enough, search buckets that 2 steps away from buckets in
(2)
......
In each step, we are searching more than 1 buckets, which gives us more
chance to include the exact k-NN in our search range.
In general, I agree with your rough argument, but I want to add the
following:
- An efficient estimator of similarity is not the best probing sequence. A
probing sequence with larger search range can help us get higher accuracy.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]