Github user Yunni commented on the issue:
https://github.com/apache/spark/pull/15800
@jkbradley I agree with your idea to get rid of the full sort and use
`approxQuantile` to find the threshold. A full sort over the whole dataset
hurts performance a lot. Please file a ticket for this.
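
To illustrate the idea of picking a threshold from an approximate quantile rather than a full sort, here is a minimal pure-Python sketch. It is a stand-in, not Spark's `approxQuantile` implementation: the function names, the sampling strategy, and the `sample_size` and `seed` parameters are all assumptions made for illustration.

```python
import random


def approx_quantile_threshold(distances, k, sample_size=1000, seed=0):
    """Estimate a distance threshold below which roughly k items fall.

    Only a sample of at most `sample_size` values is sorted, never the
    whole input. (Hypothetical helper; sampling is an assumed strategy.)
    """
    n = len(distances)
    if n == 0:
        raise ValueError("distances must not be empty")
    if k >= n:
        return max(distances)
    rng = random.Random(seed)
    if n <= sample_size:
        sample = sorted(distances)
    else:
        sample = sorted(rng.sample(distances, sample_size))
    # Position of the k/n quantile inside the sorted sample (integer math).
    idx = min(len(sample) - 1, (k * len(sample)) // n)
    return sample[idx]


def candidates_below(distances, threshold):
    """Return the indices whose distance is within the threshold (one pass)."""
    return [i for i, d in enumerate(distances) if d <= threshold]
```

The candidate set can be somewhat larger or smaller than `k` because the threshold is only an estimate; the exact ranking is left to a later step.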
> You're talking about enlarging search ranges, or iterations, a few times.
Enlarging the search range does not necessarily mean more iterations. With the
same threshold logic, (a) gives a larger search range than (b). Do you agree
with this?
> In both (a) and (b), you come up with some set of candidates. I was
> assuming we would compute keyDistance for those candidates and pick the top
> ones, just as in the current implementation.
Agree with this part.
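
As a rough sketch of that last step (compute keyDistance for the candidates and keep the top ones), assuming Jaccard distance as the key distance for MinHash. The function names and the dict-of-sets representation are hypothetical and are not taken from the Spark code:

```python
import heapq


def jaccard_distance(a, b):
    """Jaccard distance between two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def top_k_by_key_distance(candidates, key, k):
    """Return the ids of the k candidates nearest to `key`.

    `candidates` maps a candidate id to its set of elements. Only the
    candidates are ranked, not the whole dataset.
    """
    return heapq.nsmallest(
        k, candidates, key=lambda cid: jaccard_distance(candidates[cid], key)
    )
```

Whichever of (a) or (b) produces the candidate set, this re-ranking stays the same; only the size and quality of `candidates` changes.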
BTW, for one concrete example, you can run `approxNearestNeighbors for min
hash` in MinHashSuite.scala. Please set `singleProbe = false`.
- `hashDistance` in (a) gives precision/recall of (0.95, 0.95)
- `hashDistance` in (b) gives precision/recall of (0.6, 0.6)