Github user jkbradley commented on the issue:
https://github.com/apache/spark/pull/15874
Other comments:
**MinHash**
Looking yet again at this, I think it's using a technically incorrect hash
function. It is *not* a perfect hash function: it can hash two distinct input
indices to the same hash bucket. (As before, see the Wikipedia page on perfect
hashing for the second stage that this construction is missing.) If we want to
fix this, we could instead precompute a random permutation of the indices,
which serves as a perfect hash function.
That said, perhaps it does not matter in practice. If numEntries
(inputDim) is large enough, then the current hash function will probably behave
similarly to a perfect hash function.
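To make the distinction concrete, here is a small illustrative sketch (not Spark's actual implementation): a modular hash of the general form `h(x) = (a*x + b) mod m` can collide when its range is smaller than the domain, while a precomputed random permutation of the indices is collision-free by construction, i.e. a perfect hash. All names here are hypothetical.

```python
import random

def modular_hash(a, b, m):
    """A hash of the general form h(x) = (a*x + b) mod m."""
    return lambda x: (a * x + b) % m

def find_collision(h, domain):
    """Return a pair of distinct inputs with h(x1) == h(x2), or None."""
    seen = {}
    for x in domain:
        y = h(x)
        if y in seen:
            return seen[y], x
        seen[y] = x
    return None

input_dim = 100
domain = range(input_dim)

# A modular hash whose range is smaller than its domain must collide
# (pigeonhole), so it cannot be a perfect hash function on that domain.
h = modular_hash(a=7, b=3, m=50)
assert find_collision(h, domain) is not None

# Alternative: precompute a random permutation of the indices. It is a
# bijection on the domain (hence a perfect hash), and still gives a
# uniformly random ordering of indices for the min-hash computation.
perm = random.sample(range(input_dim), input_dim)
assert find_collision(lambda x: perm[x], domain) is None
```

The permutation costs O(inputDim) memory per hash function, which is the practical trade-off against the cheaper modular hash.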
**approxNearestNeighbors**
This is still not what I proposed, even for single-probe queries: it can
still consider (and sort) a candidate set much larger than
numNearestNeighbors. Since we're running out of time, I'm fine with leaving
it as is for now and changing the behavior in the next release. However,
could you please add a note to the method documentation that this method is
experimental and will likely change behavior in the next release?
Thanks!
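To illustrate the concern with a toy sketch (hypothetical names, not Spark's code): a single-probe query collects every point whose hash matches the query's, and that candidate set can be far larger than k, so sorting all of it costs O(n log n) instead of the O(n log k) achievable with a bounded selection.

```python
import heapq
import random

random.seed(0)
points = [random.random() for _ in range(10_000)]
bucket = lambda v: int(v * 10)   # toy 10-bucket "hash function"
query, k = 0.42, 5

# Single-probe candidate set: everything in the query's bucket.
# With 10 buckets and 10k points this is ~1000 candidates, far more
# than the k = 5 neighbors actually requested.
candidates = [p for p in points if bucket(p) == bucket(query)]
assert len(candidates) > k

# Roughly the current behavior: sort the entire candidate set.
by_full_sort = sorted(candidates, key=lambda p: abs(p - query))[:k]

# Bounded alternative: keep only the k best via a size-k heap.
by_heap = heapq.nsmallest(k, candidates, key=lambda p: abs(p - query))
assert by_full_sort == by_heap
```

A bounded top-k selection like this is one way the behavior could change in a later release without altering the returned results.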