GitHub user jkbradley commented on the issue:
https://github.com/apache/spark/pull/15800
@Yunni I guess we should remove it from the public API. I'm OK with
leaving the code there and making it private for now.
*One response:*
> Enlarging search ranges does not necessarily mean iterations. The same
> threshold logic for (a) gives a larger search range than for (b). Do you agree
> with this?
If you use the same threshold for both, then I agree. But that's not a
reasonable comparison since (a) will do many times more work and communicate
many times more data (up to L times more). This will happen when you do
posexplode.
If you compare the 2 where each selects the same number of rows (on which
to compute the keyDistance and select neighbors), then (b) will select many
more candidates since it will not have duplicates.
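To make the duplicate argument concrete, here is a rough sketch in plain Python rather than Spark. It assumes (a) ranks the exploded (row, hash position) entries and (b) ranks whole rows; the helper names, the toy data, and the per-row distance used for (b) are all hypothetical and not taken from the PR.

```python
def posexplode(rows):
    """Plain-Python stand-in for posexplode: one tuple per (row, hash position)."""
    return [(row_id, pos, h)
            for row_id, hashes in rows
            for pos, h in enumerate(hashes)]


def select_exploded(rows, query, n):
    """Assumed shape of (a): rank exploded entries by per-table distance, keep n."""
    ranked = sorted(posexplode(rows), key=lambda t: abs(t[2] - query[t[1]]))
    return [row_id for row_id, _, _ in ranked[:n]]


def select_rows(rows, query, n):
    """Assumed shape of (b): rank each row once by a single per-row distance."""
    ranked = sorted(
        rows,
        key=lambda r: min(abs(h - q) for h, q in zip(r[1], query)),
    )
    return [row_id for row_id, _ in ranked[:n]]


# Toy data: 4 rows, L = 3 hash values each (illustrative only).
rows = [("r1", [0, 0, 0]), ("r2", [5, 1, 9]), ("r3", [7, 8, 2]), ("r4", [9, 9, 9])]
query = [0, 0, 0]

exploded = posexplode(rows)              # L times as many entries as rows
a = select_exploded(rows, query, 3)      # the same row can fill several slots
b = select_rows(rows, query, 3)          # every slot is a different row
print(len(exploded), a, b)
```

With the same budget of 3 selected entries, the exploded variant spends every slot on one row in this toy case, while the per-row variant returns 3 distinct candidates.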
*Also, one new comment:*
I'm testing vs the current implementation (min(abs(query bucket - row
bucket))). Weirdly, the current one is getting consistently better results
than my proposal...even though this does not make sense to me statistically
(and even though the current implementation isn't what any of us are proposing
to use!). I'm still banging my head against this...
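For reference, the distance described above as min(abs(query bucket - row bucket)) can be sketched as follows. This is only an illustration, assuming each point has one integer bucket index per hash table; the function name is hypothetical and this is not the code in the PR.

```python
def min_bucket_distance(query_buckets, row_buckets):
    """Smallest absolute bucket difference across hash tables.

    Assumes both inputs hold one integer bucket index per hash table,
    in the same table order.
    """
    if len(query_buckets) != len(row_buckets):
        raise ValueError("query and row must use the same number of hash tables")
    return min(abs(q - r) for q, r in zip(query_buckets, row_buckets))


# A row that shares any one bucket with the query gets distance 0.
same_bucket = min_bucket_distance([3, 10, 7], [5, 10, 1])
# Otherwise the closest table decides the distance.
nearest_table = min_bucket_distance([3, 10, 7], [5, 14, 1])
print(same_bucket, nearest_table)
```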