GitHub user sethah commented on the issue:
https://github.com/apache/spark/pull/15800
@jkbradley Thanks for clarifying, I see your argument now. I agree that it
makes sense from a statistical perspective. Still, I have not seen a single
paper that describes exactly what we're proposing. I would be OK with
disabling the multi-probe option for the 2.1 release, so we can carry on
this discussion and continue hashing out (pun intended :) the APIs.
It is my understanding that the main benefit of multi-probe described in
the reference paper is to cut down the storage space required to maintain many
hash tables, but we are not actually storing the entire hash table as a data
structure, so our implementation is a bit different. I think there's room for
discussion and tests about what the benefits are and how much they affect
performance.
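
To make the storage argument concrete, here is a minimal Python sketch of the general idea: one stored table is probed in several neighboring buckets instead of building more tables. This is only an illustration under my own assumptions. It is not the code in this PR and not the probing sequence from the reference paper. The names `make_hash`, `build_table`, `probe_keys`, and `candidates` are hypothetical, and the probing rule (shift one hash coordinate by +/-1) is a deliberate simplification.

```python
import random
from collections import defaultdict


def make_hash(dim, k, width, seed):
    """Return a function mapping a vector to a tuple of k bucket indices.

    Each index is floor((a . v + b) / width) for a random Gaussian
    projection a and a random offset b (hypothetical helper).
    """
    rng = random.Random(seed)
    projections = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(k)]
    offsets = [rng.uniform(0, width) for _ in range(k)]

    def h(v):
        return tuple(
            int((sum(a * x for a, x in zip(p, v)) + b) // width)
            for p, b in zip(projections, offsets)
        )

    return h


def build_table(points, h):
    """Store point indices in a single hash table keyed by bucket."""
    table = defaultdict(list)
    for idx, p in enumerate(points):
        table[h(p)].append(idx)
    return table


def probe_keys(key):
    """Yield the query's bucket, then every bucket that differs from it
    by +1 or -1 in exactly one coordinate (simplified probing rule)."""
    yield key
    for i in range(len(key)):
        for delta in (-1, 1):
            yield key[:i] + (key[i] + delta,) + key[i + 1:]


def candidates(query, h, table, multi_probe):
    """Collect candidate indices from one bucket or from the probed set."""
    keys = probe_keys(h(query)) if multi_probe else [h(query)]
    found = set()
    for key in keys:
        found.update(table.get(key, ()))
    return found
```

With k hash functions per table, this rule inspects 1 + 2k buckets in the one stored table, so the candidate set can only grow relative to a single probe. Whether that trade-off carries over when the tables are not materialized is exactly the open question above.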