Github user ilganeli commented on the pull request:
https://github.com/apache/spark/pull/3723#issuecomment-67560274
Hi Sean - my concern with using take/collect() as in the previous
approach is that it imposes a hard cap on what is tractable, due to driver
memory limitations. I wanted to build an implementation that is independent of
memory, even if it is less efficient.
The sampling "over and over" will only happen a very small fraction of the
time (when we land in the tail of the statistical distribution used to do the
sampling). In general, this approach makes only a couple of passes over the
data: one to sample it and, if the approximate sampling returns too many
elements, one more at the end to pare the sample down to the exact number.
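The retry-then-pare logic described above can be sketched as follows. This is an illustrative Python sketch of the general technique, not Spark's actual Scala implementation: each element is kept with an independent Bernoulli trial whose probability is chosen to slightly oversample, the rare undersampled case triggers a retry, and a final step trims the result to the exact count. The function name, the `oversample` factor, and the in-memory list stand in for the distributed RDD operations and are assumptions for illustration only.

```python
import random

def take_sample_exact(data, k, oversample=1.2, seed=None):
    """Illustrative sketch: Bernoulli-sample with probability chosen so the
    expected sample size is oversample * k, retry in the rare case the sample
    comes up short, then pare the result down to exactly k elements."""
    rng = random.Random(seed)
    n = len(data)
    if k >= n:
        return list(data)
    # Oversample slightly so an undersampled draw is rare (binomial tail).
    p = min(1.0, (oversample * k) / n)
    while True:
        # One pass over the data: independent keep/drop decision per element,
        # so only the sampled elements ever need to be held at once.
        sample = [x for x in data if rng.random() < p]
        if len(sample) >= k:
            # Sampling is approximate; pare down to exactly k elements.
            rng.shuffle(sample)
            return sample[:k]
        # Too few elements drawn: retry (happens only a small fraction
        # of the time, in the tail of the distribution).
```

The retry loop terminates quickly in expectation because, with the oversampling factor above, a single pass already yields at least k elements with high probability.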
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]