Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/916#issuecomment-46059095
@colorant Thanks for taking a look at this!
First of all, let me just say that I ran Xiangrui's code but with
".fill(1000)" (so 100x the RDD size), and it was still able to select a sample
with exactly one data point in a single pass.
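For concreteness, here is a minimal sketch of that kind of experiment. Xiangrui's original snippet isn't reproduced in this thread, so the RDD construction, sizes, and `local[*]` setup below are assumptions; only the `takeSample(false, 1)` call is the behavior under discussion:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SampleOneInOnePass {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sample-one").setMaster("local[*]"))
    // Blow the data up 1000x via .fill(1000); the base size of 100000
    // elements is an assumption, not Xiangrui's exact figure.
    val rdd = sc.parallelize(0 until 100000).flatMap(i => Array.fill(1000)(i))
    // Ask for a sample of exactly one element. With the one-pass
    // implementation this should come back without any resampling retries.
    val sample = rdd.takeSample(withReplacement = false, num = 1, seed = 42L)
    println(sample.mkString(", "))
    sc.stop()
  }
}
```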
So there are a couple of things in play here. The smallest positive value
representable by a Double is 2^(-1074) ~ 5e-324, so until we run into RDDs of
size ~10^323, the sampling rate in theory won't underflow to 0. Then it comes
down to whether the random number generator is of high enough quality that it
isn't biased against very small numbers. The two experiments Xiangrui and I ran
suggest that the java.util.Random object is able to produce small enough random
numbers. However, we should definitely investigate the quality of the RNG
further to gauge sampling behavior at even smaller sampling rates.
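As a quick sanity check on both quantities, here's a small standalone snippet (plain Scala, no Spark needed). One relevant detail: `java.util.Random.nextDouble()` emits multiples of 2^(-53), so even though a Double can represent rates down to ~5e-324, the RNG's output can't distinguish rates below ~1.1e-16:

```scala
object RngResolutionCheck {
  def main(args: Array[String]): Unit = {
    // Smallest positive Double: 2^(-1074) ~ 5e-324.
    println(java.lang.Double.MIN_VALUE)  // 4.9E-324
    println(math.pow(2, -1074))          // 4.9E-324

    // Empirical hit count at a small sampling rate. 1e-6 sits far above
    // the 2^(-53) ~ 1.1e-16 granularity of nextDouble(), so the observed
    // count should track the expectation (~10 hits here).
    val rng = new java.util.Random(42L)
    val n = 10000000
    val rate = 1e-6
    val hits = (1 to n).count(_ => rng.nextDouble() < rate)
    println(s"observed $hits hits, expected ~${n * rate}")
  }
}
```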
One thing to note about this implementation: at higher sampling rates we
actually save memory, because we no longer need to cache as many candidate
samples as before in order to guarantee the exact sample size in one pass.
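To illustrate with rough numbers (the bound below is a simplified stand-in, not the exact formula in this PR): to get at least s samples in one pass with high probability, we sample at q = p + gamma with gamma on the order of sqrt(p/n), so the expected number of extra cached elements grows only like sqrt(s) and shrinks relative to the sample size as the rate goes up:

```scala
object OversampleOverhead {
  // q = p + numStdDev * sqrt(p / n) is an assumed stand-in bound for
  // collecting >= s samples in a single pass with high probability.
  def oversampleFraction(s: Long, n: Long, numStdDev: Double = 9.0): Double = {
    val p = s.toDouble / n
    math.min(1.0, p + numStdDev * math.sqrt(p / n))
  }

  def main(args: Array[String]): Unit = {
    val n = 100000000L  // RDD size: 1e8
    for (s <- Seq(10L, 10000L, 10000000L)) {
      val q = oversampleFraction(s, n)
      val extra = q * n - s  // expected elements cached beyond the requested s
      println(f"s=$s%9d  q=$q%.3e  extra cached ~ $extra%.0f " +
        f"(${100 * extra / s}%.2f%% of s)")
    }
  }
}
```

Running this, the relative overhead drops from roughly 280% of the sample size at s=10 to well under 1% at s=1e7, which is where the memory savings at higher rates come from.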