Github user dorx commented on the pull request:
https://github.com/apache/spark/pull/916#issuecomment-46059095
@colorant Thanks for taking a look at this!
First of all, let me just say that I ran Xiangrui's code but with
".fill(1000)" (so 100x the RDD size), and it was still able to select a sample
with exactly one data point in a single pass.
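For concreteness, here is a minimal sketch of that kind of experiment. Xiangrui's original snippet isn't reproduced in this thread, so the RDD construction, sizes, and `local[*]` setup below are assumptions; only the `takeSample(false, 1)` call is the behavior under discussion:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SampleOneInOnePass {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("sample-one").setMaster("local[*]"))
    // Blow the data up 1000x via .fill(1000); the base size of 100000
    // elements is an assumption, not Xiangrui's exact figure.
    val rdd = sc.parallelize(0 until 100000).flatMap(i => Array.fill(1000)(i))
    // Ask for a sample of exactly one element. With the one-pass
    // implementation this should come back without any resampling retries.
    val sample = rdd.takeSample(withReplacement = false, num = 1, seed = 42L)
    println(sample.mkString(", "))
    sc.stop()
  }
}
```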
So there are a couple of things in play here. The smallest positive value
representable by a Double is 2^(-1074) ~ 5e-324, so until we run into RDDs of
size ~10^323, the sampling rate in theory won't underflow to 0. Then it comes
down to whether the random number generator is of high enough quality that it
isn't biased against very small numbers. The two experiments Xiangrui and I ran
suggest that the java.util.Random object is able to produce small enough random
numbers. However, we should definitely investigate the quality of the RNG
further to gauge sampling behavior at even smaller sampling rates.
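As a quick sanity check on both quantities, here's a small standalone snippet (plain Scala, no Spark needed). One relevant detail: `java.util.Random.nextDouble()` emits multiples of 2^(-53), so even though a Double can represent rates down to ~5e-324, the RNG's output can't distinguish rates below ~1.1e-16:

```scala
object RngResolutionCheck {
  def main(args: Array[String]): Unit = {
    // Smallest positive Double: 2^(-1074) ~ 5e-324.
    println(java.lang.Double.MIN_VALUE)  // 4.9E-324
    println(math.pow(2, -1074))          // 4.9E-324

    // Empirical hit count at a small sampling rate. 1e-6 sits far above
    // the 2^(-53) ~ 1.1e-16 granularity of nextDouble(), so the observed
    // count should track the expectation (~10 hits here).
    val rng = new java.util.Random(42L)
    val n = 10000000
    val rate = 1e-6
    val hits = (1 to n).count(_ => rng.nextDouble() < rate)
    println(s"observed $hits hits, expected ~${n * rate}")
  }
}
```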
One thing to note about this implementation: at higher sampling rates we
actually save memory, because we no longer need to cache as many candidate
samples as before in order to guarantee the exact sample size in one pass.
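To illustrate with rough numbers (the bound below is a simplified stand-in, not the exact formula in this PR): to get at least s samples in one pass with high probability, we sample at q = p + gamma with gamma on the order of sqrt(p/n), so the expected number of extra cached elements grows only like sqrt(s) and shrinks relative to the sample size as the rate goes up:

```scala
object OversampleOverhead {
  // q = p + numStdDev * sqrt(p / n) is an assumed stand-in bound for
  // collecting >= s samples in a single pass with high probability.
  def oversampleFraction(s: Long, n: Long, numStdDev: Double = 9.0): Double = {
    val p = s.toDouble / n
    math.min(1.0, p + numStdDev * math.sqrt(p / n))
  }

  def main(args: Array[String]): Unit = {
    val n = 100000000L  // RDD size: 1e8
    for (s <- Seq(10L, 10000L, 10000000L)) {
      val q = oversampleFraction(s, n)
      val extra = q * n - s  // expected elements cached beyond the requested s
      println(f"s=$s%9d  q=$q%.3e  extra cached ~ $extra%.0f " +
        f"(${100 * extra / s}%.2f%% of s)")
    }
  }
}
```

Running this, the relative overhead drops from roughly 280% of the sample size at s=10 to well under 1% at s=1e7, which is where the memory savings at higher rates come from.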