Github user ilganeli commented on the pull request:
https://github.com/apache/spark/pull/3723#issuecomment-67560274
Hi Sean - my concern with using take/collect() as in the previous
approach is that it imposes a hard cap on what is tractable, due to driver
memory limitations. I wanted to build an implementation that is independent of
memory, even if it is less efficient.
The sampling "over and over" will only happen a very small fraction of the
time (when we land in the tail of the statistical distribution used to do the
sampling). In general, this approach makes only a couple of passes over the
data: one to sample it and, if the approximate sampling returns too many
elements, one more at the end to pare the sample down to the exact number.
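The retry-then-pare logic described above can be sketched as follows. This is an illustrative Python sketch of the general technique, not Spark's actual Scala implementation: each element is kept with an independent Bernoulli trial whose probability is chosen to slightly oversample, the rare undersampled case triggers a retry, and a final step trims the result to the exact count. The function name, the `oversample` factor, and the in-memory list stand in for the distributed RDD operations and are assumptions for illustration only.

```python
import random

def take_sample_exact(data, k, oversample=1.2, seed=None):
    """Illustrative sketch: Bernoulli-sample with probability chosen so the
    expected sample size is oversample * k, retry in the rare case the sample
    comes up short, then pare the result down to exactly k elements."""
    rng = random.Random(seed)
    n = len(data)
    if k >= n:
        return list(data)
    # Oversample slightly so an undersampled draw is rare (binomial tail).
    p = min(1.0, (oversample * k) / n)
    while True:
        # One pass over the data: independent keep/drop decision per element,
        # so only the sampled elements ever need to be held at once.
        sample = [x for x in data if rng.random() < p]
        if len(sample) >= k:
            # Sampling is approximate; pare down to exactly k elements.
            rng.shuffle(sample)
            return sample[:k]
        # Too few elements drawn: retry (happens only a small fraction
        # of the time, in the tail of the distribution).
```

The retry loop terminates quickly in expectation because, with the oversampling factor above, a single pass already yields at least k elements with high probability.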
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]