Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/3723#issuecomment-67559137
So the emphasis is on *RDD*, right? You can already sample to an *Array* on
the driver. You could make the same argument about several other methods:
`take(100)` can't be used to make an `RDD` either. I think the logic is that it's
for taking a smallish number of things. Likewise for sampling.
Put differently, how about just taking the `Array` and using
`parallelize()` to make an `RDD` again?
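A minimal sketch of that suggestion, assuming an existing `SparkContext` named `sc` and an `RDD[Int]` (the helper name `sampleToRdd` is made up for illustration):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sample a fixed number of elements to the driver as an Array,
// then re-distribute the result with parallelize() to get an RDD back.
def sampleToRdd(sc: SparkContext, rdd: RDD[Int], n: Int): RDD[Int] = {
  val sampled: Array[Int] = rdd.takeSample(withReplacement = false, num = n)
  sc.parallelize(sampled)
}
```

This only makes sense when the sample is small enough to fit on the driver, which is the same constraint `take()` already implies.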
If you really want a huge sample of a much larger `RDD`, you probably need
to sample by partition, using a different approach, to do it efficiently. Here
you're sampling over and over until you get enough.
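One way to sketch the per-partition alternative, assuming an `RDD[Int]` named `rdd`: Spark's existing `sample()` already performs an independent Bernoulli trial per element within each partition, so it takes a single pass and stays an `RDD`, at the cost of an approximate rather than exact sample size.

```scala
import org.apache.spark.rdd.RDD

// Keep each element with probability `fraction`, evaluated per partition
// in one pass; the result is an RDD of approximately fraction * count size.
def approxSample(rdd: RDD[Int], fraction: Double): RDD[Int] =
  rdd.sample(withReplacement = false, fraction = fraction)
```

That trades an exact count for a single distributed pass, instead of repeatedly sampling until enough elements accumulate.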
So I think this may not quite fit with how other similar API methods work,
for better or worse, and you may not need this method to do what you want
anyway.