Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/3723#issuecomment-67559137
So the emphasis is on *RDD*, right? You can already sample to an *Array* on
the driver. You could make the same argument about several other methods:
`take(100)` can't be used to make an `RDD` either. I think the logic is that it's
for taking a smallish number of things. Likewise for sampling.
Put differently, how about just taking the `Array` and using
`parallelize()` to make an `RDD` again?
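A minimal sketch of that suggestion, assuming an existing `SparkContext` named `sc` and an `RDD[Int]` (the helper name `sampleToRdd` is made up for illustration):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sample a fixed number of elements to the driver as an Array,
// then re-distribute the result with parallelize() to get an RDD back.
def sampleToRdd(sc: SparkContext, rdd: RDD[Int], n: Int): RDD[Int] = {
  val sampled: Array[Int] = rdd.takeSample(withReplacement = false, num = n)
  sc.parallelize(sampled)
}
```

This only makes sense when the sample is small enough to fit on the driver, which is the same constraint `take()` already implies.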
If you really want a huge sample of a much larger `RDD`, you probably need
to sample by partition, using a different approach, to do it efficiently. Here
you're sampling over and over until you get enough.
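One way to sketch the per-partition alternative, assuming an `RDD[Int]` named `rdd`: Spark's existing `sample()` already performs an independent Bernoulli trial per element within each partition, so it takes a single pass and stays an `RDD`, at the cost of an approximate rather than exact sample size.

```scala
import org.apache.spark.rdd.RDD

// Keep each element with probability `fraction`, evaluated per partition
// in one pass; the result is an RDD of approximately fraction * count size.
def approxSample(rdd: RDD[Int], fraction: Double): RDD[Int] =
  rdd.sample(withReplacement = false, fraction = fraction)
```

That trades an exact count for a single distributed pass, instead of repeatedly sampling until enough elements accumulate.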
So I think this may not quite fit with how other similar API methods work,
for better or worse, and you may not need this method to do what you want
anyway.