Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67417721
Hm, I might be missing your point, but if we just take every nth point, then
the number of points taken from each partition will already be correct to +/- 1.
You do get samples a bit too close together at each partition boundary.
Oversampling might help you space that out a little, but does it matter
much? In a partition of 101 elements, taking every 5th, I will take 1, 5, ... ,
95, 100, and then start with 102 in the next partition. Taking 99 instead of
100 is only marginally better.
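A minimal Python sketch of the per-partition every-nth sampling described above, using plain lists to stand in for two consecutive partitions (the function name and partition contents are illustrative, not from the PR):

```python
def every_nth(partition, n):
    """Systematic sampling: take every nth element of a partition,
    starting with the first. The sample count is correct to +/- 1."""
    return partition[::n]

# Two consecutive partitions, mirroring the example in the comment:
# 101 elements each, sampled every 5th.
part1 = list(range(1, 102))      # elements 1..101
part2 = list(range(102, 203))    # elements 102..202

s1 = every_nth(part1, 5)
s2 = every_nth(part2, 5)

# Each partition contributes 21 samples from 101 elements, but the last
# sample of part1 and the first sample of part2 are adjacent, so the
# spacing at the partition boundary is 1 instead of the usual 5.
print(len(s1), s1[-1], s2[0])
```

This illustrates the boundary effect being discussed: within each partition the spacing is exactly n, and only at partition boundaries do two samples land closer together.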