GitHub user srowen commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67152810
@jkbradley Yes, let's do `numBins`; I'm changing it now. And yes: say you have
100 elements in 10 partitions and want to sample down to 12. That means
sampling about every 100/12 ~= 8th element. But the simplistic per-partition
approach yields 20 elements, since each of the 10 partitions squashes its
elements 1-8 and 9-10 into 2 new elements. Ideally 9-10 would be grouped with
1-6 of the next partition, or something like that. But stitching that together
seems like more trouble than it's worth. Or am I being pessimistic/lazy? Or
maybe I misunderstand your idea of offsets into the partition.
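
To make the arithmetic concrete, here is a minimal plain-Python sketch of the two groupings. It does not use Spark and is not the code in this PR; the helper names `chunk_per_partition` and `chunk_globally` are hypothetical, and partitions are modeled as plain lists.

```python
def chunk_per_partition(partitions, group_size):
    """Group each partition on its own into runs of at most group_size elements."""
    groups = []
    for part in partitions:
        for i in range(0, len(part), group_size):
            groups.append(part[i:i + group_size])
    return groups


def chunk_globally(partitions, group_size):
    """Group as if partition boundaries did not exist (the 'stitched' ideal)."""
    flat = [x for part in partitions for x in part]
    return [flat[i:i + group_size] for i in range(0, len(flat), group_size)]


# 100 elements split evenly into 10 partitions of 10 elements each.
elements = list(range(1, 101))
partitions = [elements[i:i + 10] for i in range(0, 100, 10)]

# Target of 12 samples means grouping roughly every 100 // 12 = 8 elements.
group_size = len(elements) // 12

per_partition = chunk_per_partition(partitions, group_size)
stitched = chunk_globally(partitions, group_size)

print(len(per_partition))  # 20: each partition yields a group of 8 and a group of 2
print(len(stitched))       # 13: twelve groups of 8 plus one leftover group of 4
print(stitched[1])         # elements 9-10 of partition 1 joined with 1-6 of partition 2
```

The per-partition version overshoots the target (20 instead of about 12) because every partition leaves a short remainder group. The stitched version gets close to the target but needs information that crosses partition boundaries.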