Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-67467841
Yes, just talking about oversampling now. In 1, if you mean ceil(rdd.count
/ numBins) then yes, that's basically what I've got now. You won't get exactly
numBins back, yes. The spacing _will_ be even -- except at partition
boundaries. You can push around a few points there to amortize the uneven
spacing. I don't even think you need oversampling for that.
I'm suggesting 1 as well. I feel like I sound lazy, but this is a context
where approximation is entirely fine, since the purpose is, say, exporting
something you could plot in a picture, and because the error is relatively
modest in realistic use cases. It doesn't seem worth the complexity or
processing. Maybe I should just document the couple of caveats here?
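
To make the caveats concrete, here is a rough sketch of the per-partition grouping idea in plain Python. It is not the code in the PR: partitions are modeled as ordinary lists, and `downsample`, `partitions`, and `num_bins` are hypothetical names used only for illustration. Each partition is cut into chunks of `ceil(count / numBins)` points, and each chunk is represented by its last point plus the chunk size.

```python
import math


def downsample(partitions, num_bins):
    """Group each partition into chunks of ceil(total / num_bins) points.

    `partitions` is a list of lists standing in for an RDD's partitions
    (an assumption for this sketch; no Spark API is used here).
    Returns (last_point_in_chunk, chunk_size) pairs.
    """
    total = sum(len(part) for part in partitions)
    if total == 0:
        return []
    group = max(1, math.ceil(total / num_bins))
    bins = []
    for part in partitions:
        # Chunks never span partitions, so the final chunk of each
        # partition may be short. That is the uneven spacing at
        # partition boundaries.
        for start in range(0, len(part), group):
            chunk = part[start:start + group]
            bins.append((chunk[-1], len(chunk)))
    return bins


one_partition = downsample([list(range(1, 11))], 3)
two_partitions = downsample([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], 3)
print(one_partition)
print(two_partitions)
```

With ten points and `num_bins = 3`, a single partition yields three bins, while two partitions of five points each yield four, with a one-point chunk at the end of each partition. So the result can exceed numBins slightly, and the spacing is only uneven at the partition boundaries.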