GitHub user srowen commented on the pull request:

    https://github.com/apache/spark/pull/3702#issuecomment-67467841
  
    Yes, I'm just talking about oversampling now. In 1, if you mean
ceil(rdd.count / numBins), then yes, that's basically what I've got now. You
won't get exactly numBins back, true. The spacing _will_ be even, except at
partition boundaries. You could shift a few points around there to smooth out
the uneven spacing. I don't think you even need oversampling for that.
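
    To make the partition-boundary effect concrete, here is a minimal sketch
of the chunking described above. It is not the code in the pull request:
plain Python lists stand in for RDD partitions, and the names `downsample`,
`partitions`, and `num_bins` are hypothetical.

```python
import math

def downsample(partitions, num_bins):
    """Group each partition into chunks of ceil(total / num_bins) points.

    `partitions` is a list of lists standing in for RDD partitions
    (an assumption for illustration; no Spark API is used here).
    Returns one (first_point, chunk_size) pair per chunk.
    """
    total = sum(len(p) for p in partitions)
    if total == 0:
        return []
    grouping = math.ceil(total / num_bins)
    bins = []
    for part in partitions:
        # Chunking restarts in every partition, so the last chunk of a
        # partition can be shorter than `grouping`.
        for start in range(0, len(part), grouping):
            chunk = part[start:start + grouping]
            bins.append((chunk[0], len(chunk)))
    return bins

# 17 points split over three partitions, asking for 4 bins.
partitions = [list(range(0, 7)), list(range(7, 14)), list(range(14, 17))]
bins = downsample(partitions, 4)
print(bins)  # [(0, 5), (5, 2), (7, 5), (12, 2), (14, 3)]
```

    Here the chunk size is ceil(17 / 4) = 5, but the short chunk at the end of
each partition yields 5 bins instead of 4, with uneven spacing only at the
partition boundaries.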
    
    I'm suggesting 1 as well. I feel like I sound lazy, but this is a context
where approximation is entirely fine, since the purpose is, say, exporting
something you could plot, and because the error is relatively modest in
realistic use cases. It doesn't seem worth the extra complexity or
processing. Maybe I should just document the couple of caveats here?

