Github user erikerlandson commented on the pull request:
https://github.com/apache/spark/pull/2313#issuecomment-55529273
@mattf, one useful question would be: do the two scenarios generate equivalent
output distributions? The basic methodology would be to collect output in
both scenarios and run Kolmogorov-Smirnov tests to assess whether the sampling
is statistically equivalent.
I did this recently for testing my upcoming proposal for gap sampling:
https://gist.github.com/erikerlandson/05db1f15c8d623448ff6
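
For illustration only, here is a minimal sketch of that methodology in plain Python. It is not taken from the gist or from the PR. The names `ks_2samp_statistic`, `ks_critical_value`, and `bernoulli_sample` are hypothetical, and the two seeded samplers are only stand-ins for the "with numpy" and "without numpy" code paths. The critical value uses the standard large-sample approximation.

```python
import bisect
import math
import random


def ks_2samp_statistic(xs, ys):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    d = 0.0
    # The largest gap is always attained at one of the observed values.
    for v in xs + ys:
        fx = bisect.bisect_right(xs, v) / n
        fy = bisect.bisect_right(ys, v) / m
        d = max(d, abs(fx - fy))
    return d


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample approximation of the rejection threshold at level alpha."""
    c = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c * math.sqrt((n + m) / (n * m))


def bernoulli_sample(data, fraction, rng):
    """Hypothetical stand-in sampler: keep each element with probability `fraction`."""
    return [x for x in data if rng.random() < fraction]


# Collect output from the two scenarios (here: two independently seeded RNGs).
data = range(10000)
sample_a = bernoulli_sample(data, 0.1, random.Random(1))
sample_b = bernoulli_sample(data, 0.1, random.Random(2))

d = ks_2samp_statistic(sample_a, sample_b)
threshold = ks_critical_value(len(sample_a), len(sample_b))

# If d exceeds the threshold, the samples differ at the chosen significance level.
print("D = %.4f, threshold = %.4f, differ = %s" % (d, threshold, d > threshold))
```

If SciPy is available, `scipy.stats.ks_2samp(sample_a, sample_b)` computes the same statistic along with a p-value, which is usually more convenient than comparing against a hand-computed threshold.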
That doesn't cover the question of *exactly* reproducible results. I'm not
sure whether that would be feasible. In general, I only consider *exactly*
reproducible results relevant for things like unit testing
applications, so if that's important my answer would be "make sure your
environment is set up to either use numpy or not, consistently."