tl;dr: I think that our stream-compatibility policy is holding us back, and I think we can come up with a way forward with a new policy that will allow us to innovate without seriously compromising our reliability.
To recap, our current policy for numpy.random is that we guarantee that the stream of random numbers from any of the methods of a seeded `RandomState` does not change from version to version (at least on the same hardware, OS, compiler, etc.), except in the case where we are fixing correctness bugs. That is, for code like this:

    import numpy as np

    prng = np.random.RandomState(12345)
    x = prng.normal(10.0, 3.0, size=100)

`x` will be the exact same floats regardless of which version of numpy is installed.

There seems to be a lot of pent-up motivation to improve on the random number generation, in particular the distributions, that has been blocked by our policy. I think we've lost a few potential first-time contributors who have run up against this wall. We have been pondering ways to allow for adding new core PRNGs and improving the distribution methods while maintaining stream-compatibility for existing code. Kevin Sheppard, in particular, has been working hard to implement new core PRNGs with a common API.

https://github.com/bashtage/ng-numpy-randomstate

Kevin has also been working to implement the several proposals that have been made to select different versions of distribution implementations. In particular, one idea is to pass something to the RandomState constructor to select a specific version of distributions (or switch out the core PRNG). Note that to satisfy the policy, the simplest method of seeding a RandomState will always give you the oldest version: what we have now.

Kevin has recently come to the conclusion that it's not technically feasible to add the version-selection at all if we keep the stream-compatibility policy.

https://github.com/numpy/numpy/pull/10124#issuecomment-350876221

I would argue that our current policy isn't providing the value that it claims to. In the first place, there are substantial holes in the reproducibility of streams across platforms.
All of the integers (internal and external) are C `long`s, so integer overflows can cause variable streams if you use any of the rejection algorithms involving integers across Windows and Linux. Plain-old floating point arithmetic differences between platforms can cause similar issues (though rarer). Our policy of fixing bugs interferes with strict reproducibility. And our changes to non-random routines can interfere with the ability to reproduce the results of the whole software, independent of the PRNG stream. The multivariate normal implementation is even more vulnerable, as it uses `np.linalg` routines that may be affected by which LAPACK library numpy is built against, to say nothing of changes that we might make to them in the normal course of development.

At the time I established the policy (2008-9), there was significantly less tooling around for pinning versions of software. The PyPI/pip/setuptools ecosystem was in its infancy, VMs were slow, cumbersome beasts mostly used to run Windows programs unavailable on Linux, and containerization a la Docker was merely a dream. A lot of resources have been put into reproducible research since then that pins the whole stack from OS libraries on up. The need to have stream-compatibility across numpy versions for the purpose of reproducible research is much diminished.

I think that we can relax the strict stream-compatibility policy to allow innovation without giving up much practically-usable stability. Let's compare with Python's policy:

https://docs.python.org/3.6/library/random.html#notes-on-reproducibility

"""
Most of the random module’s algorithms and seeding functions are subject to change across Python versions, but two aspects are guaranteed not to change:

* If a new seeding method is added, then a backward compatible seeder will be offered.
* The generator’s random() method will continue to produce the same sequence when the compatible seeder is given the same seed.
"""

I propose that we adopt a similar policy.
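
For concreteness, here is a minimal sketch of the part of Python's guarantee quoted above, using only the standard library `random` module. The helper name `first_draws` is made up for illustration. Running it on one Python version only shows that a seeded instance is deterministic; the policy's actual promise is that the `random()` stream for a given seed also stays the same across versions, while the other methods are allowed to change.

```python
import random


def first_draws(seed, n=5):
    """Return the first n values of random() from a freshly seeded generator.

    Illustrative helper, not part of any library API.
    """
    rng = random.Random(seed)  # explicit instance, not the module-level global
    return [rng.random() for _ in range(n)]


# Same seed, same random() stream. Per the quoted policy, this is the method
# whose output is promised to stay stable; the other methods may change.
a = first_draws(12345)
b = first_draws(12345)
print(a == b)  # True
```

A numpy analogue of this policy would presumably pin down only a small set of basic methods in the same way and leave the rest free to improve.
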
This would immediately resolve many of the issues blocking innovation in the random distributions. Improvements to the distributions could be made at the same rhythm as normal features. No version-selection API would be required, as you select the version by installing the desired version of numpy. By default, everyone gets the latest, best versions of the sampling algorithms. Selecting a different core PRNG could be easily achieved as ng-numpy-randomstate does it, by instantiating different classes. The different incompatible ways to initialize different core PRNGs (with unique features like selectable streams and the like) are transparently handled: different classes have different constructors. There is no need to jam all options for all core PRNGs into a single constructor.

I would add a few more of the simpler distribution methods to the list that *is* guaranteed to remain stream-compatible, probably `randint()` and `bytes()` and maybe a couple of others. I would appreciate input on the matter.

The current API should remain available and working, but not necessarily with the same algorithms. That is, for code like the following:

    import numpy as np

    prng = np.random.RandomState(12345)
    x = prng.normal(10.0, 3.0, size=100)

`x` is still guaranteed to be 100 normal variates with the appropriate mean and standard deviation, but they might be computed by the ziggurat method from PCG-generated bytes (or whatever new default core PRNG we have).

As an alternative, we may also want to leave `np.random.RandomState` entirely fixed in place as deprecated legacy code that is never updated. This would allow current unit tests that depend on the stream-compatibility that we previously promised to still pass until their authors decide to update. Development would move to a different class hierarchy with new names.

I am personally not at all interested in preserving any stream compatibility for the `numpy.random.*` aliases or letting the user swap out the core PRNG for the global PRNG that underlies them.
`np.random.seed()` should be discouraged (if not outright deprecated) in favor of explicitly passing around instances.

In any case, we have a lot of different options to discuss if we decide to relax our stream-compatibility policy. At the moment, I'm not pushing for any particular changes to the code, just the policy, in order to enable a more wide-ranging field of options than we have been able to work with so far.

Thanks.

--
Robert Kern