On Sun, Jun 10, 2018 at 6:08 PM, Robert Kern <robert.k...@gmail.com> wrote:
> On Sun, Jun 10, 2018 at 5:27 PM Ralf Gommers <ralf.gomm...@gmail.com> > wrote: > > > > On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern <robert.k...@gmail.com> > wrote: > >> > >> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers <ralf.gomm...@gmail.com> > wrote: > >>> > >>> It may be worth having a look at test suites for scipy, statsmodels, > scikit-learn, etc. and estimate how much work this NEP causes those > projects. If the devs of those packages are forced to do large scale > migrations from RandomState to StableState, then why not instead keep > RandomState and just add a new API next to it? > >> > >> The problem is that we can't really have an ecosystem with two > different general purpose systems. > > > > Can't = prefer not to. > > I meant what I wrote. :-) > > > But yes, that's true. That's not what I was saying though. We want one > generic one, and one meant for unit testing only. You can achieve that in > two ways: > > 1. Change the current np.random API to new generic, and add a new > RandomStable for unit tests. > > 2. Add a new generic API, and document the current np.random API as > being meant for unit tests only, for other usage <new API> should be > preferred. > > > > (2) has a couple of pros: > > - you're not forcing almost every library and end user out there to > migrate their unit tests. > > But it has the cons that I talked about. RandomState *is* a fully > functional general purpose PRNG system. After all, that's its current use. > Documenting it as intended to be something else will not change that fact. > Documentation alone provides no real impetus to move to the new system > outside of the unit tests. And the community does need to move together to > the new system in their library code, or else we won't be able to combine > libraries together; these PRNG objects need to thread all the way through > between code from different authors if we are to write programs with a > controlled seed. The failure mode when people don't pay attention to the > documentation is that I can no longer write programs that compose these > libraries together. That's why I wrote "can't". It's not a mere preference > for not having two systems to maintain. It has binary Go/No Go implications > for building reproducible programs. > I strongly suspect you are right, but only because you're asserting "can't" so heavily. I have trouble formulating what would go wrong in case there's two PRNGs used in a single program. It's not described in the NEP, nor in the numpy.random docs (those don't even have any recommendations for best practices listed as far as I can tell - that needs fixing). All you explain in the NEP is that reproducible research isn't helped by the current stream-compat guarantee. So a bit of (probably incorrect) devil's advocate reasoning: - If there's no stream-compat guarantee, all a user can rely on is the properties of drawing from a seeded PRNG. - Any use of a PRNG in library code can also only rely on properties - So now whether in a user's program libraries draw from one or two seeded PRNGs doesn't matter for reproducibility, because those properties don't change. Also, if there is to be a multi-year transitioning to the new API, would there be two PRNG systems anyway during those years? > > - more design freedom for the new generic API. The current one is > clearly sub-optimal; in a new one you wouldn't have to expose all the > global state/functions that np.random exposes now. You could even restrict > it to a single class and put that in the main numpy namespace. > > I'm not sure why you are talking about the global state and np.random.* > convenience functions. What we do with those functions is out of scope for > this NEP and would be talked about it another NEP fully introducing the new > system. > To quote you from one of the first emails in this thread: " I deliberately left it out of this one as it may, depending on our choices, impinge upon the design of the new PRNG subsystem, which I declared out of scope for this NEP. I have ideas (besides the glib "Let them eat AttributeErrors!"), and now that I think more about it, that does seem like it might be in scope just like the discussion of freezing RandomState and StableRandom are. But I think I'd like to hold that thought a little bit and get a little more screaming^Wfeedback on the core proposal first. I'll return to this in a few days if not sooner. " So consider this some screaming^Wfeedback:) > > >> To properly use pseudorandom numbers, I need to instantiate a PRNG and > thread it through all of the code in my program: both the parts that I > write and the third party libraries that I don't write. > >> > >> Generating test data for unit tests is separable, though. That's why I > propose having a StableRandom built on the new architecture. Its purpose > would be well-documented, and in my proposal is limited in features such > that it will be less likely to be abused outside of that purpose. If you > make it fully-featured, it is more likely to be abused by building library > code around it. But even if it is so abused, because it is built on the new > architecture, at least I can thread the same core PRNG state through the > StableRandom distributions from the abusing library and use the better > distributions class elsewhere (randomgen names it "Generator"). Just > keeping RandomState around can't work like that because it doesn't have a > replaceable core PRNG. > >> > >> But that does suggest another alternative that we should explore: > >> > >> The new architecture separates the core uniform PRNG from the wide > variety of non-uniform probability distributions. That is, the core PRNG > state is encapsulated in a discrete object that can be shared between > instances of different distribution-providing classes. numpy.random should > provide two such distribution-providing classes. The main one (let us call > it ``Generator``, as it is called in the prototype) will follow the new > policy: distribution methods can break the stream in feature releases. > There will also be a secondary distributions class (let us call it > ``LegacyGenerator``) which contains distribution methods exactly as they > exist in the current ``RandomState`` implementation. When one combines > ``LegacyGenerator`` with the MT19937 core PRNG, it should reproduce the > exact same stream as ``RandomState`` for all distribution methods. The > ``LegacyGenerator`` methods will be forever frozen. > ``numpy.random.RandomState()`` will instantiate a ``LegacyGenerator`` with > the MT19937 core PRNG, and whatever tricks needed to make > ``isinstance(prng, RandomState)`` and unpickling work should be done. This > way of creating the ``LegacyGenerator`` by way of ``RandomState`` will be > deprecated, becoming progressively noisier over a number of release cycles, > in favor of explicitly instantiating ``LegacyGenerator``. > >> > >> ``LegacyGenerator`` CAN be used during this deprecation period in > library and application code until libraries and applications can migrate > to the new ``Generator``. Libraries and applications SHOULD migrate but > MUST NOT be forced to. ``LegacyGenerator`` CAN be used to generate test > data for unit tests where cross-release stability of the streams is > important. Test writers SHOULD consider ways to mitigate their reliance on > such stability and SHOULD limit their usage to distribution methods that > have fewer cross-platform stability risks. > > I would appreciate your consideration of this proposal. Does it address > your concerns? It addresses my concerns with keeping around a > fully-functional RandomState implementation. > My concerns are: 1. The amount of work caused by making libraries and end users migrate. 2. That this is a backwards compatibility break, which will cause problems for users who relied on the old guarantees (the arguments in the NEP that the old guarantees weren't 100% watertight don't mean that backcompat doesn't matter at all). As far as I can tell, this new proposal doesn't deal with those concerns directly. What it does seem to do is making transitioning a bit easier for users that were already using RandomState instances. Cheers, Ralf
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion