On Sun, Jun 10, 2018 at 10:36 PM, Robert Kern <robert.k...@gmail.com> wrote:
> On Sun, Jun 10, 2018 at 8:04 PM Ralf Gommers <ralf.gomm...@gmail.com> > wrote: > > > > On Sun, Jun 10, 2018 at 6:08 PM, Robert Kern <robert.k...@gmail.com> > wrote: > >> > >> On Sun, Jun 10, 2018 at 5:27 PM Ralf Gommers <ralf.gomm...@gmail.com> > wrote: > >> > > >> > On Mon, Jun 4, 2018 at 3:18 PM, Robert Kern <robert.k...@gmail.com> > wrote: > >> >> > >> >> On Sun, Jun 3, 2018 at 8:22 PM Ralf Gommers <ralf.gomm...@gmail.com> > wrote: > >> >>> > >> >>> It may be worth having a look at test suites for scipy, > statsmodels, scikit-learn, etc. and estimate how much work this NEP causes > those projects. If the devs of those packages are forced to do large scale > migrations from RandomState to StableState, then why not instead keep > RandomState and just add a new API next to it? > >> >> > >> >> The problem is that we can't really have an ecosystem with two > different general purpose systems. > >> > > >> > Can't = prefer not to. > >> > >> I meant what I wrote. :-) > >> > >> > But yes, that's true. That's not what I was saying though. We want > one generic one, and one meant for unit testing only. You can achieve that > in two ways: > >> > 1. Change the current np.random API to new generic, and add a new > RandomStable for unit tests. > >> > 2. Add a new generic API, and document the current np.random API as > being meant for unit tests only, for other usage <new API> should be > preferred. > >> > > >> > (2) has a couple of pros: > >> > - you're not forcing almost every library and end user out there to > migrate their unit tests. > >> > >> But it has the cons that I talked about. RandomState *is* a fully > functional general purpose PRNG system. After all, that's its current use. > Documenting it as intended to be something else will not change that fact. > Documentation alone provides no real impetus to move to the new system > outside of the unit tests. And the community does need to move together to > the new system in their library code, or else we won't be able to combine > libraries together; these PRNG objects need to thread all the way through > between code from different authors if we are to write programs with a > controlled seed. The failure mode when people don't pay attention to the > documentation is that I can no longer write programs that compose these > libraries together. That's why I wrote "can't". It's not a mere preference > for not having two systems to maintain. It has binary Go/No Go implications > for building reproducible programs. > > > > I strongly suspect you are right, but only because you're asserting > "can't" so heavily. I have trouble formulating what would go wrong in case > there's two PRNGs used in a single program. It's not described in the NEP, > nor in the numpy.random docs (those don't even have any recommendations for > best practices listed as far as I can tell - that needs fixing). All you > explain in the NEP is that reproducible research isn't helped by the > current stream-compat guarantee. So a bit of (probably incorrect) devil's > advocate reasoning: > > - If there's no stream-compat guarantee, all a user can rely on is the > properties of drawing from a seeded PRNG. > > - Any use of a PRNG in library code can also only rely on properties > > - So now whether in a user's program libraries draw from one or two > seeded PRNGs doesn't matter for reproducibility, because those properties > don't change. > > Correctly making a stochastic program reproducible while retaining good > statistical properties is difficult. People don't do it well in the best of > circumstances. The best way that we've found to manage that difficulty is > to instantiate a single stream and use it all throughout your code. Every > new stream requires the management of more seeds (unless if we use the > fancy new algorithms that have settable stream IDs, but by stipulation, we > don't have these in this case). And now I have to thread both of these > objects through my code, and pass the right object to each third-party > library. These third-party libraries don't know anything about this weird > 2-stream workaround that you are doing, so we now have libraries that can't > build on each other unless if they are using the same compatible API, even > if I can make workarounds to build a program that combines two libraries > side-to-side. > > So yeah, people "can" do this. "It's just a matter of code" as my boss > likes to say. But it's making an already-difficult task more difficult. > Okay, that makes more sense to me now. It would be really useful to document such best practices and rationales. Note that scipy.stats distributions allow passing in either a RandomState instance or an integer as seed (which will be used for seeding a new instance, not for np.random.seed) [1]. That seems like a fine design pattern as well, and passing on a seed that way is fairly easy and as good for reproducibility as passing in a single PRNG. [1] https://github.com/scipy/scipy/blob/master/scipy/stats/_distn_infrastructure.py#L612 > > Also, if there is to be a multi-year transitioning to the new API, would > there be two PRNG systems anyway during those years? > > Sure, but with a deadline and not-just-documentation to motivate > transitioning. > > But if we follow my alternative proposal, there'll be no need for > deprecation! You've convinced me to not deprecate RandomState. > That's not how I had read it, but great to hear that! I just want to change some of its internal implementation details, add a > less-stable set of distributions on the side, and a framework of core > uniform PRNGs that can be shared by both. > > >> > - more design freedom for the new generic API. The current one is > clearly sub-optimal; in a new one you wouldn't have to expose all the > global state/functions that np.random exposes now. You could even restrict > it to a single class and put that in the main numpy namespace. > >> > >> I'm not sure why you are talking about the global state and np.random.* > convenience functions. What we do with those functions is out of scope for > this NEP and would be talked about it another NEP fully introducing the new > system. > > > > To quote you from one of the first emails in this thread: " > > I deliberately left it out of this one as it may, depending on our > choices, impinge upon the design of the new PRNG subsystem, which I > declared out of scope for this NEP. I have ideas (besides the glib "Let > them eat AttributeErrors!"), and now that I think more about it, that does > seem like it might be in scope just like the discussion of freezing > RandomState and StableRandom are. But I think I'd like to hold that thought > a little bit and get a little more screaming^Wfeedback on the core proposal > first. I'll return to this in a few days if not sooner. > > " > > > > So consider this some screaming^Wfeedback:) > > Ahem. Yes, I just remembered I said that. :-) But still, there will be > lots of options about what to do with np.random.*, whatever proposal we go > with. It doesn't really impose constraints on the core proposals. > > >> >> To properly use pseudorandom numbers, I need to instantiate a PRNG > and thread it through all of the code in my program: both the parts that I > write and the third party libraries that I don't write. > >> >> > >> >> Generating test data for unit tests is separable, though. That's why > I propose having a StableRandom built on the new architecture. Its purpose > would be well-documented, and in my proposal is limited in features such > that it will be less likely to be abused outside of that purpose. If you > make it fully-featured, it is more likely to be abused by building library > code around it. But even if it is so abused, because it is built on the new > architecture, at least I can thread the same core PRNG state through the > StableRandom distributions from the abusing library and use the better > distributions class elsewhere (randomgen names it "Generator"). Just > keeping RandomState around can't work like that because it doesn't have a > replaceable core PRNG. > >> >> > >> >> But that does suggest another alternative that we should explore: > >> >> > >> >> The new architecture separates the core uniform PRNG from the wide > variety of non-uniform probability distributions. That is, the core PRNG > state is encapsulated in a discrete object that can be shared between > instances of different distribution-providing classes. numpy.random should > provide two such distribution-providing classes. The main one (let us call > it ``Generator``, as it is called in the prototype) will follow the new > policy: distribution methods can break the stream in feature releases. > There will also be a secondary distributions class (let us call it > ``LegacyGenerator``) which contains distribution methods exactly as they > exist in the current ``RandomState`` implementation. When one combines > ``LegacyGenerator`` with the MT19937 core PRNG, it should reproduce the > exact same stream as ``RandomState`` for all distribution methods. The > ``LegacyGenerator`` methods will be forever frozen. > ``numpy.random.RandomState()`` will instantiate a ``LegacyGenerator`` with > the MT19937 core PRNG, and whatever tricks needed to make > ``isinstance(prng, RandomState)`` and unpickling work should be done. This > way of creating the ``LegacyGenerator`` by way of ``RandomState`` will be > deprecated, becoming progressively noisier over a number of release cycles, > in favor of explicitly instantiating ``LegacyGenerator``. > >> >> > >> >> ``LegacyGenerator`` CAN be used during this deprecation period in > library and application code until libraries and applications can migrate > to the new ``Generator``. Libraries and applications SHOULD migrate but > MUST NOT be forced to. ``LegacyGenerator`` CAN be used to generate test > data for unit tests where cross-release stability of the streams is > important. Test writers SHOULD consider ways to mitigate their reliance on > such stability and SHOULD limit their usage to distribution methods that > have fewer cross-platform stability risks. > >> > >> I would appreciate your consideration of this proposal. Does it address > your concerns? It addresses my concerns with keeping around a > fully-functional RandomState implementation. > > > > My concerns are: > > 1. The amount of work caused by making libraries and end users migrate. > > 2. That this is a backwards compatibility break, which will cause > problems for users who relied on the old guarantees (the arguments in the > NEP that the old guarantees weren't 100% watertight don't mean that > backcompat doesn't matter at all). > > > > As far as I can tell, this new proposal doesn't deal with those concerns > directly. What it does seem to do is making transitioning a bit easier for > users that were already using RandomState instances. > > Let me drop the deprecation of the name RandomState. RandomState(int_seed) > will forever and always create a backwards- and stream-compatible object. > No one will have to migrate. > > How does that strike you? > Sounds good. Cheers, Ralf
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@python.org https://mail.python.org/mailman/listinfo/numpy-discussion