On 6/16/18 05:54, josef.p...@gmail.com wrote:
On Sat, Jun 16, 2018 at 3:59 AM, Robert Kern <robert.k...@gmail.com> wrote:
I have made a significant revision. In this version, downstream projects
like scikit-learn should experience significantly less forced churn.

https://github.com/rkern/numpy/blob/nep/rng-clarification/doc/neps/nep-0019-rng-policy.rst

https://mail.python.org/pipermail/numpy-discussion/2018-June/078252.html

tl;dr RandomState lives! But its distributions are forever frozen. So maybe
"undead" is more apt. Anyways, RandomState will continue to provide the same
stream-compatibility that it always has. But it will be internally
refactored to use the same core uniform PRNG objects that the new
RandomGenerator distributions class will use underneath (defaulting to the
current Mersenne Twister, of course). The distribution methods on
RandomGenerator will be allowed to evolve with numpy versions and get
better/faster implementations.

Your code can mix the usage of RandomState and RandomGenerator as needed,
but they can be made to share the same underlying RNG algorithm's state.


Sounds good to me, and I think handles all our concerns.

I also think that the issues behind the np.random.* section about the
global state and seed can be revisited for possible deprecation of
convenience features.

One clarifying question, mainly to see IIUC

in this quote
"""
Calling numpy.random.seed() thereafter SHOULD just pass the given seed
to the current basic RNG object and not attempt to reset the basic RNG
to the Mersenne Twister. The global RandomState instance MUST be
accessible by the name numpy.random.mtrand._rand
"""

"the current basic RNG object" refers to the global object. AFAIU, it
is possible to change it numpy.random.mtrand._rand. Is it?

numpy.random.mtrand._rand would not be a basic RNG object; it would be (as it is now) a RandomState instance. "the current basic RNG object" would be the basic RNG that that global RandomState instance is currently using.

It is not possible (now or in the glorious NEP future) to assign a new instance to numpy.random.mtrand._rand. All of the numpy.random.* functions are actually just simple aliases to the methods on that object when the module is first built. Re-assigning _rand wouldn't reassign those aliases. numpy.random.standard_normal(), for instance, would still be the .standard_normal() method on the RandomState instance that _rand initially pointed to.

Currently and under the NEP, the only way to modify numpy.random.mtrand._rand is to call its methods (i.e. the numpy.random.* convenience functions) to modify its internal state. That's not changing.

The only thing that will change will be that there will be a new numpy.random.* function to call that will let you give the global RandomState a new basic RNG object that it will swap in internally. Let's call it np.random.swap_global_basic_rng(). If you don't use that function, you won't have a problem. I intend to make this new function *very* explicit about what it is doing, and document the crap out of it so it won't be misused like np.random.seed() is.

I never tried that so I didn't know we can change the global
RandomState, and thought we will have to replace np.random.seed usage > with a 
specific RandomState(seed) instance

I did a quick review of np.random.seed() usage in statsmodels, and I think you are mostly fine. It looks like you mostly use it in unit tests and at the top of examples. The only possible problem that I can see that you might have with the swap_global_basic_rng() is if some other package that you rely on calls it in its library code. Then subsequent statsmodels unit tests might fail because when they call np.random.seed(), it won't be reseeding a Mersenne Twister but another basic RNG.

However, I intend to make that a weird and unnatural thing to do. It's already unlikely to happen as it's a niche requirement that one mostly would need at the start of a whole program, not buried down inside library code. But we will also document that function to discourage such usage, and probably have unconditional noisy warnings that users would have to explicitly silence.

If one of your dependencies did that, you'd be well within your rights to tell them that they are misusing numpy and causing breakage in statsmodels.

In loose analogy:

Matplotlib has a "global" current figure and axis, gca, gcf.
In statsmodels we avoid any access to and usage of it and only work
with individual figure/axis instances that can be provided by the
user. (except for maybe some documentation examples and maybe some
"legacy" code.)
( 
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/graphics/utils.py#L48
)

AFAICS, essentially, statsmodels will need a similar policy for
RandomState/RandomGenerator and give up the usage of the global random
instance.

I mean, you certainly *should* (outside of unit tests) for very similar reasons why you avoid the global state in matplotlib, but this NEP won't force you to. You should do so anyways under the status quo, too. For any of your functions that call np.random.* functions internally, it's hard to use them in threaded applications, for instance, because it is relying on that global state.

scikit-learn's check_random_state() is a good pattern to follow.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn

Reply via email to