On Tue, May 17, 2016 at 9:09 AM, Stephan Hoyer <sho...@gmail.com> wrote:
>
> On Tue, May 17, 2016 at 12:18 AM, Robert Kern <robert.k...@gmail.com> wrote:
>>
>> On Tue, May 17, 2016 at 4:54 AM, Stephan Hoyer <sho...@gmail.com> wrote:
>> > 1. When writing a library of stochastic functions that take a seed as an input argument, and some of these functions call multiple other such stochastic functions. Dask is one such example [1].
>>
>> Can you clarify the use case here? I don't really know what you are doing here, but I'm pretty sure this is not the right approach.
>
> Here's a contrived example. Suppose I've written a simulator for cars that consists of a number of loosely connected components (e.g., an engine, brakes, etc.). The behavior of each component of our simulator is stochastic, but we want everything to be fully reproducible, so we need to use seeds or RandomState objects.
>
> We might write our simulate_car function like the following:
>
> def simulate_car(engine_config, brakes_config, seed=None):
>     rs = np.random.RandomState(seed)
>     engine = simulate_engine(engine_config, seed=rs.random_seed())
>     brakes = simulate_brakes(brakes_config, seed=rs.random_seed())
>     ...
>
> The problem with passing the same RandomState object (either explicitly, or dropping the seed argument entirely and using the global state) to both simulate_engine and simulate_brakes is that it breaks encapsulation -- if I change what I do inside simulate_engine, it also affects the brakes.
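[Editor's note: a runnable sketch of the pattern Stephan describes. `rs.random_seed()` is not an actual RandomState method; `rs.randint(2**32, dtype=np.uint32)` is used here as a stand-in for drawing each child seed from the parent stream, and the component simulators are hypothetical placeholders.]

```python
import numpy as np

def simulate_engine(config, seed=None):
    # Placeholder stochastic component: its internal draws are isolated
    # behind its own RandomState.
    rs = np.random.RandomState(seed)
    return rs.normal(loc=config["power"], scale=1.0)

def simulate_brakes(config, seed=None):
    rs = np.random.RandomState(seed)
    return rs.normal(loc=config["friction"], scale=0.1)

def simulate_car(engine_config, brakes_config, seed=None):
    rs = np.random.RandomState(seed)
    # Each component gets its own seed drawn from the parent stream,
    # so changing the internals of one component cannot perturb the other.
    engine = simulate_engine(engine_config, seed=rs.randint(2**32, dtype=np.uint32))
    brakes = simulate_brakes(brakes_config, seed=rs.randint(2**32, dtype=np.uint32))
    return engine, brakes

# Same top-level seed -> fully reproducible results:
a = simulate_car({"power": 100.0}, {"friction": 0.8}, seed=42)
b = simulate_car({"power": 100.0}, {"friction": 0.8}, seed=42)
assert a == b
```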
That's a little too contrived, IMO. In most such simulations, the different components interact with each other in the normal course of the simulation; that's why they are joined together in the same simulation instead of being run as two separate simulations. Unless the components are being run across a process or thread boundary (a la dask below), where true nondeterminism comes into play, I don't think you want these semi-independent streams. This seems to be the advice du jour from the agent-based modeling community.

> The dask use case is actually pretty different -- the intent is to create many random numbers in parallel using multiple threads or processes (possibly in a distributed fashion). I know that skipping ahead is the standard way to get independent number streams for parallel sampling, but that isn't exposed in numpy.random, and setting distinct seeds seems like a reasonable alternative for scientific computing use cases.

Forget about integer seeds. Those are for human convenience. If you're not jotting them down in your lab notebook in pen, you don't want an integer seed. What you want is a function that returns many RandomState objects that are hopefully spread around the MT19937 space enough that they are essentially independent (in the absence of true jumpahead). The better implementation of such a function would look something like this:

def spread_out_prngs(n, root_prng=None):
    if root_prng is None:
        root_prng = np.random
    elif not isinstance(root_prng, np.random.RandomState):
        root_prng = np.random.RandomState(root_prng)
    sprouted_prngs = []
    for i in range(n):
        seed_array = root_prng.randint(1 << 32, size=624)  # dtype=np.uint32 under 1.11
        sprouted_prngs.append(np.random.RandomState(seed_array))
    return sprouted_prngs

Internally, this generates seed arrays of about the size of the MT19937 state to make sure that you access more of the state space. That will at least make the chance of collision tiny.
And it can be easily rewritten to take advantage of one of the newer PRNGs that have true independent streams:

https://github.com/bashtage/ng-numpy-randomstate

--
Robert Kern
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion