On Thu, Aug 26, 2021 at 2:22 PM Stig Korsnes <stigkors...@gmail.com> wrote:
> Hi,
>
> Is there a way to uniquely spawn child seeds? I'm doing Monte Carlo
> analysis, where I have n random processes, each with their own
> generator. All process models instantiate a generator with
> default_rng(), i.e. ss = SeedSequence(); cs = ss.spawn(n), using cs[i]
> for process i. Now, the problem I'm facing is that the results of an
> individual process depend on the order of process initialization and
> on the number of processes used. However, if I could spawn children
> with a unique identifier, I would be able to reproduce my individual
> results without having to pickle/log states. For example, all my
> models have an id (tuple) field which is hashable. If I had the
> ability to do SeedSequence(x).spawn([objects]), where the objects
> support hash(object), I would have reproducibility for all my
> processes. I could do without the spawning, but then I would probably
> lose independence when I do multiproc? Is there a way to achieve my
> goal in the current version 1.21 of numpy?

I would probably not rely on `hash()`, as it is only intended to be
pretty good at getting distinct values from distinct inputs. If you can
combine the tuple objects into a string of bytes in a reliable,
collision-free way and use one of the cryptographic hashes to get them
down to a 128-bit number, that'd be ideal. `int(joblib.hash(key), 16)`
should do nicely:

  https://joblib.readthedocs.io/en/latest/generated/joblib.hash.html

You can combine that with your main process's seed easily. SeedSequence
can take arbitrary amounts of integer data and smoosh them all
together. The spawning functionality builds off of that, but you can
also just manually pass in lists of integers.

Let's call that hashing function `stronghash()`. Let's call your main
process seed number `seed` (this is the thing that the user can set on
the command line, or that you get from `secrets.randbits(128)` if you
need a fresh one). Let's call the unique tuple `key`. You can build the
`SeedSequence` for each job according to the `key` like so:

    root_ss = SeedSequence(seed)
    for key, data in jobs:
        child_ss = SeedSequence([stronghash(key), seed])
        submit_job(key, data, seed=child_ss)

Now each job will get its own unique stream regardless of the order in
which the jobs are assigned. When the user reruns the program with the
same root `seed`, they will get the same results. When the user chooses
a different `seed`, they will get another set of results (this is why
you don't want to just use `SeedSequence(stronghash(key))` all by
itself).

I put the job-specific seed data ahead of the main program's seed to be
on the super-safe side. The spawning mechanism will append integers to
the end, so there's a super-tiny chance, somewhere down a long line of
`root_ss.spawn()`s, that there would be a collision (and I mean
super-extra-tiny). But best practices cost nothing.

I hope that helps and is not too confusing!

--
Robert Kern
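For concreteness, here is a minimal sketch of the `stronghash()` helper
described above, assuming the job keys are tuples of strings and ints.
The key value and fixed seed below are purely illustrative, and repr()
stands in for whatever reliable, collision-free byte serialization you
prefer (e.g. `int(joblib.hash(key), 16)` from the reply):

    import hashlib

    from numpy.random import SeedSequence, default_rng

    def stronghash(key):
        # Reduce the key tuple to a 128-bit integer with a cryptographic
        # hash. repr() is a simple, deterministic byte encoding for
        # tuples of strings/ints; any collision-free serialization
        # would work just as well.
        digest = hashlib.blake2b(repr(key).encode("utf-8"), digest_size=16)
        return int.from_bytes(digest.digest(), "big")

    seed = 12345  # user-supplied, or secrets.randbits(128) for a fresh one
    key = ("model", 3)  # hypothetical per-process id tuple

    # The same (seed, key) pair always yields the same stream, no matter
    # how many processes run or in what order they are initialized.
    rng = default_rng(SeedSequence([stronghash(key), seed]))
    print(rng.standard_normal(3))

Rerunning with the same `seed` and `key` reproduces the same draws;
changing either one gives an independent stream.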