Thank you, Robert!
This scheme fits perfectly into what I'm trying to accomplish! :) The
"smooshing" of ints by supplying a list of ints had eluded me. Thank you
also for the pointer about the built-in hash(). I would not be able to rely
on it anyway, because it can return negative ints, whereas SeedSequence
requires non-negative ones. If you have a minute to spare: could you briefly
explain "int(joblib.hash(key), 16)"
<https://joblib.readthedocs.io/en/latest/generated/joblib.hash.html>,
and would this always return non-negative integers?
Thanks again!

On Thu, 26 Aug 2021 at 22:59, Robert Kern <robert.k...@gmail.com> wrote:

> On Thu, Aug 26, 2021 at 2:22 PM Stig Korsnes <stigkors...@gmail.com>
> wrote:
>
>> Hi,
>> Is there a way to uniquely spawn child seeds?
>> I'm doing Monte Carlo analysis, where I have n random processes, each
>> with their own generator.
>> All process models instantiate a generator with default_rng(), i.e.
>> ss = SeedSequence(); cs = ss.spawn(n), using cs[i] for process i. Now, the
>> problem I'm facing is that the results for an individual process depend on
>> the order of process initialization and the number of processes used.
>> However, if I could spawn children with a unique identifier, I would be
>> able to reproduce my individual results without having to pickle/log
>> states. For example, all my models have an id (tuple) field which is
>> hashable.
>> If I had the ability to call SeedSequence(x).spawn([objects]), where the
>> objects support hash(object), I would have reproducibility for all my
>> processes. I could do without the spawning, but then I would probably lose
>> independence when I do multiprocessing? Is there a way to achieve my goal
>> in the current version 1.21 of numpy?
>>
>
> I would probably not rely on `hash()` as it is only intended to be pretty
> good at getting distinct values from distinct inputs. If you can combine
> the tuple objects into a string of bytes in a reliable, collision-free way
> and use one of the cryptographic hashes to get them down to a 128-bit
> number, that'd be ideal. `int(joblib.hash(key), 16)`
> <https://joblib.readthedocs.io/en/latest/generated/joblib.hash.html>
> should do nicely. You can combine that with your main process's seed
> easily. SeedSequence can take arbitrary amounts of integer data and smoosh
> them all together. The spawning functionality builds off of that, but you
> can also just manually pass in lists of integers.
>
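
A minimal sketch of what `stronghash()` could look like, assuming the key is a tuple of ints and strings so that `repr(key)` is a stable, unambiguous encoding. It uses `hashlib.blake2b` from the standard library as a stand-in for `joblib.hash`; that substitution, the `repr`-based encoding, and the example key are illustrative assumptions, not what was suggested above. The mechanism is the same in both cases: the hash function yields a hexadecimal digest string, and `int(hex_string, 16)` parses it as a base-16 number. A hex digest carries no sign, so the result cannot be negative.

```python
import hashlib


def stronghash(key):
    """Map a key to a non-negative 128-bit integer (illustrative helper)."""
    # Assumption: repr(key) is stable across runs, which holds for tuples
    # of ints and strings but not for arbitrary objects.
    data = repr(key).encode("utf-8")
    hex_digest = hashlib.blake2b(data, digest_size=16).hexdigest()
    # Parse the 32 hex characters as a base-16 integer; always >= 0.
    return int(hex_digest, 16)


h = stronghash(("model", 3))
print(0 <= h < 2**128)
```

Unlike the built-in `hash()`, whose value for strings changes between interpreter sessions unless `PYTHONHASHSEED` is fixed, this gives the same integer for the same key on every run.
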
> Let's call that function `stronghash()`. Let's call your main process seed
> number `seed` (this is the thing that the user can set on the command-line
> or something you get from `secrets.randbits(128)` if you need a fresh one).
> Let's call the unique tuple `key`. You can build the `SeedSequence` for
> each job according to the `key` like so:
>
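> from numpy.random import SeedSequence
>
> root_ss = SeedSequence(seed)
> for key, data in jobs:
>     # Job-specific hash first, then the user-controlled root seed.
>     child_ss = SeedSequence([stronghash(key), seed])
>     submit_job(key, data, seed=child_ss)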
>
> Now each job will get its own unique stream regardless of the order the
> job is assigned. When the user reruns it with the same root `seed`, they
> will get the same results. When the user chooses a different `seed`, they
> will get another set of results (this is why you don't want to just use
> `SeedSequence(stronghash(key))` all by itself).
>
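
To check the order-independence claim, here is a small sketch under stated assumptions: `stronghash()` is the hypothetical `hashlib`-based stand-in rather than `joblib.hash`, and `draws_for()`, the seed value, and the keys are made up for illustration. Each key's stream depends only on the key and the root seed, so processing the keys in a different order gives the same numbers per key.

```python
import hashlib

import numpy as np
from numpy.random import SeedSequence, default_rng


def stronghash(key):
    # Hypothetical helper: 128-bit non-negative integer derived from repr(key).
    digest = hashlib.blake2b(repr(key).encode("utf-8"), digest_size=16)
    return int(digest.hexdigest(), 16)


def draws_for(key, seed, n=3):
    # Job-specific hash first, then the root seed, as in the loop above.
    child_ss = SeedSequence([stronghash(key), seed])
    return default_rng(child_ss).random(n)


seed = 12345  # arbitrary root seed for the demonstration
keys = [("a", 1), ("b", 2), ("c", 3)]

forward = {k: draws_for(k, seed) for k in keys}
backward = {k: draws_for(k, seed) for k in reversed(keys)}

same = all(np.array_equal(forward[k], backward[k]) for k in keys)
print("per-key results independent of processing order:", same)
```

Changing `seed` changes every key's stream, which is the reason for mixing the root seed in rather than seeding from the key hash alone.
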
> I put the job-specific seed data ahead of the main program's seed to be on
> the super-safe side. The spawning mechanism will append integers to the
> end, so there's a super-tiny chance somewhere down a long line of
> `root_ss.spawn()`s that there would be a collision (and I mean
> super-extra-tiny). But best practices cost nothing.
>
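
The appending behaviour mentioned above can be seen through the `spawn_key` attribute of `SeedSequence`. A short sketch, with an arbitrary seed value chosen only for illustration: each spawned child keeps the parent's entropy and gets one more integer on the end of its spawn key, and a child can be rebuilt directly from the root entropy plus that key.

```python
import numpy as np
from numpy.random import SeedSequence

root_ss = SeedSequence(12345)  # arbitrary seed for the demonstration
children = root_ss.spawn(3)

# Each child carries the parent's spawn key plus one appended index.
child_keys = [c.spawn_key for c in children]
print(child_keys)

# Spawning again from a child appends a further integer.
grandchild = children[1].spawn(1)[0]
print(grandchild.spawn_key)

# A child is equivalent to the root entropy combined with its spawn key.
rebuilt = SeedSequence(12345, spawn_key=children[1].spawn_key)
print(np.array_equal(rebuilt.generate_state(4), children[1].generate_state(4)))
```

Because spawned integers go on the end, putting the job-specific hash at the front of the list keeps the hand-built sequences clear of anything `root_ss.spawn()` would produce.
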
> I hope that helps and is not too confusing!
>
> --
> Robert Kern
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion
>