On Thu, Aug 26, 2021 at 2:22 PM Stig Korsnes <stigkors...@gmail.com> wrote:
> Hi,
>
> Is there a way to uniquely spawn child seeds? I'm doing Monte Carlo
> analysis, where I have n random processes, each with their own
> generator. All process models instantiate a generator with
> default_rng(), i.e. ss = SeedSequence(); cs = ss.spawn(n), using cs[i]
> for process i. Now, the problem I'm facing is that the results of an
> individual process depend on the order of process initialization and
> on the number of processes used. However, if I could spawn children
> with a unique identifier, I would be able to reproduce my individual
> results without having to pickle/log states. For example, all my
> models have an id (tuple) field which is hashable. If I had the
> ability to do SeedSequence(x).spawn([objects]), where the objects
> support hash(object), I would have reproducibility for all my
> processes. I could do without the spawning, but then I would probably
> lose independence when I do multiproc? Is there a way to achieve my
> goal in the current version 1.21 of numpy?

I would probably not rely on `hash()`, as it is only intended to be
pretty good at getting distinct values from distinct inputs. If you can
combine the tuple objects into a string of bytes in a reliable,
collision-free way and use one of the cryptographic hashes to get them
down to a 128-bit number, that'd be ideal. `int(joblib.hash(key), 16)`
should do nicely:

  https://joblib.readthedocs.io/en/latest/generated/joblib.hash.html

You can combine that with your main process's seed easily. SeedSequence
can take arbitrary amounts of integer data and smoosh them all
together. The spawning functionality builds off of that, but you can
also just manually pass in lists of integers.

Let's call that hashing function `stronghash()`. Let's call your main
process seed number `seed` (this is the thing that the user can set on
the command line, or that you get from `secrets.randbits(128)` if you
need a fresh one). Let's call the unique tuple `key`. You can build the
`SeedSequence` for each job according to the `key` like so:

    root_ss = SeedSequence(seed)
    for key, data in jobs:
        child_ss = SeedSequence([stronghash(key), seed])
        submit_job(key, data, seed=child_ss)

Now each job will get its own unique stream regardless of the order in
which the jobs are assigned. When the user reruns the program with the
same root `seed`, they will get the same results. When the user chooses
a different `seed`, they will get another set of results (this is why
you don't want to just use `SeedSequence(stronghash(key))` all by
itself).

I put the job-specific seed data ahead of the main program's seed to be
on the super-safe side. The spawning mechanism will append integers to
the end, so there's a super-tiny chance, somewhere down a long line of
`root_ss.spawn()`s, that there would be a collision (and I mean
super-extra-tiny). But best practices cost nothing.

I hope that helps and is not too confusing!

--
Robert Kern
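For concreteness, here is a minimal sketch of the `stronghash()` helper
described above, assuming the job keys are tuples of strings and ints.
The key value and fixed seed below are purely illustrative, and repr()
stands in for whatever reliable, collision-free byte serialization you
prefer (e.g. `int(joblib.hash(key), 16)` from the reply):

    import hashlib

    from numpy.random import SeedSequence, default_rng

    def stronghash(key):
        # Reduce the key tuple to a 128-bit integer with a cryptographic
        # hash. repr() is a simple, deterministic byte encoding for
        # tuples of strings/ints; any collision-free serialization
        # would work just as well.
        digest = hashlib.blake2b(repr(key).encode("utf-8"), digest_size=16)
        return int.from_bytes(digest.digest(), "big")

    seed = 12345  # user-supplied, or secrets.randbits(128) for a fresh one
    key = ("model", 3)  # hypothetical per-process id tuple

    # The same (seed, key) pair always yields the same stream, no matter
    # how many processes run or in what order they are initialized.
    rng = default_rng(SeedSequence([stronghash(key), seed]))
    print(rng.standard_normal(3))

Rerunning with the same `seed` and `key` reproduces the same draws;
changing either one gives an independent stream.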