On Mon, Oct 28, 2024 at 12:11 AM Andrew Nelson via NumPy-Discussion <numpy-discussion@python.org> wrote:

> Hi all,
> is there a canonical way of serialising Generators (not via pickle).

Pickle is the canonical way. In particular, the pickle _machinery_ is going to be the single source of truth for serialization. But I'll recapitulate the important details here.

`Generator` itself is a stateless wrapper around the `BitGenerator`, so you can ignore it and focus on the `BitGenerator` and its `SeedSequence` (and the `SeedSequence`, really, only if you need to continue spawning).

`BitGenerator.__getstate__()` will give you `(bg_state_dict, seed_seq)` (a `dict` and the `SeedSequence`). The `BitGenerator` state is going to be a simple dict, but there will be arbitrary-sized ints and/or numpy arrays in there. Third-party `BitGenerator`s can do whatever they like, so pickle's about your only fallback there. For numpy `BitGenerator`s, the name of the class will be in the key `'bit_generator'`, but for third-party ones, there's no place for you to look them up by name.

Notionally, `BitGenerator`s can accept any `ISeedSequence` implementation, to allow for other kinds of seed massaging algorithms, but we haven't seen much (any) interest in that, so you can probably safely consider only `SeedSequence` proper. Because the default pickle support that Cython provides under the covers was sufficient, we don't have an explicit `__getstate__()` for you to peruse, but you can look at our constructor, which indicates what can be passed; each argument ends up as an attribute with the same name. You'll want all of them for serialization purposes.

Note that `entropy` can be an arbitrary-sized int or a sequence of arbitrary-sized ints. If it is a sequence, it could be a list or a tuple, but this does not need to be preserved; any sequence type will do on deserialization.

`spawn_key` should always be a tuple of bounded-size ints (each should fit into a uint32, but will be a plain Python int).

`pool_size` is important configuration, though rarely modified (it should be either 4 or 8, but notionally folks could choose otherwise).

`n_children_spawned` is important state if there's been spawning and you want to continue spawning later correctly/reproducibly.

But that's it. So your ultimate serialization needs to handle arbitrary-sized integers, lists/tuples of integers, and numpy arrays (of various integer dtypes). JSON with a careful encoder/decoder that can handle those cases would be fine.

> Would the following be reasonable for saving and restoring state:

No, you're missing all of the actual state. The `entropy` of the `SeedSequence` is the original user-input seed, not the current state of the `BitGenerator`.

> Specifically I'm interested in a safe way (i.e. no pickle) of saving/restoring Generator state via HDF5 file storage.

By restricting your domain to only numpy-provided `BitGenerator`s and true `SeedSequence`s, you can do this. In full generality, one cannot. Untested:

```
import numpy as np


def rng_dict(rng):
    # Current BitGenerator state: a dict of ints and/or numpy arrays.
    bg_state = rng.bit_generator.state
    # SeedSequence configuration and spawning state.
    ss = rng.bit_generator.seed_seq
    ss_dict = dict(entropy=ss.entropy, spawn_key=ss.spawn_key,
                   pool_size=ss.pool_size,
                   n_children_spawned=ss.n_children_spawned)
    return dict(bg_state=bg_state, seed_seq=ss_dict)


def rng_fromdict(d):
    bg_state = d['bg_state']
    ss = np.random.SeedSequence(**d['seed_seq'])
    # Look up the numpy-provided BitGenerator class by its recorded name.
    bg = getattr(np.random, bg_state['bit_generator'])(ss)
    # Overwrite the freshly seeded state with the saved one.
    bg.state = bg_state
    rng = np.random.Generator(bg)
    return rng
```

--
Robert Kern