On Mon, Oct 28, 2024 at 12:11 AM Andrew Nelson via NumPy-Discussion <
numpy-discussion@python.org> wrote:

> Hi all,
> is there a canonical way of serialising Generators (not via pickle).
>

Pickle is the canonical way. In particular, the pickle _machinery_ is going
to be the single source of truth for serialization. But I'll recapitulate
the important details here.

`Generator` itself is a stateless wrapper around the `BitGenerator`, so you
can ignore it and focus on the `BitGenerator` and its `SeedSequence` (and
the `SeedSequence` really only matters if you need to continue spawning).
`BitGenerator.__getstate__()` will give you `(bg_state_dict, seed_seq)` (a
`dict` and the `SeedSequence`). The `BitGenerator` state is going to be a
simple dict, but there will be arbitrary-sized ints and/or numpy arrays in
there. Third-party `BitGenerator`s can do whatever they like, so pickle's
about your only fallback there. For numpy `BitGenerator`s, the name of the
class will be in the key `'bit_generator'`, but for third-party ones,
there's no place for you to look by name.
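
To make the shape of that state dict concrete, here is a small sketch that
inspects two numpy-provided `BitGenerator`s. The seed value is arbitrary and
chosen only for illustration:

```python
import numpy as np

# PCG64 keeps its state as arbitrary-sized (128-bit) Python ints.
pcg_state = np.random.PCG64(12345).state
print(pcg_state['bit_generator'])          # class name, usable for lookup
print(type(pcg_state['state']['state']))   # a plain Python int

# MT19937 keeps its state as a numpy array of uint32 plus a position.
mt_state = np.random.MT19937(12345).state
print(mt_state['bit_generator'])
print(mt_state['state']['key'].dtype, mt_state['state']['key'].shape)
```

The two cases cover both kinds of values a serializer has to cope with: big
Python ints and integer numpy arrays.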

Notionally, `BitGenerator`s can accept any `ISeedSequence` implementation,
to allow for other kinds of seed massaging algorithms, but we haven't seen
much (any) interest in that, so you can probably safely consider
`SeedSequence` proper. Because Cython does default pickle stuff under the
covers which was sufficient, we don't have an explicit `__getstate__()` for
you to peruse, but you can look at our constructor, which indicates what
can be passed; each one ends up as an attribute with the same name. You'll
want all of them for serialization purposes. Note that `entropy` can be an
arbitrary-sized int or a sequence of arbitrary-sized ints. If a sequence,
it could be a list or a tuple, but this does not need to be preserved; any
sequence type will do on deserialization. `spawn_key` should always be a
tuple of bounded-size ints (each should fit into a uint32, but will be a
plain Python int). `pool_size` is important configuration, though rarely
modified (and should be either 4 or 8, but notionally folks could choose
otherwise). `n_children_spawned` is important state if there's been
spawning and you want to continue spawning later correctly/reproducibly.
But that's it.
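
As a rough sketch of what that means in practice, the four attributes can be
copied out and fed straight back into the constructor. The entropy and
spawn key values below are made up for illustration:

```python
import numpy as np

# A SeedSequence with a sequence-valued entropy and a non-empty spawn_key.
ss = np.random.SeedSequence(entropy=[2**70 + 1, 7], spawn_key=(3,))
ss.spawn(2)  # advances n_children_spawned to 2

# The four constructor arguments are all there is to save.
fields = dict(entropy=ss.entropy, spawn_key=ss.spawn_key,
              pool_size=ss.pool_size,
              n_children_spawned=ss.n_children_spawned)

# Rebuild it; a list works for spawn_key too, since it is coerced to a tuple.
restored = np.random.SeedSequence(**fields)

# Both now produce the same next child, so spawning continues reproducibly.
next_original = ss.spawn(1)[0]
next_restored = restored.spawn(1)[0]
print(next_original.spawn_key, next_restored.spawn_key)
```

Dropping `n_children_spawned` would make the restored object start handing
out children from index 0 again, duplicating streams that were already used.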

So your ultimate serialization needs to handle arbitrary-sized integers,
lists/tuples of integers, and numpy arrays (of various integer dtypes).
JSON with a careful encoder/decoder that can handle those cases would be
fine.
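
One possible shape for such an encoder/decoder is sketched below. The
`__ndarray__` tag and the helper names `encode_default` and `decode_hook` are
my own inventions for this example, not anything numpy defines. Python's
`json` module already handles arbitrary-sized ints, so only the arrays need
special treatment:

```python
import json
import numpy as np

def encode_default(obj):
    # Called by json.dumps() only for objects it cannot serialize itself.
    if isinstance(obj, np.ndarray):
        return {'__ndarray__': obj.tolist(), 'dtype': str(obj.dtype)}
    if isinstance(obj, np.integer):
        return int(obj)
    raise TypeError(f'cannot serialize {type(obj).__name__}')

def decode_hook(d):
    # Called by json.loads() for every decoded JSON object.
    if '__ndarray__' in d:
        return np.array(d['__ndarray__'], dtype=d['dtype'])
    return d

bg = np.random.MT19937(2024)
text = json.dumps(bg.state, default=encode_default)
restored_state = json.loads(text, object_hook=decode_hook)

# Load the decoded state into a fresh BitGenerator of the same class.
bg2 = np.random.MT19937()
bg2.state = restored_state
print(np.array_equal(bg.random_raw(5), bg2.random_raw(5)))
```

Note that JSON turns tuples into lists, which is harmless here because any
sequence type is accepted on the way back in.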

> Would the following be reasonable for saving and restoring state:
>

No, you're missing all of the actual state. The `entropy` of the
`SeedSequence` is the original user-input seed, not the current state of
the `BitGenerator`.
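
A quick way to see the difference, as a sketch with an arbitrary seed: draw
some numbers and compare what changed.

```python
import numpy as np

ss = np.random.SeedSequence(42)
bg = np.random.PCG64(ss)
rng = np.random.Generator(bg)

before = bg.state['state']['state']
rng.random(10)  # advance the generator
after = bg.state['state']['state']

# The entropy is still the original seed; only the BitGenerator state moved.
entropy_unchanged = (ss.entropy == 42)
state_advanced = (before != after)
print(entropy_unchanged, state_advanced)
```

Restoring from `entropy` alone would therefore rewind the generator to its
starting point rather than resume where it left off.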

> Specifically I'm interested in a safe way (i.e. no pickle) of
> saving/restoring Generator state via HDF5 file storage.

By restricting your domain to only numpy-provided `BitGenerator`s and true
`SeedSequence`s, you can do this. In full generality, one cannot.

Untested:

```
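import numpy as np

def rng_dict(rng):
    # Current BitGenerator state: a plain dict of ints and/or numpy arrays.
    bg_state = rng.bit_generator.state
    # SeedSequence configuration, needed to keep spawning reproducibly.
    ss = rng.bit_generator.seed_seq
    ss_dict = dict(entropy=ss.entropy, spawn_key=ss.spawn_key,
                   pool_size=ss.pool_size,
                   n_children_spawned=ss.n_children_spawned)
    return dict(bg_state=bg_state, seed_seq=ss_dict)

def rng_fromdict(d):
    bg_state = d['bg_state']
    ss = np.random.SeedSequence(**d['seed_seq'])
    # Look up the numpy BitGenerator class by its recorded name.
    bg = getattr(np.random, bg_state['bit_generator'])(ss)
    # Overwrite the freshly seeded state with the saved one.
    bg.state = bg_state
    rng = np.random.Generator(bg)
    return rng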
```
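
Since the point is to avoid trusting the file contents, one possible
hardening (equally untested in anger, and only a sketch) is to look the class
up in an explicit allow-list instead of calling `getattr` on `np.random` with
a string read from disk. The names `ALLOWED` and `restore_checked` are
hypothetical, and the list assumes a numpy recent enough to have `PCG64DXSM`:

```python
import numpy as np

# Only numpy-provided BitGenerators may be reconstructed.
ALLOWED = {cls.__name__: cls for cls in (
    np.random.MT19937, np.random.PCG64, np.random.PCG64DXSM,
    np.random.Philox, np.random.SFC64)}

def restore_checked(d):
    name = d['bg_state']['bit_generator']
    if name not in ALLOWED:
        raise ValueError(f'unsupported BitGenerator: {name!r}')
    bg = ALLOWED[name](np.random.SeedSequence(**d['seed_seq']))
    bg.state = d['bg_state']
    return np.random.Generator(bg)

# Round trip: save after a few draws, restore, and compare the next draws.
ss = np.random.SeedSequence(2024)
rng = np.random.Generator(np.random.PCG64(ss))
rng.random(5)
saved = dict(
    bg_state=rng.bit_generator.state,
    seed_seq=dict(entropy=ss.entropy, spawn_key=ss.spawn_key,
                  pool_size=ss.pool_size,
                  n_children_spawned=ss.n_children_spawned))
clone = restore_checked(saved)
print(np.array_equal(rng.random(3), clone.random(3)))
```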

-- 
Robert Kern