Thanks again Robert! I got rid of dict(state).

I'm not sure I followed you completely on the test case. The "calculator" I am writing will, for this specific use case, depend on ~200-1000 processes. Each process object returns ~1 million floats when its scenario method is called. If I am not mistaken, keeping all of these in memory at once would require 7-8 GiB (1 million float64s is ~8 MB per process). On top of that I would possibly have to add the size of the dependent calculations on these (though I would likely aggregate outside of testing).

A given object that depends on processes will calculate its results from 1-4 of them (1-4 x 1 million samples each, non-multiprocessing), and I will loop over objects with a process pool. So my reasoning is that running memory consumption would be (1-4) x the size of 1 million floats x worker processes, plus all the other overhead. Since sampling 1 million normals is pretty fast, I can happily live with sampling (vs. lookup in a presampled array), but since two objects might depend on the same process, they need the exact same array of samples. Hence the state. If I understood you correctly, another solution is to create a duplicate process with the same seed, instead of using one whose state I "reset".
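For what it's worth, a minimal sketch of the two options being compared, with a hypothetical seed value: rewinding a single Generator by restoring a saved bit_generator.state snapshot, versus building a second Generator from the same seeding material. Both produce the exact same array of samples.

```python
import numpy as np
from numpy.random import default_rng

# Option 1: snapshot the state, sample, then rewind and resample.
rng = default_rng(12345)           # 12345 is a placeholder seed
state = rng.bit_generator.state    # .state returns a fresh dict each call
a = rng.standard_normal(5)
rng.bit_generator.state = state    # rewind to the snapshot
b = rng.standard_normal(5)

# Option 2: a duplicate Generator seeded identically, no state juggling.
rng2 = default_rng(12345)
c = rng2.standard_normal(5)
```

Here a, b, and c are all identical, so two objects depending on the same process can each hold their own cheaply-seeded Generator instead of sharing one whose state gets reset.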
I promised that this could run on any laptop..

On Sun, Aug 29, 2021 at 02:42 Robert Kern <robert.k...@gmail.com> wrote:

> On Sat, Aug 28, 2021 at 5:56 AM Stig Korsnes <stigkors...@gmail.com>
> wrote:
>
>> Thank you again Robert.
>> I am using NamedTuple for my keys, which are also keys in a dictionary.
>> Each key will be unique (a tuple of a distinct int and an enum), so I am
>> thinking maybe the risk of producing a duplicate hash is not present,
>> but could as always be wrong :)
>
> Present, but possibly ignorably small. 128-bit spaces give enough
> breathing room for me to be comfortable; 64-bit spaces like what hash()
> will use for its results make me just a little claustrophobic.
>
> If the structure of the keys is pretty fixed, just these two integers
> (counting the enum as an integer), then I might just use both in the
> seeding material.
>
>     def get_key_seed(key: ComponentId, root_seed: int):
>         return np.random.SeedSequence(
>             [key.the_int, int(key.the_enum), root_seed]
>         )
>
>> For positive ints I followed this tip
>> https://stackoverflow.com/questions/18766535/positive-integer-from-python-hash-function
>> and did:
>>
>>     def stronghash(key: ComponentId):
>>         return ctypes.c_size_t(hash(key)).value
>
> np.uint64(possibly_negative_integer) will also work for this purpose
> (somewhat more reliably).
>
>> Since I will be using each process/random sample several times, and
>> keeping all of them in memory at once is not feasible (dimensionality),
>> I did the following:
>>
>>     self._rng = default_rng(cs)
>>     self._state = dict(self._rng.bit_generator.state)
>>
>>     def scenarios(self) -> npt.NDArray[np.float64]:
>>         self._rng.bit_generator.state = self._state
>>         ....
>>         return ....
>>
>> Would you consider this bad practice, or an ok solution?
>
> It's what that property is there for. No need to copy; `.state` creates
> a new dict each time.
> In a quick test, I measured a process with 1 million Generator instances
> to use ~1.5 GiB while 1 million state dicts used ~1.0 GiB (including all
> of the other overhead of Python and numpy; not a scientific test).
> Storing just the BitGenerator is half-way in between. That's something,
> but not a huge win. If that is really crossing the border from feasible
> to infeasible, you may be about to run into your limits anyway for other
> reasons. So balance that out against the complications of swapping state
> in and out of a single instance.
>
>> In Norway we have a saying which translates directly as: "He asked for
>> the finger... and took the whole arm."
>
> Well, when I craft an overly-complicated system, I feel responsible to
> help shepherd people along in using it well. :-)
>
> --
> Robert Kern
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@python.org
> https://mail.python.org/mailman/listinfo/numpy-discussion