On Fri, Nov 17, 2023 at 12:10 PM Robert Kern <robert.k...@gmail.com> wrote:
>
> On Fri, Nov 17, 2023 at 1:54 PM Stefan van der Walt via NumPy-Discussion
> <numpy-discussion@python.org> wrote:
>>
>> Hi all,
>>
>> I am trying to sample k N-dimensional vectors from a uniform distribution
>> without replacement. It seems like this should be straightforward, but I
>> can't seem to pin it down.
>>
>> Specifically, I am trying to get random indices into a d0 x d1 x ... x
>> d(N-1) array.
>>
>> I thought about sneaking a structured dtype into `rng.integers`, but of
>> course that doesn't work.
>>
>> If we had a string sampler, I could sample k unique words (consisting of
>> digits) and convert them to indices.
>>
>> I could over-sample and filter out the non-unique indices, or iteratively
>> draw blocks of samples until I've built up my k unique indices.
>>
>> The most straightforward solution would be to flatten the indices and to
>> sample from those. The integers get large quickly, though. The
>> rng.integers docstring suggests that it can handle object arrays for very
>> large integers:
>>
>> > When using broadcasting with uint64 dtypes, the maximum value (2**64)
>> > cannot be represented as a standard integer type. The high array (or
>> > low if high is None) must have object dtype, e.g., array([2**64]).
>>
>> But that doesn't work:
>>
>> In [35]: rng.integers(np.array([2**64], dtype=object))
>> ValueError: high is out of bounds for int64
>>
>> Is there an elegant way to handle this problem?
>
> The default dtype for the result of `integers()` is the signed `int64`. If
> you want to sample from the range `[0, 2**64)`, you need to specify
> `dtype=np.uint64`.
> The text you are reading is saying that if you want to specify exactly
> `2**64` as the exclusive upper bound, you won't be able to do it with a
> `np.uint64` array/scalar, because it's one above the bound for that dtype;
> you'll have to use a plain Python `int` object or a `dtype=object` array
> in order to represent `2**64`. It is not saying that you can draw
> arbitrary-sized integers.
>
> >>> rng.integers(2**64, dtype=np.uint64)
> 11569248114014186612
>
> If the arrays you are drawing indices for are real in-memory arrays on
> present-day 64-bit computers, this should be adequate. If it's a notional
> array that is larger, then you'll need actual arbitrary-sized integer
> sampling. The builtin `random.randrange()` will do arbitrary-sized
> integers and is quite reasonable for this task. If you want it to use our
> BitGenerators underneath for clean PRNG state management, this is quite
> doable with a simple subclass of `random.Random`:
> https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258
Wouldn't it be better to just use random.randint to generate the vectors
directly at that point? If the number of possibilities is more than 2**64,
the birthday odds of generating the same vector twice are on the order of
1 in 2**32. And you can always do a unique() rejection check if you really
want to be careful.

Aaron Meurer

> --
> Robert Kern
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: asmeu...@gmail.com
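[Aaron's approach can be sketched as follows; `sample_vectors_birthday` is an illustrative name, and NumPy's `Generator.integers` with a broadcast upper bound stands in for `random.randint`.]

```python
import numpy as np

def sample_vectors_birthday(shape, k, seed=None):
    """Draw k index vectors directly, relying on birthday-bound odds.

    When prod(shape) is much larger than k**2, a duplicate draw is
    vanishingly unlikely; the unique() rejection check below catches the
    rare collision and simply redraws.
    """
    rng = np.random.default_rng(seed)
    while True:
        # Broadcast the per-axis exclusive upper bounds across k rows.
        cand = rng.integers(0, shape, size=(k, len(shape)))
        if len(np.unique(cand, axis=0)) == k:
            return cand
```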