On Fri, Nov 17, 2023 at 12:10 PM Robert Kern <robert.k...@gmail.com> wrote:
>
> On Fri, Nov 17, 2023 at 1:54 PM Stefan van der Walt via NumPy-Discussion 
> <numpy-discussion@python.org> wrote:
>>
>> Hi all,
>>
>> I am trying to sample k N-dimensional vectors from a uniform distribution 
>> without replacement.
>> It seems like this should be straightforward, but I can't seem to pin it 
>> down.
>>
>> Specifically, I am trying to get random indices into a d0 x d1 x d2 x ... x 
>> dN-1 array.
>>
>> I thought about sneaking a structured dtype into `rng.integers`, but of 
>> course that doesn't work.
>>
>> If we had a string sampler, I could sample k unique words (consisting of 
>> digits), and convert them to indices.
>>
>> I could over-sample and filter out the non-unique indices. Or iteratively 
>> draw blocks of samples until I've built up my k unique indices.
>>
>> The most straightforward solution would be to flatten the indices and sample 
>> from those. The integers get large quickly, though. The rng.integers 
>> docstring suggests that it can handle object arrays for very large integers:
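The flatten-and-sample idea above can be sketched with `rng.choice` and `np.unravel_index` for arrays that fit in memory (`shape` and `k` here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
shape = (4, 5, 6)   # illustrative d0 x d1 x d2
k = 10

# Draw k distinct flat indices without replacement, then map each one
# back to an N-dimensional index.  Caveat: Generator.choice with
# replace=False may materialize a permutation of the whole flat range,
# so this only helps while np.prod(shape) fits comfortably in memory.
flat = rng.choice(np.prod(shape), size=k, replace=False)
idx = np.stack(np.unravel_index(flat, shape), axis=-1)  # shape (k, 3)
```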
>>
>> > When using broadcasting with uint64 dtypes, the maximum value (2**64)
>> > cannot be represented as a standard integer type.
>> > The high array (or low if high is None) must have object dtype, e.g., 
>> > array([2**64]).
>>
>> But, that doesn't work:
>>
>> In [35]: rng.integers(np.array([2**64], dtype=object))
>> ValueError: high is out of bounds for int64
>>
>> Is there an elegant way to handle this problem?
>
>
> The default dtype for the result of `integers()` is the signed `int64`. If 
> you want to sample from the range `[0, 2**64)`, you need to specify 
> `dtype=np.uint64`. The text you are reading is saying that if you want to 
> specify exactly `2**64` as the exclusive upper bound, you won't be able to 
> do it with a `np.uint64` array/scalar, because it's one above the maximum 
> for that dtype; you'll have to use a plain Python `int` object or a 
> `dtype=object` array in order to represent `2**64`. It is not saying that 
> you can draw arbitrary-sized integers.
>
> >>> rng.integers(2**64, dtype=np.uint64)
> 11569248114014186612
>
> If the arrays you are drawing indices for are real in-memory arrays for 
> present-day 64-bit computers, this should be adequate. If it's a notional 
> array that is larger, then you'll need actual arbitrary-sized integer 
> sampling. The builtin `random.randrange()` will do arbitrary-sized integers 
> and is quite reasonable for this task. If you want it to use our 
> BitGenerators underneath for clean PRNG state management, this is quite 
> doable with a simple subclass of `random.Random`: 
> https://github.com/numpy/numpy/issues/24458#issuecomment-1685022258
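The linked comment spells out the actual recipe; a minimal sketch of the idea (the class name and the 32-bit-word assembly are my own illustration) might look like:

```python
import random

import numpy as np


class NumPyRandom(random.Random):
    """random.Random driven by a NumPy BitGenerator (illustrative name).

    Only random() and getrandbits() need overriding for randrange()
    to work with arbitrary-sized integers.
    """

    def __init__(self, bit_generator=None):
        self._gen = np.random.Generator(bit_generator or np.random.PCG64())
        super().__init__()

    def random(self):
        return self._gen.random()

    def getrandbits(self, k):
        # Assemble k random bits from 32-bit words drawn from the Generator,
        # then drop the excess low bits.
        n_words = -(-k // 32)  # ceiling division
        words = self._gen.integers(0, 2**32, size=n_words, dtype=np.int64)
        out = 0
        for w in words:
            out = (out << 32) | int(w)
        return out >> (n_words * 32 - k)


r = NumPyRandom(np.random.PCG64(123))
x = r.randrange(10**30)   # arbitrary-sized integers work
```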

Wouldn't it be better to just use random.randint to generate the
vectors directly at that point? If the number of possibilities N is
more than 2**64, the birthday-paradox odds of generating the same
vector twice are on the order of 1 in 2**32 (with k draws, the
collision probability is roughly k**2 / 2N). And you can always do a
unique() rejection check if you really want to be careful.
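The draw-and-reject suggestion can be sketched like so (the function name and signature are illustrative, not an existing API):

```python
import numpy as np


def sample_unique_indices(shape, k, rng=None):
    """Draw k distinct index vectors into an array of the given shape.

    Illustrative helper: keep drawing batches of candidate vectors and
    deduplicating with np.unique until k distinct ones accumulate.
    Efficient only while k is small relative to np.prod(shape), so
    collisions stay rare.
    """
    rng = np.random.default_rng() if rng is None else rng
    dims = np.asarray(shape)
    out = np.empty((0, len(dims)), dtype=np.int64)
    while len(out) < k:
        batch = rng.integers(0, dims, size=(k - len(out), len(dims)))
        out = np.unique(np.concatenate([out, batch]), axis=0)
    return out[:k]
```

One caveat: np.unique sorts its rows, so the returned indices come back in lexicographic rather than random order; shuffle them afterwards if the order matters.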

Aaron Meurer

>
> --
> Robert Kern
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: asmeu...@gmail.com