On Mon, Dec 1, 2025 at 10:44 AM Robert Kern <[email protected]> wrote:
>
> On Sun, Nov 30, 2025 at 9:11 PM Warren Weckesser via NumPy-Discussion 
> <[email protected]> wrote:
>>
>> NumPy implements the Zipf distribution (also known as the zeta
>> distribution), as `random.Generator.zipf(a, size=None)`. This is a
>> discrete distribution with infinite support {1, 2, 3, ...}. The PMF of
>> the distribution is
>>
>>     p(k, a) = k**-a / zeta(a), a > 1, k = 1, 2, 3, ...,
>>
>> where zeta(a) is the Riemann zeta function
>> (https://en.wikipedia.org/wiki/Riemann_zeta_function). (Technically,
>> NumPy's implementation is limited to the maximum representable 64-bit
>> integer, but the intent is to model the infinite support.) The
>> distribution is implemented in SciPy as `scipy.stats.zipf`.
>>
>> A variation of this distribution is the finite Zipf distribution. It
>> has finite support {1, 2, 3, ..., n}, and the PMF is
>>
>>     p(k, a, n) = k**-a / H(n, a), a >= 0, k = 1, 2, ..., n
>>
>> where H(n, a) is the generalized harmonic number. In SciPy, this
>> distribution is implemented in `scipy.stats.zipfian`.
>>
>> I have an implementation of an efficient rejection method for the
>> finite distribution that is currently in a SciPy pull request:
>> https://github.com/scipy/scipy/pull/24011.
>>
>> I think this would make a good addition to NumPy. We wouldn't need a
>> new method; this could be implemented by adding the `n` parameter to
>> the existing `zipf` method of `numpy.random.Generator`, so the
>> signature becomes
>>
>>     zipf(a, size=None, *, n=None)
>
>
> In general, we have been rejecting new "named" distributions to `Generator` 
> now that we have a clean C/Cython API for folks to build efficient sampling 
> methods in their own packages. The kinds of additions to `Generator` that we 
> might consider are more along the lines of "utility" building blocks. Among 
> other things, the `Generator` API is being emulated by other Array API 
> implementations, and I want to be conscious of giving them more work to do.
>
> So I guess my question is, is this just a new "named" distribution that 
> happens to be related enough to an existing one that we can shove it into the 
> same method? Or is it fixing a practical problem that prevents folks from 
> using the Zipf distribution in practice? I.e. does ~everyone who 
> theoretically needs a Zipf instead approximate it with the Zipfian because 
> the (implementation-limited) "infinite" support is too unwieldy? If it's the 
> latter, then I think it's a reasonable addition. If it's just the former, 
> though, I'm of a mind to stick to our rubric; scipy.stats is a good place for 
> such an implementation.
>


After thinking about this more, I'm going to focus on the SciPy
enhancement for now, and hold off on advocating for its inclusion in
NumPy. If, when the SciPy work is done, I still think it is worth
adding to NumPy, I'll resume the NumPy proposal.

Having said that, here's a follow-up to Robert's questions.

> So I guess my question is, is this just a new "named" distribution that
> happens to be related enough to an existing one that we can shove it
> into the same method?

Not exactly. Recognizing that our current implementation is, in fact,
a finite Zipf distribution with upper support bound `n=2**63-1`, I'm
proposing that we make the parameter `n` available to the user. That
is, the proposal is to generalize the distribution, so our current
implementation is just a special case.
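To make that concrete, here's a small numerical self-check (an
illustrative sketch, not the proposed implementation): the generalized
harmonic number H(n, a) converges to zeta(a) as n grows, so the finite
Zipf PMF with a very large n is, for practical purposes, the
"infinite" Zipf PMF.

```python
import math


def harmonic(n, a):
    """Generalized harmonic number H(n, a) = sum_{k=1}^{n} k**-a."""
    return sum(k**-a for k in range(1, n + 1))


def zipfian_pmf(k, a, n):
    """Finite Zipf PMF: p(k, a, n) = k**-a / H(n, a)."""
    return k**-a / harmonic(n, a)


# For a = 2, zeta(2) = pi**2/6, so H(n, 2) -> pi**2/6 as n -> inf,
# and the finite Zipf PMF approaches the Zipf (zeta) PMF.
a = 2.0
zeta_a = math.pi**2 / 6
for n in (10, 1000, 100000):
    print(n, harmonic(n, a), zeta_a)
```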

> Or is it fixing a practical problem that prevents folks from using the
> Zipf distribution in practice? I.e. does ~everyone who theoretically
> needs a Zipf instead approximate it with the Zipfian because the
> (implementation-limited) "infinite" support is too unwieldy?

FWIW, in the Wikipedia article on "Zipf's law"
(https://en.wikipedia.org/wiki/Zipf%27s_law), the simplest form of the
distribution has finite upper bound and the exponent is 1. NumPy's
`zipf` does not cover this case.
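For illustration, that finite-support case (including a = 1, or any
a >= 0) can be sampled by inverting the CDF. This is just a simple
O(n)-memory NumPy sketch to show the distribution is well defined
there; it is not the rejection method from the SciPy pull request.

```python
import numpy as np


def zipfian_sample(a, n, size, rng=None):
    """Sample the finite Zipf (Zipfian) distribution on {1, ..., n}
    with exponent a >= 0 by inverting the CDF.

    Illustrative only: builds the full table of n weights, so it is
    impractical for large n (which is what a rejection method avoids).
    """
    rng = np.random.default_rng() if rng is None else rng
    weights = np.arange(1, n + 1) ** -np.float64(a)
    cdf = np.cumsum(weights / weights.sum())
    u = rng.random(size)
    # searchsorted inverts the CDF; clip guards against cdf[-1] < 1
    # from floating-point rounding.
    return np.minimum(np.searchsorted(cdf, u), n - 1) + 1


rng = np.random.default_rng(12345)
x = zipfian_sample(1.0, 50, 10000, rng)
print(x.min(), x.max())
```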

The article also mentions the generalization known as the
Zipf-Mandelbrot distribution, and the article claims that when fitting
that distribution to the frequency of words in a body of text, the
exponent is typically close to 1. So it is not inconceivable that
fitting a model to such data could end up with an exponent less
than 1. But that is not possible with NumPy's `zipf`.

As an anecdotal data point, I got into working on this because of
https://github.com/scipy/scipy/issues/23487, where a user was trying
to use `zipfian` with values of `a` less than 1, and encountered the
poor performance of SciPy's existing random variate generator.

Warren
_______________________________________________
NumPy-Discussion mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3//lists/numpy-discussion.python.org
Member address: [email protected]
