On Sun, Nov 30, 2025 at 9:11 PM Warren Weckesser via NumPy-Discussion <
[email protected]> wrote:

> NumPy implements the Zipf distribution (also known as the zeta
> distribution), as `random.Generator.zipf(a, size=None)`. This is a
> discrete distribution with infinite support {1, 2, 3, ...}. The PMF of
> the distribution is
>
>     p(k, a) = k**-a / zeta(a), a > 1, k = 1, 2, 3, ...,
>
> where zeta(a) is the Riemann zeta function
> (https://en.wikipedia.org/wiki/Riemann_zeta_function). (Technically,
> NumPy's implementation is limited to the maximum representable 64-bit
> integer, but the intent is to model the infinite support.) The
> distribution is implemented in SciPy as `scipy.stats.zipf`.
>
> A variation of this distribution is the finite Zipf distribution. It
> has finite support {1, 2, 3, ..., n}, and the PMF is
>
>     p(k, a, n) = k**-a / H(n, a), a >= 0, k = 1, 2, ..., n
>
> where H(n, a) is the generalized harmonic number. In SciPy, this
> distribution is implemented in `scipy.stats.zipfian`.
>
> I have an implementation of an efficient rejection method for the
> finite distribution that is currently in a SciPy pull request:
> https://github.com/scipy/scipy/pull/24011.
>
> I think this would make a good addition to NumPy. We wouldn't need a
> new method; this could be implemented by adding the `n` parameter to
> the existing `zipf` method of `numpy.random.Generator`, so the
> signature becomes
>
>     zipf(a, size=None, *, n=None)
>
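
For anyone following along, the finite PMF quoted above can be pinned down
in a few lines of plain Python. This is only a sketch of the definition, not
the rejection method from the SciPy pull request; the helper names
`generalized_harmonic` and `finite_zipf_pmf` are made up for illustration
and are not part of NumPy or SciPy.

```python
import math


def generalized_harmonic(n, a):
    """H(n, a) = sum of k**-a for k = 1..n (hypothetical helper)."""
    return math.fsum(k ** -a for k in range(1, n + 1))


def finite_zipf_pmf(k, a, n):
    """PMF of the finite Zipf distribution on {1, ..., n}, with a >= 0."""
    if not 1 <= k <= n:
        return 0.0  # outside the support
    return k ** -a / generalized_harmonic(n, a)


# a = 0 is allowed for the finite distribution and gives a uniform PMF.
print(finite_zipf_pmf(3, 0, 10))

# For a > 1 and large n, H(n, a) approaches zeta(a), so the finite PMF
# approaches the infinite-support one; zeta(2) = pi**2 / 6.
print(finite_zipf_pmf(1, 2.0, 100_000), 6 / math.pi ** 2)
```

The direct sum costs O(n) per evaluation, which is fine for checking the
definition but is not how one would sample for large `n`.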

In general, we have been rejecting new "named" distributions
<https://github.com/numpy/numpy/issues/9525#issuecomment-1643003504> to
`Generator` now that we have a clean C/Cython API for folks to build
efficient sampling methods in their own packages. The kinds of additions to
`Generator` that we might consider are more along the lines of "utility"
building blocks. Among other things, the `Generator` API is being emulated
by other Array API implementations, and I want to be conscious of giving
them more work to do.

So I guess my question is, is this just a new "named" distribution that
happens to be related enough to an existing one that we can shove it into
the same method? Or is it fixing a practical problem that prevents folks
from using the Zipf distribution in practice? I.e. does ~everyone who
theoretically needs a Zipf instead approximate it with the Zipfian because
the (implementation-limited) "infinite" support is too unwieldy? If it's
the latter, then I think it's a reasonable addition. If it's just the
former, though, I'm of a mind to stick to our rubric; scipy.stats is a good
place for such an implementation.
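
To make the "building blocks" point concrete: for moderate `n`, the finite
distribution can already be sampled with what `Generator` provides today, by
normalizing the weights and handing them to `Generator.choice`. The sketch
below assumes `n` is small enough that an O(n) probability table is
acceptable; it is not the rejection method from the SciPy pull request, and
the name `sample_finite_zipf` is hypothetical.

```python
import numpy as np


def sample_finite_zipf(rng, a, n, size=None):
    """Draw from the finite Zipf distribution on {1, ..., n} (sketch only)."""
    k = np.arange(1, n + 1, dtype=np.float64)
    weights = k ** -a                 # unnormalized PMF, k**-a
    pmf = weights / weights.sum()     # divide by H(n, a)
    # choice(n, ...) samples from {0, ..., n-1}; shift to {1, ..., n}.
    return rng.choice(n, size=size, p=pmf) + 1


rng = np.random.default_rng(12345)
samples = sample_finite_zipf(rng, a=1.5, n=1000, size=10_000)
print(samples.min(), samples.max())
```

This builds the whole table on every call, so it does not scale to very
large `n`, which is where a dedicated sampler would earn its keep.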

-- 
Robert Kern