NumPy implements the Zipf distribution (also known as the zeta
distribution), as `random.Generator.zipf(a, size=None)`. This is a
discrete distribution with infinite support {1, 2, 3, ...}. The PMF of
the distribution is

    p(k, a) = k**-a / zeta(a), a > 1, k = 1, 2, 3, ...,

where zeta(a) is the Riemann zeta function
(https://en.wikipedia.org/wiki/Riemann_zeta_function). (Technically,
NumPy's implementation is limited to the maximum representable 64 bit
integer integer, but the intent is to model the infinite support.) The
distribution is implemented in SciPy as `scipy.stats.zipf`.

A variation of this distribution is the finite Zipf distribution. It
has finite support {1, 2, 3, ..., n}, and the PMF is

    p(k, a, n) = k**-a / H(n, a), a >= 0, k = 1, 2, ..., n

where H(n, a) is the generalized harmonic number. In SciPy, this
distribution is implemented in `scipy.stats.zipfian`.

I have an implementation of an efficient rejection method for the
finite distribution that is currently in a SciPy pull request:
https://github.com/scipy/scipy/pull/24011.

I think this would make a good addition to NumPy. We wouldn't need a
new method; this could be implemented by adding the `n` parameter to
the existing `zipf` method of `numpy.random.Generator`, so the
signature becomes

    zipf(a, size=None, *, n=None)

The default n=None preserves the current behavior. When `n` is a
positive integer, variates from the finite distribution are generated.
Because the finite distribution has no convergence limitation for the
normalizing constant, the constraint on `a` in that case can be
broadened to `a >= 0`.

Notes on the implementation are at
https://github.com/WarrenWeckesser/experiments/tree/main/python/numpy/random-cython/docs,
and some validation scripts are in
https://github.com/WarrenWeckesser/experiments/tree/main/python/scipy/test-zipfian-distr.

API-wise, this is a small change to NumPy. Also, the algorithm is a
straightforward implementation of the rejection method (well,
"straightforward" if you are already familiar with the general idea),
so the implementation should be easy to understand and maintain.

Why add it to both SciPy and NumPy? Ideally, this would just be in
NumPy, and SciPy could start using it reliably once the NumPy version
with the enhancement becomes the oldest supported NumPy version. But
the current SciPy implementation of random variate generation when 0
<= a <= 1 is very slow, so I would like to also have this available in
SciPy sooner than later. When the NumPy version with the enhancement
is the oldest version supported by SciPy, SciPy can remove its
implementation and use NumPy.

What do you think?

Warren
_______________________________________________
NumPy-Discussion mailing list -- [email protected]
To unsubscribe send an email to [email protected]
https://mail.python.org/mailman3//lists/numpy-discussion.python.org
Member address: [email protected]

Reply via email to