Hello Alik,

So I would be in favor of expanding the documentation but not restricting the parameter beyond avoiding value 1.0.

I have removed the restriction and expanded the documentation in the attached patch v5.

I've done some math investigation, which consisted of spending an hour with Christian, a statistician colleague of mine. He took an old book off a shelf, opened it to page 550 (roughly in the middle), and explained to me how to build a real Zipfian distribution random generator.

The iterative method is for parameter a > 1 and works for unbounded values. It is simple to add a bound. In practice the iterative method is quite effective, i.e. the number of iterations is typically small, at least if the bound is large and if parameter a is not too close to 1.

I've attached a python3 script which implements the algorithm. It looks like magic. Beware that a C implementation should take care of float and int overflows.

  # usage: a, #values, #tests

  sh> zipf.py 1.5 1000 1000000
  # after 1.7 seconds
  c = [391586, 138668, 75525, 49339, 35222, 26621, ...
       ... 11, 13, 12, 11, 16] (1.338591 iterations per draw)

  sh> zipf.py 1.1 1000 1000000
  # after 3.1 seconds
  c = [179302, 83927, 53104, 39015, 30557, 25164, ...
       ... 82, 95, 93, 81, 80] (2.681451 iterations per draw)
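As a sanity check on counts like these, here is a minimal sketch that computes the expected count of each value under a Zipf law truncated to 1..n, where value k has probability proportional to k^(-a). The function name `zipf_expected_counts` is hypothetical and is not part of the attached script.

```python
def zipf_expected_counts(a, n, draws):
    """Expected number of occurrences of each value 1..n over `draws` draws,
    for a Zipf law truncated to 1..n, i.e. p(k) proportional to k ** -a.
    (Hypothetical helper, not part of the attached zipf.py.)"""
    weights = [k ** -a for k in range(1, n + 1)]
    total = sum(weights)  # normalizing constant for the bounded case
    return [draws * w / total for w in weights]

# Same parameters as the first run above: a = 1.5, 1000 values, 1000000 draws.
expected = zipf_expected_counts(1.5, 1000, 1000000)
print([round(e) for e in expected[:6]])
```

For a = 1.5 the first expected count is roughly 392,000, which is in line with the 391586 observed above, and the ratio between the first two counts should be close to 2^1.5.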

I think that this method should be used for a > 1, and the other very rough one can be kept for parameter a in [0, 1), a case which does not make much sense to a mathematician, as the normalizing sum diverges if the values are unbounded.
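On the overflow caveat mentioned earlier, one possible way to avoid the huge intermediate value is to test the bound on u before raising it to a power. The sketch below is only an illustration under that assumption: the name `zipf_draw_guarded` and the early test are hypothetical and are not part of the attached script, although the acceptance test itself is the same one.

```python
from random import random

def zipf_draw_guarded(a, n, rng=random):
    """Draw one value in 1..n from a bounded Zipf law with parameter a > 1.
    Hypothetical variant: out-of-bound candidates are rejected before the
    power is computed, so the power can never overflow."""
    assert a > 1.0 and n >= 1
    b = 2.0 ** (a - 1.0)
    # x = floor(u ** (-1 / (a - 1))) is <= n exactly when u > (n + 1) ** -(a - 1)
    u_min = (n + 1.0) ** -(a - 1.0)
    while True:
        u, v = rng(), rng()
        if u <= u_min:
            continue  # candidate would exceed n (also covers u == 0.0)
        x = int(u ** (-1.0 / (a - 1.0)))
        if x > n:
            continue  # guard against rounding right at the boundary
        t = (1.0 + 1.0 / x) ** (a - 1.0)
        # same acceptance test as in the attached script
        if v * x * (t - 1.0) / (b - 1.0) <= t / b:
            return x

print([zipf_draw_guarded(1.5, 1000) for _ in range(10)])
```

A C implementation could apply the same idea: compare u against a precomputed threshold first, and only then convert the power to an integer.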

--
Fabien.
#! /usr/bin/env python3
#
# generate Zipf distribution
#
# method taken from:
#   Luc Devroye,
#  "Non-Uniform Random Variate Generation"
#  p. 550-551.
#  Springer 1986
#
# the method works for an infinite bound, the finite bound condition has been
# added.

a = 1.1
N = 1000000
M = 1

import sys
if len(sys.argv) >= 3:
    a = float(sys.argv[1])
    N = int(sys.argv[2])
if len(sys.argv) >= 4:
    M = int(sys.argv[3])

# Beware: a close to 1 combined with a small N (e.g. 100) leads to a large
# number of iterations, i.e. the rejection probability is high when a -> 1.
# Number of iterations, by value of a:
# - 1.001: 280
# - 1.002: 139.2
# - 1.005:  55.9
# - 1.010:  28.4
# - 1.020:  14.8
# - 1.050:   6.2
# - 1.100:   3.5
# However, if N is larger, the number of iterations decreases significantly.

from random import random

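def zipfgen(a, N):
    """Draw one value in 1..N from a Zipf distribution with parameter a > 1.

    Returns (x, i): the value drawn and the number of iterations it took.
    """
    assert a > 1.0, "a must be greater than 1"
    b = 2.0 ** (a - 1.0)
    i = 0 # count iterations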
    while True:
        i += 1
        u, v = random(), random()
        try:
            x = int(u ** (- 1.0 / (a - 1.0)))
            t = (1.0 + 1.0 / x) ** (a - 1.0)
            # reject if too large or out of bound
            if v * x * (t - 1.0) / (b - 1.0) <= t / b and x <= N:
                break
        except (OverflowError, ZeroDivisionError): # on u ** ... (tiny u, or u == 0.0)
            pass
    return (x, i)

if M == 1:
    x, i = zipfgen(a, N)
    print("X = %d (%d)" % (x, i))
else:
    c = [0] * N # c[k-1] counts how many times value k was drawn
    cost = 0
    for _ in range(M):
        x, i = zipfgen(a, N)
        # assert 1 <= x and x <= N, "x = %d" % x
        cost += i
        c[x-1] += 1
    print("c = %s (%f iterations per draw)" % (c, cost/M))