Hello Alik,
So I would be in favor of expanding the documentation but not
restricting the parameter beyond avoiding value 1.0.
I have removed restriction and expanded documentation in attaching patch v5.
I've done some math investigations, which consisted in spending one hour
with Christian, a statistician colleague of mine. He took an old book out
of a shelf, opened it to page 550 (roughly in the middle), and explained
to me how to build a real zipfian distribution random generator.
The iterative method is for parameter a>1 and works for unbounded values.
It is simple to add a bound. In practice the iterative method is quite
effective, i.e. number of iterations is typically small, at least if the
bound is large and if parameter a is not too close to 1.
I've attached a python3 script which implements the algorithm. It looks
like magic. Beware that a C implementation should take care of float and
int overflows.
# usage: a, #values, #tests
sh> zipf.py 1.5 1000 1000000
# after 1.7 seconds
c = [391586, 138668, 75525, 49339, 35222, 26621, ...
... 11, 13, 12, 11, 16] (1.338591 iterations per draw)
sh> zipf.py 1.1 1000 1000000
# after 3.1 seconds
c = [179302, 83927, 53104, 39015, 30557, 25164, ...
... 82, 95, 93, 81, 80] (2.681451 iterations per draw)
I think that this method should be used for a>1, and the other very rough
one can be kept for parameter a in [0, 1), a case which does not make much
sense to a mathematician as it diverges if unbounded.
--
Fabien.
#! /usr/bin/env python3
#
# generate Zipf distribution
#
# method taken from:
# Luc Devroye,
# "Non-Uniform Random Variate Generation"
# p. 550-551.
# Springer 1986
#
# the method works for an infinite bound, the finite bound condition has been
# added.
a = 1.1
N = 1000000
M = 1
import sys
if len(sys.argv) >= 3:
a = float(sys.argv[1])
N = int(sys.argv[2])
if len(sys.argv) >= 4:
M = int(sys.argv[3])
# beware, a close to 1 and n small (eg 100) leads to large number of iterations
# i.e. rejection probability is high when a -> 1
# - 1.001: 280
# - 1.002: 139.2
# - 1.005: 55.9
# - 1.010: 28.4
# - 1.020: 14.8
# - 1.050: 6.2
# - 1.100: 3.5
# however if n is larger the number of iterations decreases significantly
from random import random
from math import exp
def zipfgen(a, N):
assert a > 1.0, "a must be greater than 1"
b = 2.0 ** (a - 1.0)
i = 0 # count iterations
while True:
i += 1
u, v = random(), random()
try:
x = int(u ** (- 1.0 / (a - 1.0)))
t = (1.0 + 1.0 / x) ** (a - 1.0)
# reject if too large or out of bound
if v * x * (t - 1.0) / (b - 1.0) <= t / b and x <= N:
break
except OverflowError: # on u ** ...
pass
return (x, i)
if M == 1:
x, i = zipfgen(a, N)
print("X = %d (%d)" % (x, i))
else:
c = [0 for i in range(0, N)]
cost = 0
for i in range(0, M):
x, i = zipfgen(a, N)
# assert 1 <= x and x <= N, "x = %d" % x
cost += i
c[x-1] += 1
print("c = %s (%f iterations per draw)" % (c, cost/M))
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers