On 09/24/2014 10:45 AM, Fabien COELHO wrote:
Currently these distributions are achieved by mapping a continuous
function onto integers, so that neighboring integers get similar
numbers of draws, say with size=7:
#draws    10  6  3  1  0  0  0   // some exponential distribution
int drawn  0  1  2  3  4  5  6
Although an exponential distribution of accesses on tuples is quite
reasonable, so much correlation between neighboring values is not
realistic at all. You need some additional shuffling to get there.
I don't understand what that pseudo-random stage you're talking about is. Can
you elaborate?
The pseudo-random stage is just a way to scatter the values. A basic
approach to achieve this is "i' = (i * large-prime) % size", if you have a
modulo operator. For instance, with prime=5 you may get something like:
#draws    10  6  3  1  0  0  0
int drawn  0  1  2  3  4  5  6   (i)
scattered  0  5  3  1  6  4  2   (i' = 5 i % 7)
So the distribution becomes:
#draws    10  1  0  3  0  6  0
scattered  0  1  2  3  4  5  6
Which is more interesting from a testing perspective because it removes
the neighboring value correlation.
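
The scattering step above can be checked with a small sketch. This is only an illustration in Python, not pgbench code; the names `scatter` and `multiplier` are made up for the example. It assumes the multiplier shares no common factor with size, which is what makes the mapping one-to-one.

```python
def scatter(i: int, size: int, multiplier: int) -> int:
    """Map i in [0, size) to (i * multiplier) % size."""
    return (i * multiplier) % size

size = 7
multiplier = 5  # assumed coprime with size, so no two inputs collide
draws = [10, 6, 3, 1, 0, 0, 0]  # draws per original integer i

# Scattered position of each original integer.
scattered = [scatter(i, size, multiplier) for i in range(size)]

# Re-index the draw counts by scattered position.
redistributed = [0] * size
for i, count in enumerate(draws):
    redistributed[scattered[i]] = count

print(scattered)      # [0, 5, 3, 1, 6, 4, 2]
print(redistributed)  # [10, 1, 0, 3, 0, 6, 0]
```

With a multiplier that does share a factor with size (say 2 with size=6), several integers would land on the same position and others would never be hit.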
Depends on what you're testing. Yeah, shuffling like that makes sense
for a primary key. Or not: very often, recently inserted rows are also
queried more often, so that there is indeed a strong correlation between
the integer key and the access frequency. Or imagine that you have a
table that stores the height of people in centimeters. To populate that,
you would want to use a Gaussian-distributed variable, without shuffling.
For shuffling, perhaps we should provide a pgbench function or operator
that does that directly, instead of having to implement it using * and
%. Something like hash(x, min, max), where x is the input variable
(Gaussian-distributed, or whatever you want), and min and max are the
range to map it to.
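
To make the idea concrete, here is a rough sketch of what such a function might compute. It is purely hypothetical: no such pgbench function is specified in this thread, the name `hash_to_range` and the choice of SHA-256 are assumptions for illustration, and it is written in Python rather than pgbench script syntax.

```python
import hashlib

def hash_to_range(x: int, lo: int, hi: int) -> int:
    """Deterministically map integer x to a scattered value in [lo, hi].

    Hypothetical helper: the hash function used here is an arbitrary choice.
    """
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    digest = hashlib.sha256(str(x).encode("ascii")).digest()
    h = int.from_bytes(digest[:8], "big")  # first 64 bits as an unsigned integer
    return lo + h % (hi - lo + 1)

# Neighboring inputs end up at unrelated positions in the range.
for x in range(5):
    print(x, hash_to_range(x, 1, 100))
```

Unlike the multiply-and-modulo approach, a hash is not one-to-one: two different inputs can map to the same output, so some values in the range may be hit by several inputs and others by none.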
I must say that I'm appalled by a decision process which leads to such
results, with significant patches passed, and the tiny complement to make
it really useful (I mean not on paper or on the feature list, but in
real life) is rejected...
The idea of a modulo operator was not rejected; we'd just like to have
the infrastructure in place first.
- Heikki
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)