On 09/24/2014 10:45 AM, Fabien COELHO wrote:
Currently these distributions are achieved by mapping a continuous
function onto integers, so that neighboring integers get neighboring
number of draws, say with size=7:

    #draws     10 6 3 1 0 0 0  // some exponential distribution
    int drawn   0 1 2 3 4 5 6

Although having an exponential distribution of accesses on tuples is quite
reasonable, it is not realistic at all that there would be so much
correlation between neighboring values. You need some additional
shuffling to get there.

I don't understand what that pseudo-random stage you're talking about is. Can
you elaborate?

The pseudo-random stage is just a way to scatter the values. A basic
approach to achieve this is "i' = (i * large-prime) % size", if you have a
modulo operator. For instance with prime=5 you may get something like:

    #draws     10 6 3 1 0 0 0
    int drawn   0 1 2 3 4 5 6 (i)
    scattered   0 5 3 1 6 4 2 (i' = 5 i % 7)

So the distribution becomes:

    #draws     10 1 0 3 0 6 0
    scattered   0 1 2 3 4 5 6

Which is more interesting from a testing perspective because it removes
the neighboring value correlation.
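
As a minimal sketch of the scattering step above (Python is used here only
for illustration; the scatter helper and the variable names are hypothetical
and not part of pgbench), the following reproduces the size=7, prime=5
table. It assumes the multiplier shares no common factor with size, since
otherwise the mapping is not a permutation and distinct values collide.

```python
from math import gcd


def scatter(i, prime, size):
    """Map i in [0, size) to (i * prime) % size (hypothetical helper)."""
    # Assumption: prime and size are coprime, so the mapping is a permutation.
    if gcd(prime, size) != 1:
        raise ValueError("prime and size must be coprime")
    return (i * prime) % size


# Number of draws per integer before scattering (the example from the text).
draws = [10, 6, 3, 1, 0, 0, 0]
size = len(draws)
prime = 5

# Move each draw count to the scattered position of its integer.
scattered_draws = [0] * size
for i, count in enumerate(draws):
    scattered_draws[scatter(i, prime, size)] = count

print(scattered_draws)  # [10, 1, 0, 3, 0, 6, 0]
```

The per-value draw counts are unchanged; only the positions they land on
differ, which is what removes the correlation between neighbors.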

Depends on what you're testing. Yeah, shuffling like that makes sense for a primary key. Or not: very often, recently inserted rows are also queried more often, so that there is indeed a strong correlation between the integer key and the access frequency. Or imagine that you have a table that stores the height of people in centimeters. To populate that, you would want to use a Gaussian-distributed variable, without shuffling.

For shuffling, perhaps we should provide a pgbench function or operator that does that directly, instead of having to implement it using * and %. Something like hash(x, min, max), where x is the input variable (Gaussian-distributed, or whatever you want), and min and max are the range to map it to.
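
To illustrate the hash(x, min, max) idea, here is a rough sketch in Python.
Nothing in this thread specifies how such a function would be implemented;
the 64-bit FNV-1a mixing and the name hash_to_range are assumptions chosen
only for illustration. Unlike the multiply-and-modulo scatter, a hash is
generally not a permutation, so two inputs may map to the same output.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
FNV_OFFSET = 0xCBF29CE484222325  # 64-bit FNV-1a offset basis
FNV_PRIME = 0x100000001B3        # 64-bit FNV-1a prime


def hash_to_range(x, lo, hi):
    """Hash integer x and map the result into [lo, hi] (hypothetical)."""
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    h = FNV_OFFSET
    # Mix the 8 little-endian bytes of x, FNV-1a style: xor, then multiply.
    for byte in (x & MASK64).to_bytes(8, "little"):
        h ^= byte
        h = (h * FNV_PRIME) & MASK64
    # Fold the 64-bit hash into the requested inclusive range.
    return lo + h % (hi - lo + 1)


# Neighboring inputs are mapped deterministically into the range 1..100.
for x in range(5):
    print(x, hash_to_range(x, 1, 100))
```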

I must say that I'm appalled by a decision process which leads to such
results, with significant patches passed, while the tiny complement that
would make them really useful (I mean not on paper or on the feature list,
but in real life) is rejected...

The idea of a modulo operator was not rejected, we'd just like to have the infrastructure in place first.

- Heikki


