Re: [HACKERS] Gsoc2012 idea, tablesample

Florian Pflug Fri, 11 May 2012 06:55:24 -0700

[ Sorry for the self-reply ]

On May11, 2012, at 13:17 , Florian Pflug wrote:
> Actually, thinking more about this, if the approach sketched above
> turns out to work, then one might be able to get away without remembering
> previously computed TIDs, thus removing a lot of complexity.
> 
> Say, for example, you simply start out with a single random TID tid[0].
> The you produce the sequence of "random" TIDs by setting
> 
>  tid[n+1] = tid[n] + random(1 <= x <= 2*D-1)
> 
> where D is the expected distance between one TID from the sample set
> and the next higher one, i.e. D = 2 * #TIDs / N. (You'd simply stop once
> tid[n] >) #TIDs). This will give you approximately uniformly distributed
> TIDs, I think, and will even return them in physical order. The 100$ question
> is, by how much does this violate the independence requirement, i.e. how
> far are P(tuple X picked) and P(tuple X picked | tuple Y picked) apart?
> Some search through the literature should be able to answer that.


Actually, one can easily do better then that. Say you've determined that
to pick each tuple with probability p_tup, you need to pick each TID with
probability p_tid (where p_tup/p_tid is the TID density, i.e. the probability
of a single TID being live). Now, looping over all TIDs and picking each with
probability p_tid is, of course, just a another (more error-prone) way of
iterating over all tuples and picking each with probability p_tup. But instead
of looping, you can jump from from picked TID to the next. Observe that, after
having picked tid X, the probability of picking X+i as the next tid is
(1 - p_tid)^(i-1) * p_tid, because to pick X+i next you have to skip (i-1) TIDs
and then pick the i-th one. Thus, the following algorithm returns a sorted list
of TIDs which should be uniformly distributed and independent (or so I think,
at least)

  1) Starting with a random tid tid[0], chosen such that
       P(tid[0] = i) = (1 - p_tid)^i * p_tid

  2) Setting tid[n+1] = tid[n] + d, which d > 0 chosen such that
       P(d = i) = (1 - p_tid)^(i-1) * p_tid

  3) Abort if max(tid) is >= #TIDs.

This all hinges on the ability to produce a sufficient accurate estimate of the
TID density p_tup/p_tid, of course.

best regards,
Florian Pflug


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Gsoc2012 idea, tablesample

Reply via email to