[ Sorry for the self-reply ] On May11, 2012, at 13:17 , Florian Pflug wrote: > Actually, thinking more about this, if the approach sketched above > turns out to work, then one might be able to get away without remembering > previously computed TIDs, thus removing a lot of complexity. > > Say, for example, you simply start out with a single random TID tid[0]. > The you produce the sequence of "random" TIDs by setting > > tid[n+1] = tid[n] + random(1 <= x <= 2*D-1) > > where D is the expected distance between one TID from the sample set > and the next higher one, i.e. D = 2 * #TIDs / N. (You'd simply stop once > tid[n] >) #TIDs). This will give you approximately uniformly distributed > TIDs, I think, and will even return them in physical order. The 100$ question > is, by how much does this violate the independence requirement, i.e. how > far are P(tuple X picked) and P(tuple X picked | tuple Y picked) apart? > Some search through the literature should be able to answer that.
Actually, one can easily do better then that. Say you've determined that to pick each tuple with probability p_tup, you need to pick each TID with probability p_tid (where p_tup/p_tid is the TID density, i.e. the probability of a single TID being live). Now, looping over all TIDs and picking each with probability p_tid is, of course, just a another (more error-prone) way of iterating over all tuples and picking each with probability p_tup. But instead of looping, you can jump from from picked TID to the next. Observe that, after having picked tid X, the probability of picking X+i as the next tid is (1 - p_tid)^(i-1) * p_tid, because to pick X+i next you have to skip (i-1) TIDs and then pick the i-th one. Thus, the following algorithm returns a sorted list of TIDs which should be uniformly distributed and independent (or so I think, at least) 1) Starting with a random tid tid[0], chosen such that P(tid[0] = i) = (1 - p_tid)^i * p_tid 2) Setting tid[n+1] = tid[n] + d, which d > 0 chosen such that P(d = i) = (1 - p_tid)^(i-1) * p_tid 3) Abort if max(tid) is >= #TIDs. This all hinges on the ability to produce a sufficient accurate estimate of the TID density p_tup/p_tid, of course. best regards, Florian Pflug -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers