On 08/12/13 12:27, Mark Kirkwood wrote:
On 08/12/13 09:25, Josh Berkus wrote:
On 12/07/2013 11:46 AM, Robert Haas wrote:
Maybe there's some highly-principled statistical approach which could
be taken here, and if so that's fine, but I suspect not.  So what I
think we should do is auto-tune the statistics target based on the
table size.  If, say, we think that the generally useful range for the
statistics target is something like 10 to 400, then let's come up with
a formula based on table size that outputs 10 for small tables, 400
for really big tables, and intermediate values for tables in the
middle.
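Purely for illustration (this is not a concrete proposal, and the
constants are invented), one shape such a formula could take is a
logarithmic ramp clamped to the 10..400 range, keyed off the relation's
block count:

#include <math.h>

/*
 * Illustrative only: map relation size (in pages) onto a statistics
 * target between 10 and 400.  The constants are made up; the point is
 * just that a log-scaled ramp grows quickly for small tables and
 * flattens out for huge ones.
 */
static int
auto_stats_target(double rel_pages)
{
    double      target = 10.0 + 30.0 * log2(1.0 + rel_pages / 1000.0);

    if (target < 10.0)
        target = 10.0;
    if (target > 400.0)
        target = 400.0;
    return (int) target;
}

With those numbers a 1000-page table gets ~40, a million-page table
about 310, and anything much larger clamps at 400.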

The only approach which makes sense is to base it on a % of the table.
In fact, pretty much every paper which has examined statistics
estimation for database tables has determined that any estimate based on
a less-than-5% sample is going to be wildly inaccurate.  Not that 5%
samples are 100% accurate, but at least they fit the 80/20 rule.

This is the reason why implementing block-based sampling is critical;
using our current "take one row out of every page" method, sampling 5%
of the table means scanning the whole thing in most tables.  We also
need to decouple the number of MCVs we keep from the sample size.
Certainly our existing sampling algo seems designed to maximize IO for
the sample size.
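To put a number on that: assuming roughly 100 rows per 8 kB page (just
an assumption for the arithmetic), a uniform 5% row-level sample leaves
a given page with no sampled row only with probability 0.95^100, i.e.
well under 1%, so nearly every page gets read anyway. A quick check:

#include <math.h>
#include <stdio.h>

/*
 * Fraction of pages touched when each row is sampled independently with
 * probability p -- a simplification of any real sampler, used only to
 * show why row-level sampling at 5% reads almost every page.
 */
int
main(void)
{
    double      rows_per_page = 100.0;      /* assumed average */
    double      p = 0.05;                   /* 5% row sample */

    printf("fraction of pages touched: %.3f\n",
           1.0 - pow(1.0 - p, rows_per_page));      /* ~0.994 */
    return 0;
}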

There are other qualitative improvements we could make, which Nathan Boley
has spoken about.  For example, our stats code has no way to recognize a
normal or exponential distribution -- it assumes that all columns are
randomly distributed.  If we could recognize common distribution
patterns, then not only could we have better query estimates, those
would require keeping *fewer* stats, since all you need for a normal
distribution are the end points and the variance.
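As a sketch of what recognizing a distribution could buy (not existing
PostgreSQL code, and strictly speaking it is the mean and variance that
pin down a normal): the selectivity of "col <= x" could then be
estimated from two stored numbers via the normal CDF instead of a
histogram:

#include <math.h>

/*
 * Illustrative only: selectivity of "col <= x" for a column assumed to
 * be normally distributed with the given mean and variance, using the
 * normal CDF written in terms of erf().
 */
static double
normal_le_selectivity(double x, double mean, double variance)
{
    double      stddev = sqrt(variance);

    if (stddev <= 0.0)
        return (x >= mean) ? 1.0 : 0.0;
    return 0.5 * (1.0 + erf((x - mean) / (stddev * sqrt(2.0))));
}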


From src/backend/commands/analyze.c:

 * As of May 2004 we use a new two-stage method:  Stage one selects up
 * to targrows random blocks (or all blocks, if there aren't so many).
 * Stage two scans these blocks and uses the Vitter algorithm to create
 * a random sample of targrows rows (or less, if there are less in the
 * sample of blocks).  The two stages are executed simultaneously: each
 * block is processed as soon as stage one returns its number and while
 * the rows are read stage two controls which ones are to be inserted
 * into the sample.

I don't think we always read every block (it's been a while since I
looked at this code, so I'll recheck). From what I understand, this is
based on reasonable research (the Vitter algorithm). Not to say it
couldn't or shouldn't be looked at again and improved - but it was not
just dreamed up with no valid research behind it!
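For anyone who hasn't seen it, the row-level stage is reservoir sampling
in Vitter's style; a stripped-down sketch of the underlying idea (the
naive form, not the skip-ahead Algorithm Z that analyze.c actually
implements, and ignoring modulo bias):

#include <stdlib.h>

/*
 * Naive reservoir sampling: keep a uniform random sample of targrows
 * items from a stream of unknown length.  Call once per row; nseen is
 * the number of rows already seen before this one.
 */
static void
reservoir_add(int *sample, int targrows, int row, long nseen)
{
    if (nseen < targrows)
        sample[nseen] = row;            /* still filling the reservoir */
    else
    {
        long        k = rand() % (nseen + 1);

        if (k < targrows)       /* replace with prob. targrows/(nseen+1) */
            sample[k] = row;
    }
}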


Since I volunteered to recheck :-)

from analyze.c again:


/*--------------------
 * The following choice of minrows is based on the paper
 * "Random sampling for histogram construction: how much is enough?"
 * by Surajit Chaudhuri, Rajeev Motwani and Vivek Narasayya, in
 * Proceedings of ACM SIGMOD International Conference on Management
 * of Data, 1998, Pages 436-447.  Their Corollary 1 to Theorem 5
 * says that for table size n, histogram size k, maximum relative
 * error in bin size f, and error probability gamma, the minimum
 * random sample size is
 *      r = 4 * k * ln(2*n/gamma) / f^2
 * Taking f = 0.5, gamma = 0.01, n = 10^6 rows, we obtain
 *      r = 305.82 * k
 * Note that because of the log function, the dependence on n is
 * quite weak; even at n = 10^12, a 300*k sample gives <= 0.66
 * bin size error with probability 0.99.  So there's no real need to
 * scale for n, which is a good thing because we don't necessarily
 * know it at this point.
 *--------------------
 */
stats->minrows = 300 * attr->attstattarget;
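The two numbers quoted in that comment check out; re-deriving them from
the formula (plain arithmetic, nothing PostgreSQL-specific):

#include <math.h>
#include <stdio.h>

/*
 * Recompute the values in the analyze.c comment from
 * r = 4 * k * ln(2*n/gamma) / f^2  (Chaudhuri, Motwani, Narasayya).
 */
int
main(void)
{
    double      gamma = 0.01;

    /* f = 0.5, n = 10^6: required sample size per histogram bin, r/k */
    printf("r/k at n = 1e6:  %.2f\n",
           4.0 * log(2.0 * 1e6 / gamma) / (0.5 * 0.5));     /* 305.82 */

    /* fixed r = 300*k, n = 10^12: achievable relative bin-size error f */
    printf("f at n = 1e12:   %.2f\n",
           sqrt(4.0 * log(2.0 * 1e12 / gamma) / 300.0));    /* 0.66 */
    return 0;
}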


So for typical settings (default_statistics_target = 100), we try to acquire 30000 rows, which means we will sample up to 30000 blocks.

Indeed, quickly checking with a scale-100 pgbench db and a simple patch that makes the block sampler report how many blocks it reads (note that pgbench_accounts has 163935 blocks):

bench=# ANALYZE pgbench_branches;
NOTICE:  acquire sample will need 30000 blocks
NOTICE:  sampled 1 blocks
ANALYZE
Time: 16.935 ms

bench=# ANALYZE pgbench_accounts;
NOTICE:  acquire sample will need 30000 blocks
NOTICE:  sampled 30000 blocks
ANALYZE
Time: 10059.446 ms
bench=# \q


Regards

Mark

