> -----Original Message----- > From: Andrew Dunstan [mailto:[EMAIL PROTECTED] > Sent: Monday, April 25, 2005 3:43 PM > To: josh@agliodbs.com > Cc: pgsql-perform; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks > suggested? > > Josh Berkus wrote: > > >Simon, Tom: > > > >While it's not possible to get accurate estimates from a > >fixed size sample, I think it would be possible from a > >small but scalable sample: say, 0.1% of all data pages on > >large tables, up to the limit of maintenance_work_mem.

## Advertising

Note that the results obtained in the cited paper were obtained from samples of 5 and 10%. It should also warrant caution that the authors don't offer any proofs of confidence bounds, even for the "average" case. > [...] > After some more experimentation, I'm wondering about some > sort of adaptive algorithm, a bit along the lines suggested > by Marko Ristola, but limited to 2 rounds. One path might be to use the published algorithm and simply recompute the statistics after every K blocks are sampled, where K is a reasonably small number. If it looks like the statistics are converging on a value, then take a few more samples, check against the trend value and quit. Otherwise continue until some artificial limit is reached. > The idea would be that we take a sample (either of fixed > size, or some small proportion of the table), see how well > it fits a larger sample (say a few times the size of the > first sample), and then adjust the formula accordingly to > project from the larger sample the estimate for the full > population. Math not worked out yet - I think we want to > ensure that the result remains bounded by [d,N]. The crudest algorithm could be something like the Newton- Ralphson method for finding roots. Just adjust the predicted value up or down until it comes within an error tolerance of the observed value for the current sample. No need to choose powers of 2, and I would argue that simply checking every so often on the way to a large sample that can be terminated early is more efficient than sampling and resampling. Of course, the crude algorithm would almost certainly be I/O bound, so if a more sophisticated algorithm would give a better prediction by spending a few more CPU cycles on each sample block gathered, then that seems like a worthwhile avenue to pursue. As far as configuration goes, the user is most likely to care about how long it takes to gather the statistics or how accurate they are. So it would probably be best to terminate the sampling process on a user-defined percentage of the table size and the minimum error tolerance of the algorithmic prediction value vs. the computed sample value. If someone wants a fast and dirty statistic, they set the row percent low and the error tolerance high, which will effectively make the blocks read the limiting factor. If they want an accurate statistic, they set the row percent as high as they feel they can afford, and the error tolerance as low as they need to in order to get the query plans they want. __ David B. Held Software Engineer/Array Services Group 200 14th Ave. East, Sartell, MN 56377 320.534.3637 320.253.7800 800.752.8129 ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly