> -----Original Message----- > From: Josh Berkus [mailto:[EMAIL PROTECTED] > Sent: Wednesday, April 27, 2005 10:25 AM > To: Andrew Dunstan > Cc: Mischa Sandberg; pgsql-perform; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] [PERFORM] Bad n_distinct estimation; hacks > suggested? > > [...] > Actually, it's more to characterize how large of a sample > we need. For example, if we sample 0.005 of disk pages, and > get an estimate, and then sample another 0.005 of disk pages > and get an estimate which is not even close to the first > estimate, then we have an idea that this is a table which > defies analysis based on small samples.
I buy that. > Wheras if the two estimates are < 1.0 stdev apart, we can > have good confidence that the table is easily estimated. I don't buy that. A negative indication is nothing more than proof by contradiction. A positive indication is mathematical induction over the set, which in this type of context is logically unsound. There is no reason to believe that two small samples with a small difference imply that a table is easily estimated rather than that you got unlucky in your samples. > [...] > Yes, actually. We need 3 different estimation methods: > 1 for tables where we can sample a large % of pages > (say, >= 0.1) > 1 for tables where we sample a small % of pages but are > "easily estimated" > 1 for tables which are not easily estimated by we can't > afford to sample a large % of pages. I don't buy that the first and second need to be different estimation methods. I think you can use the same block sample estimator for both, and simply stop sampling at different points. If you set the default to be a fixed number of blocks, you could get a large % of pages on small tables and a small % of pages on large tables, which is exactly how you define the first two cases. However, I think such a default should also be overridable to a % of the table or a desired accuracy. Of course, I would recommend the distinct sample technique for the third case. > If we're doing sampling-based estimation, I really don't > want people to lose sight of the fact that page-based random > sampling is much less expensive than row-based random > sampling. We should really be focusing on methods which > are page-based. Of course, that savings comes at the expense of having to account for factors like clustering within blocks. So block sampling is more efficient, but can also be less accurate. Nonetheless, I agree that of the sampling estimators, block sampling is the better technique. __ David B. Held Software Engineer/Array Services Group 200 14th Ave. East, Sartell, MN 56377 320.534.3637 320.253.7800 800.752.8129 ---------------------------(end of broadcast)--------------------------- TIP 5: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq