I certainly don't claim that it is appropriate to force customers into a
full analysis, particularly if random sampling versus a full scan of the
data reveals little to no performance differences in the plans.  Being
able to sample accurately is _very nice for large tables.

For our testing purposes, however, consistent results are extremely
important. We have observed that small difference in one plan for one of
22 queries can cause a difference in the DBT-3 results.  If this
happens, a small change in performance runs between two Linux kernels
may appear to be due to the kernels, when in fact it is due to the plan

We know that the plans are _exactly the same if the data in the
pg_statistics table is the same from run to run (all other things being
equal).  So what we need to have is identical optimizer costs
(pg_statistics) for the same table data for each.

I feel certain that the pg_statistics table will be identical from run
to run if analyze looks at every row.   Thus our hope to find a way to
get that.

We did runs over night.  We can confirm that VACUUM FULL ANALYZE does
not produce the same pg_statistics run to run.  With the default (10)
default_statistics_target the plans are also different.

We ran additional tests with default_statistics_target set to 1000 (the
max I believe).  The plans are the same over the different runs, but the
pg_statistics table has different cost values.  The performance results
of the runs are consistent (we would expect this with the same plans). 
The resulting performance metrics are similar to the best plans we see
using the default histogram size (good news).

However, we worry that one day the cost will change enough for whatever
reason to cause a plan change, especially for a larger database scale
factor (database size/row size). 

I know we appear to be an isolated case, but customers also do testing
and may have the same consistency issues we have.  I can also imagine
cases where customers want to guarantee that plans stay the same
(between replicated sites, for example).  If two developers are
analyzing changes to the optimizer, don't you want the costs used for
testing on their two systems to be identical for comparison purposes?

Anyway, IMHO I believe that an option for an ANALYZE FULL ("sampling"
all rows) would be valuable.   Any other ideas for how to force this
without code change are very welcome.  

Thanks for your info!

On Thu, 2003-09-04 at 16:16, Tom Lane wrote:
> Mary Edie Meredith <[EMAIL PROTECTED]> writes:
> > Stephan Szabo kindly responded to our earlier queries suggesting we look
> > at default_statistics_target and ALTER TABLE ALTER COLUMN SET
> > These determine the number of bins in the histogram for a given column. 
> > But for a large number of rows (for example 6 million) the maximum value
> > (1000) does not guarantee that ANALYZE will do a full scan of the table.
> > We do not see a way to guarantee the same statistics run to run without
> > forcing ANALYZE to examine every row of every table.  
> Do you actually still have a problem with the plans changing when the
> stats target is above 100 or so?  I think the notion of "force ANALYZE
> to do a full scan" is inherently wrongheaded ... it certainly would not
> produce numbers that have anything to do with ordinary practice.
> If you have data statistics that are so bizarre that the planner still
> gets things wrong with a target of 1000, then I'd like to know more
> about why.
>                       regards, tom lane
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
Mary Edie Meredith <[EMAIL PROTECTED]>
Open Source Development Lab

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])

Reply via email to